[SOLVED] bash command or script to list files

sepi · 08-25-2009, 02:26 PM

Hello,

I need a command or script to
list all files recursively, without directories,
one line per file, with no extra lines like ls -AR1 prints.
It should print the file size and name,
e.g.:
12 file.ext
25684 file2.ext
589 file3.ext
...

catkin · 08-25-2009, 02:28 PM

Is this homework? Do you have any ideas? Have you tried anything?

sepi · 08-25-2009, 02:33 PM

No, it is not homework.

I have two volumes with mostly (but not exactly) the same files, but completely different directory structures. I need to know which files exist on only one volume, and which are dups.

I tried ls with awk, but every attempt printed extra lines.

catkin · 08-25-2009, 02:45 PM

How can you identify a file? Are names unique within each volume (file system?)? Or are there any files like /foo/bar and /goo/bar? If so, then what further characteristics, beyond the name, will be enough to uniquely identify a file -- size in bytes, modification time, checksum ... ?

Are you only dealing with "normal" files or do you have multiply-linked files, symlinks, device files, fifos ... ?

catkin · 08-25-2009, 02:47 PM

How many files, roughly, in total?

sepi · 08-25-2009, 03:06 PM

Hi,
files should be identified by name and size.
I don't need special files, and the volumes don't contain any anyway, just normal ones.
There are about 500k files in 900 GBytes on each volume; I think about 450k files are identical.
The directory structure is completely different.

catkin · 08-25-2009, 03:47 PM

Ouch! That's big! Performance will be significant and bash string manipulation is slow but I can't think how to handle whitespace in file names using awk (I'm not very proficient in awk so that doesn't mean it can't be done -- it almost certainly can). How about this for starters?

Code:

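#!/bin/bash
# Print "<size> <basename>" for every regular file below the current directory.
# --time-style=long-iso (GNU ls) pins the date to two fields, which the read below relies on.
find . -type f -exec /bin/ls -l --time-style=long-iso {} \; | while read -r x x x x size x x name
do
	echo "$size" "${name##*/}"
done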

Maybe it could be sped up by using xargs on the find. The output will need sorting ...
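
Another way to cut the runtime, assuming GNU find is available, might be to let find print the size and base name itself, so no ls process is started per file and no ls output has to be parsed. This is only a sketch: list_sizes is a made-up helper name, and -printf is a GNU extension that POSIX find does not have. File names containing newlines would still break the one-line-per-file format.

```shell
#!/bin/bash
# Sketch only: list_sizes is a hypothetical helper name.
# Assumes GNU find; -printf is not available in POSIX find.
# Prints "<size in bytes> <basename>" for every regular file below the
# directory given as $1 (default: current directory).
list_sizes() {
	# %s = file size in bytes, %f = file name with leading directories removed
	find "${1:-.}" -type f -printf '%s %f\n'
}
```

It could be called as, for example, list_sizes /path/to/volume > volume.txt, where the path and the output file name are placeholders.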

sepi · 08-25-2009, 04:15 PM

Thank you very much!
The solution is exactly what I need.
Runtime is not a problem; I've given it one core and it will run in the background.
Sorting is not necessary: the output will be imported into MySQL, then some simple queries should show the dups and diffs.
Thanks again!

PS:
runtime was about 45 min.
result is about 8 MB
fine
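
For anyone who would rather not import the lists into a database, the same dups-and-diffs comparison can probably be done with sort and comm. This is a rough sketch, not what was used in this thread: compare_lists and the output file names are made up for illustration, and it assumes each list has one "<size> <name>" line per file as produced above.

```shell
#!/bin/bash
# Sketch only: compare_lists and all output file names are hypothetical.
# $1, $2 = list files with one "<size> <name>" line per file
# $3     = existing directory to write the results into
compare_lists() {
	local a=$1 b=$2 out=$3
	# comm requires both inputs sorted in the same collation order,
	# so force the C locale for sort and comm alike.
	LC_ALL=C sort "$a" > "$out/a.sorted"
	LC_ALL=C sort "$b" > "$out/b.sorted"
	# -23 keeps lines only in the first list, -13 only in the second,
	# -12 keeps lines present in both (same size and same name).
	LC_ALL=C comm -23 "$out/a.sorted" "$out/b.sorted" > "$out/only_in_a.txt"
	LC_ALL=C comm -13 "$out/a.sorted" "$out/b.sorted" > "$out/only_in_b.txt"
	LC_ALL=C comm -12 "$out/a.sorted" "$out/b.sorted" > "$out/in_both.txt"
}
```

Since only name and size are compared, two different files that happen to share both would be reported as dups; a checksum would be needed to rule that out.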