Need help using awk/sed to make this command work

blood_ocean · 09-02-2009, 11:07 AM

I'm fairly good at doing stuff in shell but I suck at using sed and awk, and I think I need somebody with more experience to show me the rest of the way to do this. See this command?

Code:

{ ( cd /media/disk-1; find ); ( cd /media/disk-2; find ) } | sort | uniq -d > filesystemcomparison

I want to use this on my Ubuntu system to compare the contents of two external hard drives and come up with a list of duplicate files so that I can delete them. And it works, sort of. The problem is that uniq only reports duplicates that sit at the same relative path on both drives. So it will catch some dupes, for example /media/disk-1/Music/blindmelon.allthatineed.mp3 and /media/disk-2/Music/blindmelon.allthatineed.mp3, but if one of those copies is in, say, the root of disk-2, or really any directory other than Music, it won't show up in the output. I guess uniq sees a different path and treats the file as not a duplicate when it actually is.

I have a feeling that something like sed will be necessary to cut the paths off the file names before they go to uniq, but I'm not sure how to do it. Even then it wouldn't be perfect, since the resulting text file would no longer tell me where the dupes are in the filesystems. So I'm a little stumped, but a half-working version is better than none, unless somebody has an alternative method?

H_TeXMeX_H · 09-02-2009, 12:27 PM

Well, you can use 'basename' to get just the last component of a path, for example:

Code:

bash-3.1$ basename /usr/local/bin
bin
bash-3.1$ basename /usr/local /bin
local
bash-3.1$ basename "/usr/local /bin"
bin

blood_ocean · 09-02-2009, 01:21 PM

But I can't pipe that data to basename, can I?

TBC Cosmo · 09-02-2009, 01:26 PM

Use it like this:

Code:

{ ( cd /tmp/dir1; find -exec basename {} \; ); ( cd /tmp/dir2; find -exec basename {} \; ) } | sort | uniq -d > filesystemcomparison

The other half of the solution is getting the paths to those duplicate files.
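If starting one basename process per file turns out to be slow on a big drive, a single sed process can strip the directory part of every line instead. This is only a rough sketch: the names_only function and the throwaway demo directories are made up for illustration, it only looks at regular files (-type f), and it assumes no file name contains a newline.

```shell
#!/bin/sh
# names_only: hypothetical helper, prints the bare name of every
# regular file found under the directories given as arguments.
names_only() {
    # sed deletes everything up to and including the last '/' on each line
    find "$@" -type f | sed 's%.*/%%'
}

# Demo on two throwaway trees (made-up names, not the real disks)
demo=$(mktemp -d)
mkdir -p "$demo/dir1/Music" "$demo/dir2"
touch "$demo/dir1/Music/song.mp3" "$demo/dir2/song.mp3" "$demo/dir1/only-here.txt"

# Names that occur more than once, regardless of directory
dupes=$(names_only "$demo/dir1" "$demo/dir2" | sort | uniq -d)
echo "$dupes"

rm -rf "$demo"
```

Like the basename version, this only gives the names, not where the files live.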

blood_ocean · 09-02-2009, 02:33 PM

Ah okay, well this is great! Thanks for recommending basename, tex, it looks like it's working well. And thanks for clarifying the syntax, cosmo, there's not a lot of documentation in the man page for this one.

So about the second part of this, the problem of how to get back the full pathnames for these filenames I have. Any thoughts? The command finished running after about 15 minutes and left me with a nice list of 20,000 files. I could do something like run locate on them, right?

Kenhelm · 09-02-2009, 10:26 PM

This method uses GNU sed.
It keeps the full pathnames.

Code:

echo '
/ghj/nm/cvb/file1
/ghj/nm/cvb/file2
/ghj/nm/cvb/file3
/ghj/nm/cvb/file4
/ghj/nm/cvb/file5
/ghj/nm/cvb/file6
/abc/file4
/abc/file3
/qwe/rty/file4
/xyz/file2
/jkl/file3
/mno/file4' |
rev| sort| sed -nr ':a N;s%(^[^/]*/).*\n\1.*%&%p;D;$!ba'| uniq| rev

/ghj/nm/cvb/file2
/xyz/file2
/ghj/nm/cvb/file3
/abc/file3
/jkl/file3
/ghj/nm/cvb/file4
/abc/file4
/mno/file4
/qwe/rty/file4

'rev' reverses the order of the characters in each file name,
e.g. '/xyz/file2' becomes '2elif/zyx/'.
Sorting the reversed file names brings together those which have the same basename.
The sed command uses a 'sliding window' of two lines in the pattern space.
Only adjacent lines with the same first field (the reversed basename, delimited by '/') are printed.
If there are more than two files with the same basename, this prints some lines twice; 'uniq' removes the repeats.
The final 'rev' turns the names back the right way round.
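Since the thread title asks about awk as well, here is a rough awk sketch of the same idea: group the full paths by their last '/'-separated field and print only the groups with more than one member. The dupes_by_name function name is made up for illustration, and the sketch assumes the whole list fits in memory and that no file name contains a newline.

```shell
#!/bin/sh
# dupes_by_name: hypothetical helper; reads full paths on stdin and prints
# every path whose last '/'-separated field occurs more than once.
dupes_by_name() {
    awk -F/ '
        {
            count[$NF]++                       # times this basename was seen
            paths[$NF] = paths[$NF] $0 "\n"    # remember every full path for it
        }
        END {
            for (name in count)
                if (count[name] > 1)
                    printf "%s", paths[name]
        }
    '
}

# Small sample taken from the list above
result=$(printf '%s\n' /ghj/nm/cvb/file1 /ghj/nm/cvb/file2 /xyz/file2 /abc/file3 |
         dupes_by_name)
echo "$result"
```

On real drives the input would come from something like find /media/disk-1 /media/disk-2 -type f. awk does not promise any particular order between the groups, so pipe the result through sort if that matters.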

blood_ocean · 09-04-2009, 12:06 PM

Sweet, thanks a lot Kenhelm. I've got my pathnames now and everything, and the output file is fantastic. I'll be able to really trim down the filesystems now that I have this command.