Programming
This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
I'm fairly good at doing stuff in shell but I suck at using sed and awk, and I think I need somebody with more experience to show me the rest of the way to do this. See this command?
Code:
{ (cd /media/disk-1 && find); (cd /media/disk-2 && find); } | sort | uniq -d > filesystemcomparison
I want to use this on my Ubuntu system to compare the contents of two external hard drives and come up with a list of which files are duplicates, so that I can delete them. And it works, sort of. The problem is that uniq will only report duplicates that have the same directory structure. So it will turn up some dupes, for example /media/disk-1/Music/blindmelon.allthatineed.mp3 and /media/disk-2/Music/blindmelon.allthatineed.mp3, but if one of those copies is in, say, the root of disk 2, or really any directory other than Music, it won't show up in the output. I guess it appears to uniq as a different absolute filename, so uniq treats it as unique when it's actually a duplicate.

I have a feeling that something like sed will be necessary to cut the paths off the files before they go to uniq, but I'm not sure how to do it. And even if I did, it wouldn't be perfect, since then I'd have no indication of where the dupes are in the filesystems; their paths would no longer be part of the resulting text file. So I'm a little stumped, but a half-working version is better than none, unless somebody has an alternative method?
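One way to act on the hunch above, comparing only the file names at the cost of losing the paths, is to strip the directory part before sorting. A minimal sketch; the dupe_names wrapper and its arguments are made up for illustration, with awk's $NF (using '/' as the field separator) standing in for basename:

```shell
# Hypothetical helper: list file names that appear in both trees,
# ignoring where in the tree each one lives.
dupe_names() {
    { find "$1" -type f; find "$2" -type f; } |
    awk -F/ '{ print $NF }' |   # keep only the part after the last slash
    sort | uniq -d              # -d: print one copy of each repeated line
}

# e.g.: dupe_names /media/disk-1 /media/disk-2 > filesystemcomparison
```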
Ah okay, well this is great! Thanks for recommending basename, tex; it looks like it's working well. And thanks for the clarification on its syntax, cosmo; there's not a lot of documentation in the man pages for this one.
So about the second part of this, the problem of how to get back the full pathnames for these filenames I have. Any thoughts? The command finished running after about 15 minutes and left me with a nice list of 20,000 files. I could do something like run locate on them, right?
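Instead of locate (whose database can be stale), one option is to match the bare names back against a fresh find listing. A sketch, not from the thread; the paths_for_names helper and its arguments are invented for illustration:

```shell
# Hypothetical helper: given a file of bare names (one per line), print
# every path under the two trees whose listing contains one of them.
# Note: grep -F matches substrings, so very short names can over-match.
paths_for_names() {
    find "$2" "$3" -type f | grep -F -f "$1"
}

# e.g.: paths_for_names filesystemcomparison /media/disk-1 /media/disk-2
```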
'rev' reverses the order of the characters in each file name,
e.g. '/xyz/file2' becomes '2elif/zyx/'.
Sorting the reversed file names brings together those which have the same basename.
The sed command uses a 'sliding window' of two lines in the pattern space:
only adjacent lines with the same first field (delimited by '/') are printed.
If there are more than two files with the same basename, this generates some duplicate lines, which are removed with 'uniq'.
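The command itself isn't quoted above, but the steps it describes can be sketched like this, with awk standing in for the sed two-line sliding window (the pattern-space trick is fiddly to reproduce); the find_dupes wrapper and its arguments are illustrative:

```shell
# Sketch of the approach described above. After rev + sort, paths with
# the same basename are adjacent, and the basename is the first /-field.
find_dupes() {
    { find "$1" -type f; find "$2" -type f; } |
    rev |    # reverse each path so the basename comes first
    sort |   # reversed paths sharing a basename are now adjacent
    awk -F/ 'NR > 1 && $1 == prev { print prevline; print }
             { prev = $1; prevline = $0 }' |
    uniq |   # overlapping pairs print some lines twice; drop the repeats
    rev      # un-reverse, restoring the original full paths
}

# e.g.: find_dupes /media/disk-1 /media/disk-2
```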
Sweet, thanks a lot Kenhelm, I've got my pathnames now and everything, and the output file is fantastic. I'll be able to really trim down the filesystems now that I have this command.