LinuxQuestions.org
Old 09-02-2009, 11:07 AM   #1
blood_ocean
LQ Newbie
 
Registered: Sep 2009
Posts: 4

Rep: Reputation: 0
Need help using awk/sed to make this command work


I'm fairly good at doing stuff in shell but I suck at using sed and awk, and I think I need somebody with more experience to show me the rest of the way to do this. See this command?
Code:
{ ( cd /media/disk-1; find ); ( cd /media/disk-2; find ) } | sort | uniq -d > filesystemcomparison
I want to use this on my Ubuntu system to compare the contents of two external hard drives and come up with a list of duplicate files so that I can delete them. And it works, sort of. The problem is that uniq only reports duplicates that sit at the same relative path on both drives. So it will catch some dupes, for example /media/disk-1/Music/blindmelon.allthatineed.mp3 and /media/disk-2/Music/blindmelon.allthatineed.mp3, but if one of those copies is in, say, the root of disk-2, or really any directory other than Music, it won't show up in the output. To uniq the two lines have different relative path names, so it treats them as different files even though they are duplicates. I have a feeling that something like sed will be needed to strip the paths before the list goes to uniq, but I'm not sure how to do it. Even then it wouldn't be perfect, since with the paths gone the resulting text file wouldn't tell me where the dupes are in the filesystems. So I'm a little stumped, but a half-working version is better than none, unless somebody has an alternative method?
 
Old 09-02-2009, 12:27 PM   #2
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
Well, you can use 'basename' to get just the last component of a path, for example:

Code:
bash-3.1$ basename /usr/local/bin
bin
bash-3.1$ basename /usr/local /bin
local
bash-3.1$ basename "/usr/local /bin"
bin
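The second example above prints `local` rather than `bin` because an unquoted second argument is taken as a suffix to strip, not as part of the path. A small sketch of that behaviour follows; the `/tmp/...` paths are made up for illustration and do not need to exist:

```shell
# basename NAME [SUFFIX]: the optional second argument is a suffix to strip
# from the final path component; it is not a second path.
basename /usr/local/bin          # prints: bin
basename /tmp/song.mp3 .mp3      # hypothetical path; prints: song
basename "/tmp/my music/a.mp3"   # quoted, so the space stays inside one path; prints: a.mp3
```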
 
Old 09-02-2009, 01:21 PM   #3
blood_ocean
LQ Newbie
 
Registered: Sep 2009
Posts: 4

Original Poster
Rep: Reputation: 0
But I can't pipe that data to basename, can I?
 
Old 09-02-2009, 01:26 PM   #4
TBC Cosmo
Member
 
Registered: Feb 2004
Location: NY
Distribution: Fedora 10, CentOS 5.4, Debian 5 Sparc64
Posts: 356

Rep: Reputation: 43
Use it like this:
Code:
{ ( cd /tmp/dir1; find -exec basename {} \; ); ( cd /tmp/dir2; find -exec basename {} \; ) } | sort | uniq -d > filesystemcomparison
The other half of the solution is getting the paths to those duplicate files.
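Starting one `basename` process per file can be slow on large drives. As a possible alternative, and assuming GNU find (the default on Ubuntu), `-printf '%f\n'` prints each name with its leading directories removed. The function name `dup_names` is hypothetical, and limiting the search to regular files with `-type f` is my assumption about what is wanted here:

```shell
# dup_names DIR1 DIR2: print base names that occur more than once across both trees.
# Hypothetical helper; assumes GNU find for -printf.
dup_names() {
    # %f is the file name with any leading directories removed.
    find "$1" "$2" -type f -printf '%f\n' | sort | uniq -d
}
```

It could be called as `dup_names /media/disk-1 /media/disk-2 > filesystemcomparison`. Like the original command, it also reports a name that appears twice on the same drive, and it would be confused by file names containing newlines.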
 
Old 09-02-2009, 02:33 PM   #5
blood_ocean
LQ Newbie
 
Registered: Sep 2009
Posts: 4

Original Poster
Rep: Reputation: 0
Ah okay, well this is great! Thanks for recommending basename, tex, it looks like it's working well, and thanks for the clarification on the syntax, cosmo. There's not a lot of documentation in the man page for this one.

So about the second part of this, the problem of how to get back the full pathnames for these filenames I have. Any thoughts? The command finished running after about 15 minutes and left me with a nice list of 20,000 files. I could do something like run locate on them, right?
 
Old 09-02-2009, 10:26 PM   #6
Kenhelm
Member
 
Registered: Mar 2008
Location: N. W. England
Distribution: Mandriva
Posts: 360

Rep: Reputation: 170
This method uses GNU sed.
It keeps the full pathnames.
Code:
echo '
/ghj/nm/cvb/file1
/ghj/nm/cvb/file2
/ghj/nm/cvb/file3
/ghj/nm/cvb/file4
/ghj/nm/cvb/file5
/ghj/nm/cvb/file6
/abc/file4
/abc/file3
/qwe/rty/file4
/xyz/file2
/jkl/file3
/mno/file4' |
rev| sort| sed -nr ':a N;s%(^[^/]*/).*\n\1.*%&%p;D;$!ba'| uniq| rev

/ghj/nm/cvb/file2
/xyz/file2
/ghj/nm/cvb/file3
/abc/file3
/jkl/file3
/ghj/nm/cvb/file4
/abc/file4
/mno/file4
/qwe/rty/file4
'rev' reverses the order of the characters in each line,
e.g. '/xyz/file2' becomes '2elif/zyx/'.
Sorting the reversed file names brings together those which have the same basename.
The sed command keeps a 'sliding window' of two lines in the pattern space: N appends the next line and D drops the older one.
A pair of adjacent lines is printed only if both have the same first field (delimited by '/'), i.e. the same basename.
If more than two files share a basename, the overlapping pairs generate some duplicate lines, which are removed with 'uniq'. The final 'rev' restores the original path names.
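Since the thread title also mentions awk, here is a rough awk-based sketch of the same idea. It is an alternative to the sed pipeline above, not a rewrite of it: it groups full paths by their last '/'-separated field and prints only the groups with more than one member. The function name `dup_paths` is hypothetical, and it assumes no file name contains a newline:

```shell
# dup_paths DIR...: print the full path of every file whose base name
# occurs more than once under the given directories.
# Hypothetical helper; an awk alternative, not the sed method from this post.
dup_paths() {
    find "$@" -type f |
    awk -F/ '
        # $NF is the last field, i.e. the base name of the path.
        { count[$NF]++; paths[$NF] = paths[$NF] $0 "\n" }
        # Print only the groups that have more than one path.
        END { for (name in count) if (count[name] > 1) printf "%s", paths[name] }
    '
}
```

For example, `dup_paths /media/disk-1 /media/disk-2 > filesystemcomparison`. Paths that share a base name come out together, but the order of the groups is unspecified because awk does not guarantee the iteration order of `for (name in count)`.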
 
Old 09-04-2009, 12:06 PM   #7
blood_ocean
LQ Newbie
 
Registered: Sep 2009
Posts: 4

Original Poster
Rep: Reputation: 0
Sweet, thanks a lot Kenhelm. I've got my pathnames now and everything, and the output file is fantastic. I'll be able to really trim down the filesystems now that I have this command.
 
  

