LinuxQuestions.org
Old 12-14-2007, 08:12 AM   #1
ufmale
Member
 
Registered: Feb 2007
Posts: 385

Rep: Reputation: 30
comparing files in 2 large dirs


What is the best way to compare the small files in 2 dirs, each holding 900 GB of data? On Windows I use "Beyond Compare", but it cannot handle such a large number of files. Does anyone know any good tools, or how to compare them? Please help.
 
Old 12-14-2007, 08:17 AM   #2
Simon Bridge
Guru
 
Registered: Oct 2003
Location: Waiheke NZ
Distribution: Ubuntu
Posts: 9,211

Rep: Reputation: 197
What is it about them you want to compare?
What is the aim of the comparison?

i.e. you want to make sure the most up-to-date version of each file is in one (or both) directories? You want to quick-check if there are files in one directory that are not in the other? You want to make a log of differences in content between pairs of text files with the same names? You want to see if files with different names are, in fact, identical in content? See what I mean?

Usually, large batch jobs in Linux are handled with scripts.
The diff utility is often used to compare the text of two files.
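
For the "quick-check" case mentioned above (files present in one directory but not the other), here is a minimal sketch. The function name only_in_first is made up for illustration, and it assumes file names contain no newlines and that mktemp is available.

```shell
#!/bin/sh
# only_in_first DIR1 DIR2   (hypothetical helper, not a standard tool)
# Prints the relative paths of regular files found under DIR1 but not under DIR2.
only_in_first() {
    list1=$(mktemp) || return 1
    list2=$(mktemp) || { rm -f "$list1"; return 1; }
    # Run find from inside each tree so both lists use the same "./..." form.
    (cd "$1" && find . -type f | LC_ALL=C sort) >"$list1"
    (cd "$2" && find . -type f | LC_ALL=C sort) >"$list2"
    # comm -23 keeps only the lines unique to the first sorted input.
    LC_ALL=C comm -23 "$list1" "$list2"
    rm -f "$list1" "$list2"
}
```

Calling it a second time with the arguments swapped lists the files that exist only in the other tree. It compares names only, not content.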
 
Old 12-14-2007, 08:17 AM   #3
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 63
When you say compare, do you want to know what the differences are, or simply to know which files are different?

I assume the files have the same name in both directories?
 
Old 12-14-2007, 08:21 AM   #4
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
Perhaps you could produce 2 md5sum lists, one for each directory, and then compare the sums. Using the "find" command to list the files with an "-exec md5sum '{}' \;" argument, and redirecting the output to a file (">md5sumlist"), you can obtain a table of files & sums that you can use to locate duplicate files. This can be used to find duplicates when you don't know where on the filesystem a duplicate might be found.

You might consider organizing your data better so that you don't have so many files in each directory. You won't be able to use file globbing in these directories because the list would be too large to pass as arguments to a command. You will often need to resort to "find" and "xargs" so you can limit the number of arguments handled at once.
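
A minimal sketch of the find + xargs approach described above, assuming GNU find, xargs and md5sum. The function name make_md5_list is made up for illustration, and the output file should live outside the directory being scanned.

```shell
#!/bin/sh
# make_md5_list DIR OUTFILE   (hypothetical helper, not a standard tool)
# Writes one "md5sum  ./relative/path" line per regular file under DIR.
make_md5_list() {
    # cd into DIR so the recorded paths are relative and comparable across drives.
    # -print0 / -0 keep names with spaces intact; xargs batches the arguments
    # so the command line never grows too long.
    # -r (GNU xargs) skips running md5sum when no files are found.
    (cd "$1" && find . -type f -print0 | xargs -0 -r md5sum) |
        LC_ALL=C sort -k 2 >"$2"   # sort by path so two lists line up
}
```

Usage would be along the lines of: make_md5_list /mnt/drive1/data md5sumlist1 (the path here is only an example).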

Last edited by jschiwal; 12-14-2007 at 08:28 AM.
 
Old 12-14-2007, 10:15 AM   #5
ufmale
Member
 
Registered: Feb 2007
Posts: 385

Original Poster
Rep: Reputation: 30
Hmm, md5sum sounds good. I will use find to list all the files and md5sum each file. That would work. Thanks.

I was transferring data from one drive to the other, and want to make sure that everything was copied correctly. I used rsync to verify, but it crashed for some reason.
 
Old 12-14-2007, 06:53 PM   #6
MQMan
Member
 
Registered: Jan 2004
Location: Los Angeles
Distribution: Slack64 13.37
Posts: 536

Rep: Reputation: 36
Go look at md5deep. You can produce a file of md5sums, from one directory. You can feed that into md5deep, pointed at the other directory, and it will spit out only the files that are different.

I've used it to verify that a copy of a complete disk was identical to the original.

Cheers.
 
Old 12-15-2007, 07:32 AM   #7
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
If the directory structures are the same on the two drives, running the find command in the corresponding directory of each drive will produce two lists that should be identical once sorted. You could simply use "diff" on the sorted lists, or use "comm -13" on them to print only the lines unique to the 2nd list, i.e. its altered files.

E.g. comm -13 <(sort md5sumlist1) <(sort md5sumlist2)
or
sort md5sumlist1 >md5sumlist1.sorted
sort md5sumlist2 >md5sumlist2.sorted
comm -13 md5sumlist1.sorted md5sumlist2.sorted >altered_files_in_list2
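
The same steps wrapped into one function, as a rough sketch: the name changed_in_second is made up for illustration, and it assumes both lists were produced with relative paths so that unchanged files give identical lines.

```shell
#!/bin/sh
# changed_in_second LIST1 LIST2   (hypothetical helper, not a standard tool)
# Prints the lines of LIST2 that do not appear in LIST1, i.e. files whose
# checksum or path differs from the first list.
changed_in_second() {
    s1=$(mktemp) || return 1
    s2=$(mktemp) || { rm -f "$s1"; return 1; }
    # comm requires both inputs sorted in the same collating order.
    LC_ALL=C sort "$1" >"$s1"
    LC_ALL=C sort "$2" >"$s2"
    # -13 suppresses lines unique to list 1 and lines common to both.
    LC_ALL=C comm -13 "$s1" "$s2"
    rm -f "$s1" "$s2"
}
```

An empty result means every line in the 2nd list also appears in the 1st. Swapping the arguments shows files that are missing or changed the other way round.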

Last edited by jschiwal; 12-15-2007 at 07:38 AM.
 
Old 12-16-2007, 02:19 PM   #8
glenn69
Member
 
Registered: Jul 2003
Location: Chicagoland
Distribution: ArchLinux
Posts: 261

Rep: Reputation: 32
Quote:
Go look at md5deep. You can produce a file of md5sums, from one directory. You can feed that into md5deep, pointed at the other directory, and it will spit out only the files that are different.
How exactly does one create an md5sum list? I haven't had much luck figuring that out.

Thanks
 
Old 12-17-2007, 12:45 PM   #9
MQMan
Member
 
Registered: Jan 2004
Location: Los Angeles
Distribution: Slack64 13.37
Posts: 536

Rep: Reputation: 36
md5deep ... > MD5SUMS.TXT

Then use -x MD5SUMS.TXT when you launch md5deep in the 2nd directory. That way, it will print only the files that are different.

Cheers.
 
  


All times are GMT -5. The time now is 03:42 PM.
