LinuxQuestions.org
Old 10-01-2012, 04:15 AM   #1
B Akshay
Member
 
Registered: Sep 2012
Posts: 39

Rep: Reputation: Disabled
cksum usage


I am trying to get the CRC of a folder containing nearly 25K files, but the process takes too long.
Can anyone please suggest an alternative? (I am a newbie at this.)

The application is to detect changes made to any of the files in the folder.


Thanks & Regards.
 
Old 10-01-2012, 04:58 AM   #2
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,576
Blog Entries: 31

Rep: Reputation: 1195
The cheapest way to do this task is to:
  1. Build a record of every file's name, mtime (last modification time) and checksum. For 25k files it would be worth using a database.
  2. When looking for changed files, loop over their names and:
    1. If the name is not in the record, the file is new; otherwise ...
    2. If the mtime is unchanged, the file is unchanged; otherwise ...
    3. If the checksum has changed, the file has changed.
EDIT: this is for normal file system usage. It does not work if whatever changes the files also resets their mtime to the old value.
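
For readers who want something concrete, the steps above can be sketched in shell roughly as follows. This is only a sketch under assumptions, not code from this thread: the function names build_record and check_changes and the tab-separated record format are made up for illustration, a plain text file stands in for the suggested database, GNU stat is assumed, and file names containing tabs or newlines are not handled.

```sh
#!/bin/sh
# build_record DIR RECORD
# Write one "mtime<TAB>crc<TAB>path" line to RECORD for every file under DIR.
build_record() {
    find "$1" -type f | sort | while IFS= read -r f; do
        mtime=$(stat -c %Y "$f")                 # GNU stat: mtime in epoch seconds
        sum=$(cksum < "$f" | cut -d' ' -f1)      # CRC only, without size or name
        printf '%s\t%s\t%s\n' "$mtime" "$sum" "$f"
    done > "$2"
}

# check_changes DIR RECORD
# Print "NEW path" or "CHANGED path" for files that differ from RECORD.
check_changes() {
    find "$1" -type f | sort | while IFS= read -r f; do
        # Find the recorded mtime and CRC for exactly this path, if any.
        rec=$(awk -F'\t' -v p="$f" '$3 == p { print $1 " " $2; exit }' "$2")
        if [ -z "$rec" ]; then
            echo "NEW $f"
        elif [ "${rec%% *}" = "$(stat -c %Y "$f")" ]; then
            :   # same mtime: assume unchanged and skip the expensive checksum
        elif [ "${rec##* }" != "$(cksum < "$f" | cut -d' ' -f1)" ]; then
            echo "CHANGED $f"
        fi
    done
}
```

Typical use would be build_record once, then check_changes on each later scan. The awk lookup rescans the record for every file, which is one reason a real database (or at least a single-pass join) is worth having at 25k files.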

Last edited by catkin; 10-01-2012 at 05:01 AM.
 
Old 10-01-2012, 05:17 AM   #3
B Akshay
Member
 
Registered: Sep 2012
Posts: 39

Original Poster
Rep: Reputation: Disabled
Thanks a lot, catkin!

Option 2 that you suggested is good, but won't it take more time?

Can you suggest any other command like cksum that will do this work, or do I have to write a script for it?

What about hash tables? How can I create hash tables in Linux for the question above?
 
Old 10-01-2012, 05:29 AM   #4
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 8,104

Rep: Reputation: 2267
Can you tell us the total size of the directory? rsync could probably be used, but it may not be very efficient.
 
Old 10-01-2012, 05:43 AM   #5
B Akshay
Member
 
Registered: Sep 2012
Posts: 39

Original Poster
Rep: Reputation: Disabled
On average the directory is 2.5 GB. This is needed because many users are changing the files (in different copies of the same directory) by trial and error, so at the end we want to check only the changed files.

Can this CRC or hash table approach be done in Perl, and can I invoke it from a bash script?
Will it be easy in Perl?

Thanks for your response!!!
 
Old 10-01-2012, 05:46 AM   #6
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,576
Blog Entries: 31

Rep: Reputation: 1195
Any checksum is expensive in computer resources, because every byte of every file has to be read.
 
Old 10-01-2012, 05:53 AM   #7
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,576
Blog Entries: 31

Rep: Reputation: 1195
Quote:
Originally Posted by B Akshay
Option 2 that you suggested is good, but won't it take more time?

Can you suggest any other command like cksum that will do this work, or do I have to write a script for it?

What about hash tables? How can I create hash tables in Linux for the question above?

Item 2 is designed to run in the shortest time, assuming few files have changed. The loop on file names would be required for all solutions. Getting a file's mtime is very cheap/quick compared to computing its checksum.
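
To illustrate how cheap an mtime test is, here is a minimal sketch that asks find for files modified after a reference "stamp" file and checksums only those. The function names and the idea of keeping a stamp file are assumptions for illustration; find reads only directory entries and inode metadata here, not file contents.

```sh
#!/bin/sh
# list_modified_since DIR STAMP
# Print files under DIR whose mtime is newer than that of the file STAMP.
list_modified_since() {
    find "$1" -type f -newer "$2" | sort
}

# cksum_modified_since DIR STAMP
# Run cksum only on the files that the mtime test flags as candidates.
cksum_modified_since() {
    list_modified_since "$1" "$2" | while IFS= read -r f; do
        cksum "$f"
    done
}
```

After each scan, touching the stamp file again would move the reference point forward, so the next run only looks at files modified since then.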

SHA1 sums are considered more robust than MD5 sums but AFAIK take longer.

Please clarify your last question.
 
Old 10-01-2012, 07:16 AM   #8
B Akshay
Member
 
Registered: Sep 2012
Posts: 39

Original Poster
Rep: Reputation: Disabled
Can a CRC be obtained using a Perl script, and will it be easy to use? Does Perl have a built-in command for it, like cksum?

What are SHA1, MD5 and AFAIK?
I just want a direct implementation that takes less processing time.
 
Old 10-01-2012, 07:39 AM   #9
Ginola
Member
 
Registered: Sep 2012
Location: UK
Distribution: CentOS, RHEL
Posts: 65

Rep: Reputation: Disabled
Out of interest, can I ask why you want to cksum 25K+ files in a single directory? Seems like you have given yourself a mountain to climb. I would split them into sub-directories, then monitor changes in said sub-directories with something like this:

for DIR in */; do echo "$DIR"; tar cf - "$DIR" | cksum; done

You could even run something similar in parallel to get going quicker.
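
One way the parallel variant might look is sketched below, using xargs -P to run several tar | cksum pipelines at once. The function name and the job-count argument are made up for illustration, and GNU find and xargs are assumed. Note that a tar stream also contains metadata such as mtimes and ownership, so this checksum can change when a file is merely touched.

```sh
#!/bin/sh
# cksum_subdirs_parallel PARENT JOBS
# Print "<crc> <bytes> <subdir>" for each immediate subdirectory of PARENT,
# running at most JOBS tar|cksum pipelines at the same time.
cksum_subdirs_parallel() {
    ( cd "$1" || exit 1
      find . -mindepth 1 -maxdepth 1 -type d -print0 |
        xargs -0 -P "$2" -I{} sh -c \
          'printf "%s %s\n" "$(tar cf - "$1" | cksum)" "$1"' sh {} |
        sort -k3 )   # parallel jobs finish in any order, so sort by directory name
}
```

For example, cksum_subdirs_parallel /path/to/folder 4 would use up to four processes. Whether this is actually faster depends on whether the disk or the CPU is the bottleneck.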
 
Old 10-01-2012, 05:05 PM   #10
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,576
Blog Entries: 31

Rep: Reputation: 1195
Quote:
Originally Posted by B Akshay
Can a CRC be obtained using a Perl script, and will it be easy to use? Does Perl have a built-in command for it, like cksum?

What are SHA1, MD5 and AFAIK?
I just want a direct implementation that takes less processing time.

The Internet Slang Dictionary is useful for, er, Internet Slang like AFAIK.

IDK (see Internet Slang Dictionary!) about Perl's checksumming facilities but, if they do exist which is likely, they will take around the same resources as native GNU/Linux commands like cksum.

Wikipedia is good for the likes of MD5 and SHA1. They are the checksums most commonly in use.
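
For a quick feel of the difference, the following sketch prints all three sums for one file using the standard coreutils commands cksum, md5sum and sha1sum. The function name show_sums is made up for illustration. cksum produces a 32-bit CRC, MD5 a 128-bit digest and SHA-1 a 160-bit digest.

```sh
#!/bin/sh
# show_sums FILE
# Print the CRC, MD5 and SHA-1 of FILE, one labelled line each.
show_sums() {
    printf 'cksum %s\n' "$(cksum   < "$1" | cut -d' ' -f1)"
    printf 'md5   %s\n' "$(md5sum  < "$1" | cut -d' ' -f1)"
    printf 'sha1  %s\n' "$(sha1sum < "$1" | cut -d' ' -f1)"
}
```

Running each command under time on a large file is an easy way to compare their cost on a particular machine.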

As already stated, checksumming is inherently resource-intensive. The people who wrote the various utilities available will have tried to do a good job, so different implementations of the same checksum will have much the same performance.
 
Old 10-01-2012, 05:26 PM   #11
jefro
Moderator
 
Registered: Mar 2008
Posts: 15,374

Rep: Reputation: 2198
Would time stamps on the files be of any use in this?

Depending on the types of files, other tools such as diff may also be usable.
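
If an untouched original copy of the directory is available, diff can compare two trees directly. A minimal sketch, with a made-up function name: -r recurses into subdirectories and -q reports only whether files differ, not how.

```sh
#!/bin/sh
# list_differences ORIG COPY
# Print one line per file that differs, or that exists in only one tree.
# The exit status is diff's own: 0 if the trees match, 1 if they differ.
list_differences() {
    diff -rq "$1" "$2"
}
```

This reads both copies in full, so it is no cheaper than checksumming, but it needs no stored record.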
 
Old 10-02-2012, 06:38 AM   #12
B Akshay
Member
 
Registered: Sep 2012
Posts: 39

Original Poster
Rep: Reputation: Disabled
As suggested by Ginola, if I do the same using fork, so that several processes simultaneously run cksum on the subfolders of the main folder, will it reduce the processing time? I have not tried this yet.
 
Old 10-02-2012, 06:52 AM   #13
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 8,104

Rep: Reputation: 2267
rsync uses a lightweight rolling checksum for its delta transfers (and, by default, compares only size and mtime to decide which files to look at), so I would say it would be faster than sha1 or md5 (they are both "high precision" checksums). The only additional requirement is to have an original set of files to compare with. rsync will compare the two directory trees and find all the differences (and it can also sync them).
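
As a sketch of what such a comparison could look like, the function below runs rsync in dry-run mode so nothing is copied. The function name is made up. The flags are standard rsync options: -r recurses, -c compares files by whole-file checksum instead of the default size and mtime check, -n is a dry run, and -i itemizes each difference.

```sh
#!/bin/sh
# rsync_changed ORIG COPY
# List files in ORIG whose contents differ from, or are missing in, COPY.
# Nothing is transferred because of -n (dry run).
rsync_changed() {
    rsync -rcni "$1/" "$2/"
}
```

With -c every file on both sides is read and checksummed, so this is not cheaper than cksum. Dropping -c and adding -t makes rsync rely on size and mtime, which is much faster but subject to the same mtime caveat mentioned earlier in the thread.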

Running cksum simultaneously will speed up the whole process, but it will also increase the load on the system (you probably cannot start 25k processes at the same time, so you need to organize it somehow).
 
Old 10-02-2012, 07:12 AM   #14
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,240

Rep: Reputation: 2324
You can definitely multi-process in, e.g., Perl, but have you considered using a code control (version control) system, so that you get a record each time a file is changed, instead of having to post-process all 25k files, e.g. once a day?
Also, as above, checking mtimes will be (much) faster than checksumming file contents.
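
As one possible illustration of the code control idea, the sketch below uses git. That choice is an assumption, since no particular system is named in this thread, and the function names and the placeholder identity are made up. git keeps its own index of sizes and mtimes and only re-hashes files that look modified, which is the same strategy as the mtime-then-checksum approach.

```sh
#!/bin/sh
# start_tracking DIR
# Put DIR under git and commit its current contents as the baseline.
start_tracking() {
    ( cd "$1" || exit 1
      git init -q . &&
      git add -A &&
      git -c user.name=scan -c user.email=scan@example.invalid \
          commit -q -m "baseline" )
}

# changed_since_baseline DIR
# Print one line per modified (" M"), new ("??") or deleted (" D") file.
changed_since_baseline() {
    ( cd "$1" && git status --porcelain )
}
```

A side benefit is that git diff can then show what changed in each file, not just that it changed.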
 
Old 10-02-2012, 08:06 AM   #15
B Akshay
Member
 
Registered: Sep 2012
Posts: 39

Original Poster
Rep: Reputation: Disabled
pan64, rsync would find the differences between two directory trees, but my application needs some value or mark, such as a checksum, to be present for the original as well as for each updated version, so that it can generate a report that a file's data has changed. Also, the number of versions is more than 10.


chris, checking mtime alone does not meet the requirement.


Can anyone suggest how to organize the subfolders so that they can be processed with fork? Are there any likely errors when using fork?
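
One possible arrangement, sketched under assumptions: fork one background job per version directory (rather than per file), have each job write a sorted manifest of checksums, and then compare manifests to produce the report. The function names and the .manifest naming are made up for illustration. In shell, a trailing & is what forks the job and wait collects them all.

```sh
#!/bin/sh
# make_manifest DIR OUT
# Write "<crc> <bytes> <relative path>" for every file in DIR, sorted by path.
make_manifest() {
    ( cd "$1" && find . -type f -exec cksum {} + | sort -k3 ) > "$2"
}

# manifest_all OUTDIR DIR...
# Start one background job per version directory, then wait for all of them.
manifest_all() {
    outdir=$1; shift
    for d in "$@"; do
        make_manifest "$d" "$outdir/$(basename "$d").manifest" &
    done
    wait
}

# report_changes OLD NEW
# Print the manifest lines that appear in only one of the two manifests.
report_changes() {
    diff "$1" "$2" | grep '^[<>]'
}
```

With ten or so versions this starts ten or so processes, which avoids the problem of forking once per file. The usual pitfalls are forgetting to wait before reading the results and starting more jobs than the disk can usefully serve.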


Thanks!!
 
  


All times are GMT -5. The time now is 12:45 AM.
