Bash shell scripting: a file and its metadata - verification
The 2 key attributes of a file to check after a copy are the filename and a checksum of the content.
Unless you've gone out of your way to change the filename, a checksum alone is sufficient, and that's what most people use.
Of course, if you use a tool like rsync that does the checksumming as part of its work, you don't even need to do that (although you can afterwards if you're paranoid).
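A minimal sketch of that post-copy check, assuming GNU coreutils is available (the /tmp paths and filenames are made up for illustration):

```shell
# Create a sample file and copy it, then compare content checksums.
printf 'hello\n' > /tmp/src_file
cp /tmp/src_file /tmp/dst_file

# sha256sum prints "<hash>  <path>"; keep only the hash field.
src_sum=$(sha256sum /tmp/src_file | awk '{print $1}')
dst_sum=$(sha256sum /tmp/dst_file | awk '{print $1}')

if [ "$src_sum" = "$dst_sum" ]; then
    echo "checksums match"
else
    echo "checksums differ" >&2
fi
```

md5sum or cksum would work the same way; sha256sum is just a common choice when you also care about deliberate tampering, not only copy errors.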
Incidentally, use the stat command if you really want to check ctime/mtime/atime (not that it will achieve anything worthwhile).
As above, there's no such thing as 'creation time' in *nix.
If it's that important, embed it into the filename.
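For reference, GNU stat can print each of those timestamps with its -c format option (the demo file here is made up):

```shell
# Create a file, then show its three classic Unix timestamps.
touch /tmp/demo_file
stat -c 'mtime: %y' /tmp/demo_file   # last modification of the data
stat -c 'ctime: %z' /tmp/demo_file   # last change of the inode/status
stat -c 'atime: %x' /tmp/demo_file   # last access
```

Note that ctime is inode *change* time, not creation time, which is exactly why it isn't useful for the "when was this file created" question.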
That said, I was interested in rsync but I haven't been able to get it to work. Supposedly my NAS supports it, but I doubt my source machine does. I forget what went wrong exactly; maybe I was just doing it wrong....
So you mount your NAS dir somewhere and try to sync it to your "another" dir. rsync --dry-run will tell you all the differences it finds; it only needs to run on the local host, and you can use almost any version of rsync to do it. It was definitely created for exactly that kind of job, so I don't think you need to reinvent it. rsync is available for almost every OS.
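A hedged sketch of that dry-run check - /tmp/nas_copy stands in for wherever the NAS export is actually mounted, and the guard is there only so the snippet degrades gracefully if rsync isn't installed:

```shell
# Fake a mounted NAS dir and a local target, then preview the sync.
mkdir -p /tmp/nas_copy /tmp/another
printf 'data\n' > /tmp/nas_copy/file1

# -a preserves metadata, -n (--dry-run) transfers nothing,
# -i itemizes each difference rsync found.
if command -v rsync >/dev/null 2>&1; then
    rsync -a -n -i /tmp/nas_copy/ /tmp/another/
fi
```

Because of -n, /tmp/another stays empty; rsync only reports what it *would* copy, which is the safe way to compare the two trees first.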
That only helps a little, and it doesn't help with memory, as the usage of each job still adds up.
And unless the jobs run in parallel, it doesn't improve the time either.
The largest number of such jobs possible (well, on the equipment I had) was 12. After that the filesystems saturate and start causing delays - and there is still the delay while it scans the two filesystems (it would still take several weeks to transfer the data). And if a job aborts, it has to scan the filesystems again from the start.
My workaround was to write a Perl script that handled the scan in parallel with the file transfer. Making the script checkpointable allowed it to restart without repeating the scan - and without re-copying files already done. I got the scan down to 45 minutes when there was nothing to do; add the file transfer and things could take a good bit longer. The first complete pass (scan and copy) took a bit over three weeks (checkpoints every 6 hours). But repeats got faster, until a run with the normal updates took under a couple of hours.
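The poster's actual Perl script isn't shown, but the checkpoint idea can be sketched in shell: log each finished file, and on restart skip anything already in the log. All paths here are hypothetical:

```shell
# Set up a toy source/destination pair and an empty checkpoint log.
mkdir -p /tmp/cksrc /tmp/ckdst
printf 'x\n' > /tmp/cksrc/f1
printf 'y\n' > /tmp/cksrc/f2
done_log=/tmp/ck_done.log
: > "$done_log"

for f in /tmp/cksrc/*; do
    # -x matches the whole line, -F disables regex interpretation,
    # so a path in the log means "already copied" - skip it.
    grep -qxF "$f" "$done_log" && continue
    cp "$f" /tmp/ckdst/
    echo "$f" >> "$done_log"    # checkpoint after each successful copy
done
```

Run the loop again after an interruption and it does nothing for files already logged, which is the "restart without repeating work" property described above.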
Break it into x number of pieces, then run the jobs serially. And compare it to the BIG job.
date; rsync -varh --progress here there:/tmp/; date
I've done this over and over, and written time(1) comparisons of rsync runs in my department, and the serialized rsync jobs always complete first. Watching memory consumption, they always use less.
In my experience, for some reason, smaller rsync jobs finish faster than one BIG one. And I'm talking about 50% text log files and 50% binary files, numbering around 2 million at 800GB total.
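The "break it into pieces, run serially" approach above can be sketched as one rsync job per top-level subdirectory. The tree and mirror paths are invented for the example, and cp -a stands in when rsync isn't available:

```shell
# Build a toy tree with two top-level subdirectories.
mkdir -p /tmp/bigtree/a /tmp/bigtree/b /tmp/mirror
printf '1\n' > /tmp/bigtree/a/f1
printf '2\n' > /tmp/bigtree/b/f2

# One serial job per subdirectory instead of one BIG rsync.
for d in /tmp/bigtree/*/; do
    name=$(basename "$d")
    if command -v rsync >/dev/null 2>&1; then
        rsync -a "$d" "/tmp/mirror/$name/"
    else
        cp -a "$d" "/tmp/mirror/$name"   # fallback, same end result
    fi
done
```

Each small job scans only its own subtree, which is the reason the serialized runs hold less state in memory than a single whole-tree scan.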
Last edited by szboardstretcher; 05-13-2014 at 11:30 AM.
That is because the scans take so long. The logical conclusion is to do one rsync job per file....
The trouble with serial syncs is that only one file at a time is transferred - which takes forever (50 million files, remember).
And using rsync at that resolution (even down to a single directory...) is too slow. And since the directory tree is not that small, even breaking it down to the directory level doesn't work well.
BTW, the filesystems involved were 16TB. And the file servers involved would saturate due to the scans.
rsync works... it just doesn't scale well, and it isn't always possible to break a large tree down enough to make it fast. It is fast enough for leaf directories (well, normally - but if you have 5,000+ files in a directory, it isn't all that fast). For thousands of intermediate-level directories, though, it sucks. It works correctly if you start at the top level... but that doesn't allow enough of a breakdown. Even if the top level has a couple of hundred directories (plus files), you can only break it down into a couple of hundred jobs. That still leaves tens of thousands of files and directories below each one... and the scan time again adds up...
The Perl script I had completely separated the scanning from the copying. The scanning could grow as fast as the number of directories found (thus I had controls to throttle it). The copying would be done as fast as possible - but due to the nature of the network and the two filesystems involved, that was limited to about 4 files in parallel (after that, the throughput dropped).
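The scan/copy split with a cap of 4 parallel copies can be approximated with standard tools - find does the scan once, and GNU xargs -P bounds the concurrency. The tree below is fabricated for the example:

```shell
# Toy tree of 8 files standing in for the real scan output.
mkdir -p /tmp/ptree /tmp/pdest
for i in 1 2 3 4 5 6 7 8; do
    printf '%s\n' "$i" > "/tmp/ptree/file$i"
done

# Scan once (find), then copy with at most 4 workers at a time
# (-P 4), the saturation point described in the post above.
# -print0/-0 keep odd filenames safe.
find /tmp/ptree -type f -print0 |
    xargs -0 -P 4 -I{} cp {} /tmp/pdest/
```

This lacks the checkpointing and throttling controls of the original script, but it shows the core idea: the scanner and the bounded pool of copiers are separate stages feeding a pipe.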