LinuxQuestions.org


bkarthick 07-29-2013 05:23 AM

Need help copying huge files
 
Hi All,

My manager has given me the task below to complete in one week, so I need your help.

"For promoting contents which are huge in size from one Unix server to another a tool/script will have to be developed which will take a backup and push the contents between servers. You need not develop the tool, but you will have to provide a detailed document on how this script can be developed, scenarios where this can be utilized and what are the tools that are available in the market which serves this purpose"

pan64 07-29-2013 05:49 AM

It depends on the devices, the network, and related factors. You could start by reading the history of rsync (see: http://www.samba.org/~tridge/phd_thesis.pdf). But you can find other backup tools too (not to mention a single scp command).
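For a one-off copy of a single big file, plain scp can be enough; a minimal example (the host and paths are made up):

Code:

# -p preserves modification times and modes
scp -p /data/bigfile.tar user@remote:/backup/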

TobiSGD 07-29-2013 06:34 AM

For huge files I would always prefer rsync over scp, since it can resume interrupted copies, so that you don't have to start from the beginning.
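Something like this (paths and host invented for the example) keeps partially transferred files around so a re-run can resume them:

Code:

# -a archive mode, -v verbose; --partial keeps interrupted files
# so the next run can pick up where it left off
rsync -av --partial --progress /data/ user@remote:/backup/data/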

TB0ne 07-29-2013 08:30 AM

Quote:

Originally Posted by bkarthick (Post 4998804)
Hi All,
My manager has given me the task below to complete in one week, so I need your help.

"For promoting contents which are huge in size from one Unix server to another, a tool/script will have to be developed which will take a backup and push the contents between servers. You need not develop the tool, but you will have to provide a detailed document on how this script can be developed, scenarios where it can be utilized, and what tools are available in the market which serve this purpose"

So...your manager gave you a job, and you'd like US to do it for you? How about telling us what you've come up with already, and showing us what effort you've put forth on your own?

rsync and scp have already been mentioned, but you don't provide enough detail for anyone to give you much more. Are you using a SAN? What's the bandwidth between the servers? Do you already have a backup system in place, and if so, why can't you just use it? Have you considered DRBD for that content? What do you consider 'huge in size'?

szboardstretcher 07-29-2013 08:40 AM

If I can get my hands on the design document, I'm sure I could whip something up real quick. But, honestly, I'd probably feel like I should be paid for it.

It's not often that people actually come out and say "I have X to do at work, what do I do?"

It might be time to re-evaluate your position if copying data from A to B is going to be a difficult project for you.

jpollard 07-29-2013 11:10 AM

It depends on the definition of "huge".

One problem I have seen with rsync is that it has to scan the directory tree for new files before it even starts on the first one.

When you have 50 million files to scan, that can take several days before any data moves.

Now, copying a few 100-200 GB files is not that hard. Copying 50,000 of them might be...

szboardstretcher 07-29-2013 11:25 AM

Rsync is nice, as you mentioned, but better suited to a "smaller" set of files.

If you need to copy 50,000,000 files to another machine, the fastest way I know is with 'dd', 'netcat' and 'bzip2', but there is a lot that goes into doing it, and the circumstances have to be just right.
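An untested sketch of the basic pipeline (with tar in place of dd, since here you usually want a file tree rather than a raw device; the host and port are made up, and netcat option syntax differs between the traditional and BSD versions):

Code:

# On the receiver: listen, decompress, unpack
nc -l -p 9999 | bzip2 -d | tar xf - -C /dest

# On the sender: pack, compress, stream
tar cf - /data | bzip2 -c | nc receiver.example.com 9999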

TobiSGD 07-29-2013 12:21 PM

Quote:

Originally Posted by szboardstretcher (Post 4998968)
Rsync is nice, as you mentioned, but better suited to a "smaller" set of files.

If you need to copy 50,000,000 files to another machine, the fastest way I know is with 'dd', 'netcat' and 'bzip2', but there is a lot that goes into doing it, and the circumstances have to be just right.

In that case it may be a good idea to use ssh together with tar and [insert favorite compression command here], if encrypted transmission is necessary. netcat transmits the data unencrypted, AFAIK.

szboardstretcher 07-29-2013 12:27 PM

Quote:

Originally Posted by TobiSGD (Post 4998995)
In that case it may be a good idea to use ssh together with tar and [insert favorite compression command here], if encrypted transmission is necessary. netcat transmits the data unencrypted, AFAIK.

TobiSGD is 100% correct: netcat is unencrypted, but far, far FASTER than an encrypted transfer. I use it because I have a closed network and nothing above 'sensitive' as far as information goes. If I were sending over the tubes I would *at least* use SSH.

If you want to pipe a compressed tar over ssh, that's certainly an option, but it will be slower.

SSH supports its own compression out of the box, as well.
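For example, letting ssh handle both the encryption and the compression (paths and host are made up):

Code:

# -C enables ssh's built-in compression
tar cf - /data | ssh -C user@remote 'tar xf - -C /backup'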

jpollard 07-29-2013 12:48 PM

That is why we need more information before making any full recommendations. One of the issues is "push"... does this mean it needs to be done more than once? Is it a true backup being copied, or just files being copied to multiple servers? Is NFS connected between them? How many files, how large, how often?

I did develop a Perl script to migrate from one 12 TB filesystem to a 16 TB system. It was not exactly fast, but there were other considerations (not changing the access time, for one) and the need to sync the two while online. rsync was too slow, some of the files could change while copying, backup/restore was too slow (a single network connection for the entire thing), and new files had to be found faster than they could be created.

My system had NFS mounts to both servers, so I could use a multi-threaded search (45 minutes for scanning both filesystems with 12 threads, with no updates), plus a couple of threads doing nothing but copying the files identified by the first 12. And it had a checkpoint/restart feature.

It didn't try to resume file copies, but the individual files were small enough (5-10 MB) that it didn't matter. What did matter was resuming the search threads (and the list of identified files). The NFS servers DID have to be tuned for this (I ended up using 64 NFS daemons to keep things busy).
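The same divide-and-conquer idea can be roughly approximated in shell, assuming the tree splits reasonably at the top level and you have GNU find/xargs (directory names here are invented, and files sitting directly in the top level are ignored in this sketch):

Code:

# Fan out one rsync per top-level subdirectory, 12 at a time
cd /src &&
find . -mindepth 1 -maxdepth 1 -type d -print0 |
  xargs -0 -I{} -P12 rsync -a {}/ user@remote:/dest/{}/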

