Old 03-30-2011, 10:13 AM   #1
Skaperen
Senior Member
 
How to control file order in rsync


I am looking for a way to control the order in which files (or more specifically, certain subtrees) are updated for a very large rsync update done daily. There is a finite time frame and finite bandwidth to do these updates, and they usually do NOT complete. I want to control the order so that more important files are copied earlier.

What I have been doing is breaking the rsync run into a few steps, where the first step covers a very limited set of subtrees using --include and/or --exclude, and each step thereafter widens the set, until the last step covers the whole tree to be updated. This is not working well, because the time it takes to transfer the file list is large (varying from 10 to 40 minutes depending on the step). I really need to get this back down to a single rsync run to eliminate that wasted time.

I cannot simply split the tree into independent runs, because there are a lot of hardlinks that need to be retained (or else the destination space and the bandwidth would be exceeded).

I was looking at rsync's batch mode. My thought was that if I could get a list of what needs to be updated, I could sort that list and do just one more rsync run to apply the updates in the order I want. But it turns out the batch-mode file is in some cryptic binary format, so I really have no idea whether that could have worked.
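
For reference, this is roughly how I understand batch mode is driven (paths made up); the batch FILE it writes is the binary thing I mentioned:
Code:
# record what would be updated, without changing the destination
rsync -aH --only-write-batch=batch.bin /src/ remote:/dest/

# later, apply the recorded changes against the destination copy
rsync --read-batch=batch.bin /dest/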

The order selection is not simply based on the names in the top directory, so I cannot just list those in the preferred order on the command line (otherwise, yes, that would control the order). And if I list directory names from various depths, that list gets flattened into the destination directory, which would not replicate the whole tree exactly.
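
(A sketch of what I mean, with made-up paths: rsync's -R/--relative with a "/./" anchor would keep the tree layout at the destination, but it still leaves the transfer order up to rsync.)
Code:
# -R preserves the source path below the "/./" marker at the destination,
# so these land as /mirror/src9/sub3 and /mirror/src5 instead of flattened
rsync -aHR /data/./src9/sub3 /data/./src5 remote:/mirror/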

Any other ideas?
 
Old 03-30-2011, 01:08 PM   #2
bigrigdriver
LQ Addict
 
Along with the include/exclude options, you also have the --files-from=file option. The file is a text file listing directories/files, which will be processed in the order listed.

You may be able to generate a file quickly via:
find /some/path -mindepth 1 -printf '%P\n' > file

then edit the file to get it just the way you want it.
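
Then something along these lines should work (host and paths made up); note that the paths in the file are read relative to the source directory:
Code:
rsync -aH --files-from=file /some/path user@dest:/backup/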
 
Old 03-30-2011, 01:46 PM   #3
Skaperen
Senior Member
 

Original Poster
Quote:
Originally Posted by bigrigdriver View Post
Along with the include/exclude options, you also have the --files-from=file option. The file is a text file listing directories/files, which will be processed in the order listed.

You may be able to generate a file quickly via:
find /some/path -mindepth 1 -printf '%P\n' > file

then edit the file to get it just the way you want it.
I don't see anything in the man page that says --files-from=file will process in the order given. I ran a little test:
Code:
lorentz/phil /home/phil 927> find bar -print
bar
bar/src3
bar/src3/3.dat
bar/src5
bar/src5/5.dat
bar/src9
bar/src9/sub3
bar/src9/sub3/21.dat
bar/src9/sub2
bar/src9/sub2/20.dat
bar/src9/sub0
bar/src9/sub0/18.dat
bar/src9/sub1
bar/src9/sub1/19.dat
bar/src1
bar/src1/1.dat
bar/src8
bar/src8/sub3
bar/src8/sub3/17.dat
bar/src8/sub2
bar/src8/sub2/16.dat
bar/src8/sub0
bar/src8/sub0/14.dat
bar/src8/sub1
bar/src8/sub1/15.dat
bar/src0
bar/src0/0.dat
bar/src7
bar/src7/sub3
bar/src7/sub3/13.dat
bar/src7/sub2
bar/src7/sub2/12.dat
bar/src7/sub0
bar/src7/sub0/10.dat
bar/src7/sub1
bar/src7/sub1/11.dat
bar/src6
bar/src6/sub3
bar/src6/sub3/9.dat
bar/src6/sub2
bar/src6/sub2/8.dat
bar/src6/sub0
bar/src6/sub0/6.dat
bar/src6/sub1
bar/src6/sub1/7.dat
bar/src2
bar/src2/2.dat
bar/src4
bar/src4/4.dat
lorentz/phil /home/phil 928> cat filelist
src9/sub3
src9/sub2
src9/sub1
src9/sub0
src8/sub3
src8/sub2
src8/sub1
src8/sub0
src7/sub3
src7/sub2
src7/sub1
src7/sub0
src6/sub3
src6/sub2
src6/sub1
src6/sub0
src5
src4
src3
src2
src1
src0
lorentz/phil /home/phil 929> rm -fr foo ; rsync --files-from=filelist -adHrRSvW bar foo
sending incremental file list
created directory foo
src0/
src0/0.dat
src1/
src1/1.dat
src2/
src2/2.dat
src3/
src3/3.dat
src4/
src4/4.dat
src5/
src5/5.dat
src6/
src6/sub0/
src6/sub0/6.dat
src6/sub1/
src6/sub1/7.dat
src6/sub2/
src6/sub2/8.dat
src6/sub3/
src6/sub3/9.dat
src7/
src7/sub0/
src7/sub0/10.dat
src7/sub1/
src7/sub1/11.dat
src7/sub2/
src7/sub2/12.dat
src7/sub3/
src7/sub3/13.dat
src8/
src8/sub0/
src8/sub0/14.dat
src8/sub1/
src8/sub1/15.dat
src8/sub2/
src8/sub2/16.dat
src8/sub3/
src8/sub3/17.dat
src9/
src9/sub0/
src9/sub0/18.dat
src9/sub1/
src9/sub1/19.dat
src9/sub2/
src9/sub2/20.dat
src9/sub3/
src9/sub3/21.dat

sent 1629 bytes  received 534 bytes  4326.00 bytes/sec
total size is 0  speedup is 0.00
lorentz/phil /home/phil 930> rsync --version
rsync  version 3.0.7  protocol version 30
Copyright (C) 1996-2009 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
    64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
    socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
    append, ACLs, xattrs, iconv, symtimes

rsync comes with ABSOLUTELY NO WARRANTY.  This is free software, and you
are welcome to redistribute it under certain conditions.  See the GNU
General Public Licence for details.
lorentz/phil /home/phil 931>
It looks like it collected and sorted the file list (i.e., it ignored my order). It looks like it also sorts the argument list, contrary to what I said earlier.
 
Old 03-30-2011, 01:48 PM   #4
szboardstretcher
Senior Member
 
Code:
rsync -varhz --progress /most/important/tree /somewhere
rsync -varhz --progress /2ndmost/important/tree /somewhere
rsync -varhz --progress /3rdmost/important/tree /somewhere
rsync -varhz --progress /the/rest /somewhere
 
Old 03-30-2011, 02:14 PM   #5
Skaperen
Senior Member
 

Original Poster
Quote:
Originally Posted by szboardstretcher View Post
Code:
rsync -varhz --progress /most/important/tree /somewhere
rsync -varhz --progress /2ndmost/important/tree /somewhere
rsync -varhz --progress /3rdmost/important/tree /somewhere
rsync -varhz --progress /the/rest /somewhere
Of course that seems obvious. That is what I am doing now, or at least the include/exclude variant of that, so that each run won't split off hardlinks that need to stay hardlinked. So it's more like:
Code:
rsync ... --include most/important ...
rsync ... --include most/important --include next/important ...
rsync ... --exclude least/important --exclude nextleast/important ...
rsync ... --exclude nextleast/important ...
rsync ... ...
where that last step covers everything. This way, no run will fail to sync a file that was copied in a previous run and could be hardlinked in the current run. Since there's a bit more of a gradient in my case, it ends up as 12 steps, and more complex ones. With nearly 20 million files involved (and growing each day), each of those 12 runs starts up slowly: the first, smaller ones fairly quickly, and the last one exceedingly slowly. Bandwidth is a T1, so it takes a while to transfer all those file names. Total run time is limited to 4 hours each day (6 hours on weekends), after which the process(es) must be terminated (by the next day there are new files added, so it may or may not even get back to the resumption point). I'm typically getting 4 or 5 runs done in a day before time runs out, and 7 or 8 runs on weekends. In the past 2 months, only one weekend run even finished the whole thing.

It's those redundant restarts, and transferring the file names all over again, that I want to recover.
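
(For what it's worth, the daily cutoff is essentially just a wrapper along these lines, details elided:)
Code:
# hypothetical wrapper: hard-stop the day's sync when the 4-hour window closes
timeout 4h rsync -aH ... /src/ remote:/dest/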
 
Old 03-30-2011, 07:58 PM   #6
chrism01
LQ Guru
 
Whatever you do, it's going to have to scan and compare all(!) the files at least once (by size and mtime, or by full checksums with -c) in order to do the differences calc, so that it only sends differences, assuming that's why you're using rsync.
If so, you may be running into a fundamental limit somewhere, possibly the network connection.

If you have 12 lists/cmds, you could try a little parallelism, e.g. run the first 3 or 4 commands simultaneously, wait for the transfers to complete, then the next set, wait, then the next...
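Roughly like this (arguments elided):
Code:
# first batch in parallel; wait for all of them before starting the next batch
rsync ... list1 ... &
rsync ... list2 ... &
rsync ... list3 ... &
wait
rsync ... list4 ... &
rsync ... list5 ... &
wait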

If the overhead of the diff calc is the bottleneck, could you just set it to transfer the complete files instead: rsync (with --whole-file) or scp or rcp?
Also, are you using the encryption option, which also slows things down? If you are, ask whether you're allowed to turn that off.

Have you considered remote-mounting the destination disk via nfs or sshfs and doing a 'local' copy? It might be faster.
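E.g. something like (host and mount point made up):
Code:
# mount the remote destination, copy "locally", then unmount
sshfs user@remote:/dest /mnt/dest
rsync -aH /src/ /mnt/dest/
fusermount -u /mnt/dest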

You may need to use more than one suggestion to get in under the time limit.
 
Old 03-31-2011, 11:52 AM   #7
Skaperen
Senior Member
 

Original Poster
Running 3 or 4 in parallel would mean extra bandwidth demand we can't handle while they are transferring at the same time. And if two runs transfer the same file at the same time, that bandwidth is simply wasted, which could happen in theory, since I cannot cleanly split the ranges of files between runs.

I really just need to get things down to fewer rsync runs, or even just one. I did try doing one run covering the entire tree last year when I set this up, and it worked much faster, often getting the entire tree done. But when it did not get done, the important files were too often not reached. I can't just rename the important files to a lower collating order. I've been wondering if I could substitute my own mapped sorting for rsync's sorting.

I don't think the diff-calc overhead is the issue, since that happens only when a file appears to need updating, and each file would only hit that once across all 12 runs. I think the excess overhead is in the time to scan the file tree AND to transfer the names in order to decide whether each file should be updated.

I think the remote mount would be an issue as well, because that would still involve scanning the remote file tree ... 12 times.

At this point, the only idea I can come up with that seems plausible is to replace the sorting algorithm rsync uses with one that takes each name being compared and looks up a priority value, which gets inserted as a prefix on the name string for the comparison step.
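
In shell terms, the mapping would be a decorate-sort-undecorate pass like this sketch (reusing the src9/src8 names from my test); the missing piece is that rsync itself would have to honor the resulting order, which is exactly what it doesn't do:
Code:
# prefix each path with a priority, sort on it, then strip the prefix off
find bar -mindepth 1 -printf '%P\n' | awk '
    { pri = 9 }               # default: lowest priority
    /^src9\// { pri = 1 }     # most important subtree first
    /^src8\// { pri = 2 }
    { print pri, $0 }
' | sort -s -n -k1,1 | cut -d" " -f2- > filelist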
 
Old 02-27-2013, 12:27 PM   #8
miketwo
LQ Newbie
 
Solution?

Skaperen, I find myself in a similar situation. Did you ever find a workable solution?
 
Old 06-16-2020, 05:58 AM   #9
pps753
LQ Newbie
 
Solution! But a cheap one...

This is a very old thread; nevertheless, I found a solution nowhere else.

One solution is to rename the directories/files so they show up alphabetically in the order I want them to be processed.
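
For example (names made up):
Code:
# numeric prefixes force the alphabetical order rsync will follow
mv photos  10-photos
mv mail    20-mail
mv archive 90-archive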

Worked for me on Ubuntu and Synology.
 
Old 06-16-2020, 07:43 AM   #10
wpeckham
LQ Guru
 
I found a solution, but it may not help you.
I had unused NIC interfaces on my servers, so I ran CAT-5 between them, locked them at the highest speed, full duplex, and did the transfer over that connection.

No worry about saturating that network, since that was what it was MADE for and it was the only traffic on that connection. The actual transfers ran faster, but the big thing was that if the transfer ran long, the impact on production was minimal and acceptable.
 
Old 06-19-2020, 08:10 PM   #11
Skaperen
Senior Member
 

Original Poster
a workaround, which did not feel like it worked as well as i think controlled order would have, was to look for files dated after the last successful run and just copy them directly before the big rsync run. it improved things, but i didn't consider it a solution.
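
roughly this, with a made-up stamp file that gets touched after each successful full run:
Code:
cd /src &&
find . -type f -newer /var/tmp/stamp -print0 |
  rsync -aH --from0 --files-from=- . remote:/dest/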

i am no longer working on that problem because i no longer work there.
 
  

