Linux - General: This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.
I have a backup script that syncs the entire filesystem on my laptop to a backup system on the LAN. It runs every night, and as soon as the rsync finishes, it renames the backup directory to <name>.complete and updates a "latest" symlink to point to it. When it runs the next time, it grabs the target of the symlink (the previous complete backup) and uses that as the argument to --link-dest.
The goal is that I have daily incremental backups with a symlink that always points to the most recent one. Any backups that fail part way through are left on the backup machine, but aren't renamed to .complete, don't get the symlink updated, and aren't used as the source for --link-dest on the next backup.
This backup system has been working flawlessly every night for the last ~3 months.
For the last three days, however, the backups have not completed. Today I decided to run the backup interactively to see what was happening, and oddly it simply stopped during the transfer as soon as it got to /var/lib/mlocate/mlocate.db. I went back and looked at the previous three backup attempts, and sure enough all of them got to mlocate and then stopped. As soon as rsync prints out that filename (using -v flag) all activity stops, the file does not get created/hardlinked on the destination, and the rsync process on the destination quits. It then sits there indefinitely, the rsync process on the laptop still running, the ssh connection on both sides still running, but CPU usage stuck at 0, no activity, and no mlocate.db gets created on the destination.
The file is only 8 MB and permissions are normal. Fearing some kind of corruption, I copied the file to my home directory on the laptop, which went fine, I re-ran updatedb, and ran locate a few times, all behaved fine. I then went into the backup machine and deleted mlocate.db from the most recent complete backup (the one that it's been trying to use with --link-dest), thinking that it was getting stuck on the comparison and without that file it would just copy it over from the laptop from scratch, but this changed nothing.
What could be going on? Why does it keep stopping at this exact file every time? Disk usage at the destination is only 20%, inode usage is 2%.
The laptop is on OpenSUSE Tumbleweed, the backup system is CentOS 7, both up to date.
A couple of thoughts / things I would try to debug further:
First, confirm that the issue is really related to that specific file.
--exclude the file in the script to see if rsync still gets stuck and, if so, whether it stops right before or right after the point where that file's turn would have come.
Alternatively, to test the same thing, you could manually rsync only that file (using the same rsync options, obviously).
If the above confirms that this file is causing the issue, you could consider just excluding it; a backup of the mlocate database has no value imho (you can always rebuild it if needed, and when you need to recover data from the backup you probably won't want to overwrite your current database with one from the backup anyway).
If you want to analyze further just for the sake of understanding (personally, I guess I would), you could run a watch on lsof for the file while rsyncing, to see if another process is somehow locking it. (I'd watch on both machines, assuming you actually can on the NAS.)
And... sorry for stating the obvious... have you tried rebooting (both machines)?
Lastly: A side comment: I am backing up my laptops and my wife's laptop with a similar script. I have had good success running the cronjob hourly, not daily, for the last ~ 3.5 years. Due to the high frequency each individual run only takes a couple of seconds and I don't have to make sure the laptops are running at specific times / over night...
And that's the end of it. When it got to that point both ssh and rsync on both sides were still running, but after about 10-20 seconds the rsync process on the backup machine exited. Now it's just sitting there, rsync and ssh running on the laptop, ssh running on the backup machine, with no activity and no output after the above. /var/lib/rpm/Basenames is ~12 MB, should take less than a second to transfer.
I've rebooted both machines since this problem started a couple of days ago.
Last edited by suicidaleggroll; 11-10-2015 at 02:58 PM.
Weird. Sorry for re-asking... have you tried rebooting? This all sounds a bit like stuck processes or similar. Maybe an old ghost instance of rsync is still hanging around? (By the way, I recommend using lock files for such scripts; I can post code if you are interested.)
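Since lock files came up: a common sketch uses flock(1) from util-linux, holding an exclusive lock on a spare file descriptor for the duration of the run. The lock path here is a hypothetical choice; pick one per script.

```shell
#!/bin/sh
# Guard a cron'd backup script against overlapping runs with flock(1).
LOCK=/tmp/lq-backup.lock     # hypothetical lock path

(
    # Try to take an exclusive lock on fd 9; bail out immediately if a
    # previous instance still holds it, rather than queueing behind it.
    flock -n 9 || { echo "another backup is already running" >&2; exit 1; }

    # ... the actual rsync call would go here ...
    echo "backup running with lock held"
) 9>"$LOCK"
```

The lock is released automatically when the subshell exits, even if the script crashes mid-run, which is the main advantage over hand-rolled "touch a pidfile" schemes.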
Do you have other machines you could use to test to narrow things down? E.g. test from another laptop with the NAS, and test from your laptop to another machine?
What network link are you using? If it's a wifi connection, consider testing ethernet.
Yeah both machines have been rebooted, today actually. There aren't any old rsync processes, there was when I got up this morning, but I shut it down and then rebooted the laptop, then rebooted the backup machine, then tried to re-run the backup by hand.
I could rsync from the laptop to the main server, maybe I'll try that tonight.
It's on wifi, but the connection is stable. I'm actually doing all of this work remotely. If the connection was hanging, then my ssh session where I'm monitoring the progress would be hanging too. Might as well plug it in when I back up to the other machine anyway though, it'll run a little faster.
I just went back and looked at my logs too. 11/7 at 2am was the last successful backup, and 11/7 at 6pm I did an update (standard zypper up) which included a few hundred packages. Neither rsync nor ssh were in the list of updates though, and neither were any of the libraries that are used by rsync, as near as I can tell. It did include a kernel update, among other things (4.2.3-1 to 4.3.0-1).
Last edited by suicidaleggroll; 11-10-2015 at 03:24 PM.
Lovely...I add strace to try and track down the problem, and it fixes the problem.
Six backup attempts in a row that all fail at the exact same point (except that one where I removed mlocate.db, where it failed a few files later) with the exact same symptoms. Then I add an strace to the call, and it runs to completion. I'm back to the original script now with no changes, and it still runs just fine. I can't reproduce the problem. Six successful backups in a row now.
You know when you're writing some software and you run into a seg fault, so you add some print statements to track it down, and it stops seg faulting? I have a coworker who jokingly refers to those as "load-bearing print statements". Apparently that was a load-bearing strace...
Interesting indeed. Glad to hear your problem is solved, although the solution is a bit unsatisfying. One last thing you may want to do is run a memtest and an extended self-test of the hard drives with smartmontools. When a system behaves that weirdly, it rings a bell in my head that the problem might be hardware-related.
It sounds like editing and saving the script solved the problem. Could it be that the script got corrupted in some not-so-obvious way, like a bad byte that doesn't display, and the editor corrected it?
If your backups over the last few days succeeded in transferring the script file, you might be able to discover the corruption by comparison (using cmp or md5sum) of the script file to its backed-up versions over the last several days.
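That comparison takes only a few lines of shell; here is a self-contained sketch, with dummy files (hypothetical paths) standing in for the script and its backed-up snapshots:

```shell
#!/bin/sh
# Checksum the current script against the copy in each backup snapshot.
# Dummy files stand in for the real script and snapshot directories.
T=/tmp/lq-cmp-demo
rm -rf "$T"
mkdir -p "$T/backups/day1" "$T/backups/day2"
printf 'echo hello\n' > "$T/backup.sh"              # current script
printf 'echo hello\n' > "$T/backups/day1/backup.sh" # identical copy
printf 'echo hellp\n' > "$T/backups/day2/backup.sh" # one bad byte

cur=$(md5sum < "$T/backup.sh")
for f in "$T"/backups/*/backup.sh; do
    if [ "$(md5sum < "$f")" = "$cur" ]; then
        echo "match:   $f"
    else
        echo "DIFFERS: $f"
    fi
done
```

Any "DIFFERS" line pinpoints the snapshot where the file changed; cmp -l on that pair would then show exactly which byte differs.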
Actually I think I have an idea what the problem was (not exactly, but it makes more sense than a load-bearing strace).
I added the strace to the rsync command in the script, then ran the script in a screen session. It hung like the previous ones did, but unfortunately screen screwed up and wouldn't let me scroll back to see the relevant output, just the last couple of lines that said it had timed out. No worries, I just Ctrl+C'd the rsync like normal, then ran the backup script again in the foreground so I could actually see the output. This time it ran to completion (along with all subsequent runs).
I was thinking that adding the strace fixed it, but what I think happened is the first one ran and hung, and when I hit Ctrl+C I killed rsync and then strace finished normally. The script saw the 0 exit code from strace, assumed rsync finished correctly, and renamed the directory and updated the symlink as if the backup completed successfully. The next time I ran the script it used that one as --link-dest, rather than the backup from 11/7, and when it reached the point where the previous one had hung it just copied over everything from the laptop fresh (since it was hanging in /var/lib, it was basically at the end of the backup anyway, since rsync runs alphabetically).
So I think there was some file in the 11/7 backup that was corrupt, and rsync was crashing/hanging whenever it tried to access it. I guess the problem file was just a few files ahead of what it was printing...maybe rsync does some read-ahead to accelerate the transfer? So deleting the file it last printed before hanging didn't change anything, but maybe if I had just deleted everything after that file alphabetically it would have been fine.
If I get a bee in my bonnet I might switch the symlink back to the 11/7 backup and do more investigation into what file exactly was triggering the problem.
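One way to harden a script against exactly this failure mode (a wrapper such as strace, timeout, or a flaky screen session masking rsync's real exit status) is to key the rename on a sentinel file written only by a clean inner exit, rather than on whatever exit code reaches the script. A minimal sketch, with `true` standing in for the rsync call and hypothetical paths:

```shell
#!/bin/sh
DEST=/tmp/lq-sentinel-demo
rm -rf "$DEST" "$DEST.complete"
mkdir -p "$DEST"

# "true" stands in for the real rsync invocation. The sentinel is written
# only if the inner command itself succeeds, so the exit status of any
# wrapper around it can't fake a success.
true && touch "$DEST/.rsync-ok"

# Rename to .complete only if the sentinel proves a clean rsync exit.
if [ -e "$DEST/.rsync-ok" ]; then
    rm "$DEST/.rsync-ok"
    mv "$DEST" "$DEST.complete"
fi
```

With this, an interrupted run under strace would leave the directory without a sentinel, and it would never be renamed or used as --link-dest.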
Interesting. I guess if I were in your shoes, I'd run a SMART self-test on the hard drive in the NAS.
Also, you could probably cp -rv the 11/7 backup to /dev/null and hope to see a read error on the corrupt file.
This is a very interesting thread.
I do think this could be true - it makes sense - in fact I'd be surprised if it didn't.
Quote:
maybe rsync does some read-ahead to accelerate the transfer
Having read the entire thread, I do think that it's most likely corrupt file blocks in the (src) target files, not the script itself.
I would crank up the smartmon tool and take a look.
It is an old drive, but no reallocated sectors or anything else out of the ordinary.
Code:
# smartctl -a /dev/sdc
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-229.20.1.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Hitachi Deskstar 5K4000
Device Model: Hitachi HDS5C4040ALE630
Serial Number: PL2311LAG1G9PC
LU WWN Device Id: 5 000cca 22ec0aa19
Firmware Version: MPAOA3B0
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5700 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Thu Nov 12 08:57:04 2015 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART STATUS RETURN: incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (44188) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 737) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 134 134 054 Pre-fail Offline - 103
3 Spin_Up_Time 0x0007 169 169 024 Pre-fail Always - 424 (Average 401)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 38
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 113 113 020 Pre-fail Offline - 42
9 Power_On_Hours 0x0012 097 097 000 Old_age Always - 24341
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 35
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 719
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 719
194 Temperature_Celsius 0x0002 125 125 000 Old_age Always - 48 (Min/Max 18/63)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 10
I don't like the high temps (I always like to keep my spinning drives under 40 C), but it's an external drive in a sealed box with no fan, so there's not a lot I can do there.
Last edited by suicidaleggroll; 11-12-2015 at 10:01 AM.
I guess I'd add some extra code or write a simple monitor to keep an eye on the rsyncs and see if it starts occurring again.
It may(!) be that the disk is in the very early stages of dying, so basically ok but starting to lose integrity/reliability.
If it does happen (more than once), you could try backing up everything (manually, of course) and then re-running mkfs.blah on that disk.
I would then keep an eye on it, e.g. with a cron'd smartmon check or some such.
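For the recurring SMART check, one option is to let smartd do the scheduling itself; a hypothetical /etc/smartd.conf line for the drive above might look like the following (device name, schedule, and mail target are assumptions; see smartd.conf(5) for the directive syntax):

```
# Monitor /dev/sdc (-a = all attributes), run a long self-test every
# Sunday between 03:00 and 04:00, and mail root on failures.
/dev/sdc -a -s L/../../7/03 -m root
```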
Well it did it again, a few days and ~8 successful backups since I "fixed" it before. After I "fixed" it before I also moved the backups to a brand new drive, and it was on that drive that it failed this time, so I think drive failure can be safely eliminated.
I have a feeling it's related to ACLs on the files in /var/log/journal (just a few files after /var/lib, where it was stopping before). Since day 1 it's been printing out ACL warnings during the backup of those files (mostly harmless - "unable to set ACL permissions" or "transfer ACL permissions" or something along those lines); maybe something with the ACLs was locking up the read on the destination when it checked the --link-dest?
I don't need those logs backed up anyway, so I excluded them from the backup which stopped the warnings, we'll see if the hangups come back.
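For anyone following along, the anchored exclude semantics can be verified locally; a sketch with throwaway directories (hypothetical paths), mirroring the --exclude=/var/log/journal a script like this would use:

```shell
#!/bin/sh
# Demonstrate an anchored --exclude: /var/log/journal (relative to the
# transfer root, because of the leading slash) is skipped; the rest copies.
T=/tmp/lq-exclude-demo
rm -rf "$T"
mkdir -p "$T/src/var/log/journal" "$T/src/etc"
echo keep > "$T/src/etc/keep.txt"
echo skip > "$T/src/var/log/journal/skip.txt"

rsync -a --exclude=/var/log/journal "$T/src/" "$T/dst/"

ls "$T/dst/etc"        # keep.txt was copied
ls "$T/dst/var/log"    # empty: journal/ was excluded
```

Note the leading slash anchors the pattern at the transfer root, so an unrelated var/log/journal deeper in the tree would not be caught; a bare --exclude=journal would match anywhere.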
Last edited by suicidaleggroll; 12-03-2015 at 11:57 AM.