Linux - General: This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.
I have a backup script that syncs the entire filesystem on my laptop to a backup system on the LAN. It runs every night, and as soon as the rsync finishes, it renames the backup directory to <name>.complete and updates a "latest" symlink to point to it. When it runs the next time, it grabs the target of the symlink (the previous complete backup) and uses that as the argument to --link-dest.
The goal is that I have daily incremental backups with a symlink that always points to the most recent one. Any backups that fail part way through are left on the backup machine, but aren't renamed to .complete, don't get the symlink updated, and aren't used as the source for --link-dest on the next backup.
This backup system has been working flawlessly every night for the last ~3 months.
For the last three days, however, the backups have not completed. Today I decided to run the backup interactively to see what was happening, and oddly it simply stopped during the transfer as soon as it got to /var/lib/mlocate/mlocate.db. I went back and looked at the previous three backup attempts, and sure enough all of them got to mlocate and then stopped. As soon as rsync prints out that filename (using -v flag) all activity stops, the file does not get created/hardlinked on the destination, and the rsync process on the destination quits. It then sits there indefinitely, the rsync process on the laptop still running, the ssh connection on both sides still running, but CPU usage stuck at 0, no activity, and no mlocate.db gets created on the destination.
The file is only 8 MB and permissions are normal. Fearing some kind of corruption, I copied the file to my home directory on the laptop, which went fine, I re-ran updatedb, and ran locate a few times, all behaved fine. I then went into the backup machine and deleted mlocate.db from the most recent complete backup (the one that it's been trying to use with --link-dest), thinking that it was getting stuck on the comparison and without that file it would just copy it over from the laptop from scratch, but this changed nothing.
What could be going on? Why does it keep stopping at this exact file every time? Disk usage at the destination is only 20%, inode usage is 2%.
The laptop is on OpenSUSE Tumbleweed, the backup system is CentOS 7, both up to date.
A couple of thoughts / things I would try to debug further:
First, confirm that the issue is really related to that specific file.
--exclude the file in the script to see if rsync still gets stuck and, if so, whether it stops right before or right after the point where that file's turn would have come.
Alternatively, to test the same thing, you could manually rsync only that file (using the same rsync options, obviously).
If the above confirms that this file is causing the issue, you could consider just excluding it; a backup of the mlocate database has no value imho (you can always rebuild it if needed, and when you need to recover data from the backup you probably won't want to overwrite your current database with one from the backup anyway).
If you want to analyze further just for the sake of understanding (personally, I guess I would), you could run a watch on lsof for the file while rsyncing, to see if another process is somehow locking it. (I'd watch on both machines, assuming you actually can on the NAS.)
And... sorry for stating the obvious... have you tried rebooting (both machines)?
Lastly: A side comment: I am backing up my laptops and my wife's laptop with a similar script. I have had good success running the cronjob hourly, not daily, for the last ~ 3.5 years. Due to the high frequency each individual run only takes a couple of seconds and I don't have to make sure the laptops are running at specific times / over night...
And that's the end of it. When it got to that point both ssh and rsync on both sides were still running, but after about 10-20 seconds the rsync process on the backup machine exited. Now it's just sitting there, rsync and ssh running on the laptop, ssh running on the backup machine, with no activity and no output after the above. /var/lib/rpm/Basenames is ~12 MB, should take less than a second to transfer.
I've rebooted both machines since this problem started a couple of days ago.
Last edited by suicidaleggroll; 11-10-2015 at 02:58 PM.
Weird. Sorry for re-asking... have you tried rebooting? This all sounds a bit like stuck processes or similar. Maybe an old ghost instance of rsync is still hanging around? (By the way, I recommend using lock files for such scripts; I can post code if you are interested.)
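Since lock files came up: a common sketch uses flock(1) from util-linux, holding an exclusive lock on a spare file descriptor for the duration of the run. The lock path here is a hypothetical choice; pick one per script.

```shell
#!/bin/sh
# Guard a cron'd backup script against overlapping runs with flock(1).
LOCK=/tmp/lq-backup.lock     # hypothetical lock path

(
    # Try to take an exclusive lock on fd 9; bail out immediately if a
    # previous instance still holds it, rather than queueing behind it.
    flock -n 9 || { echo "another backup is already running" >&2; exit 1; }

    # ... the actual rsync call would go here ...
    echo "backup running with lock held"
) 9>"$LOCK"
```

The lock is released automatically when the subshell exits, even if the script crashes mid-run, which is the main advantage over hand-rolled "touch a pidfile" schemes.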
Do you have other machines you could use to test to narrow things down? E.g. test from another laptop with the NAS, and test from your laptop to another machine?
What network link are you using? If it's a wifi connection, consider testing ethernet.
Yeah both machines have been rebooted, today actually. There aren't any old rsync processes, there was when I got up this morning, but I shut it down and then rebooted the laptop, then rebooted the backup machine, then tried to re-run the backup by hand.
I could rsync from the laptop to the main server, maybe I'll try that tonight.
It's on wifi, but the connection is stable. I'm actually doing all of this work remotely. If the connection was hanging, then my ssh session where I'm monitoring the progress would be hanging too. Might as well plug it in when I back up to the other machine anyway though, it'll run a little faster.
I just went back and looked at my logs too. 11/7 at 2am was the last successful backup, and 11/7 at 6pm I did an update (standard zypper up) which included a few hundred packages. Neither rsync nor ssh were in the list of updates though, and neither were any of the libraries that are used by rsync, as near as I can tell. It did include a kernel update, among other things (4.2.3-1 to 4.3.0-1).
Last edited by suicidaleggroll; 11-10-2015 at 03:24 PM.
Lovely...I add strace to try and track down the problem, and it fixes the problem.
Six backup attempts in a row that all fail at the exact same point (except that one where I removed mlocate.db, where it failed a few files later) with the exact same symptoms. Then I add an strace to the call, and it runs to completion. I'm back to the original script now with no changes, and it still runs just fine. I can't reproduce the problem. Six successful backups in a row now.
You know when you're writing some software and you run into a seg fault, so you add some print statements to track it down, and it stops seg faulting? I have a coworker who jokingly refers to those as "load-bearing print statements". Apparently that was a load-bearing strace...
Interesting indeed. Glad to hear your problem is solved, although the solution is a bit unsatisfying. One last thing you may want to do is run a memtest and an extended self-test of the hard drives with smartmontools. When a system behaves that weirdly, it rings a bell in my head that the problem might be hardware-related.
It sounds like editing and saving the script solved the problem. Could it be that the script got corrupted in some not-so-obvious way, like a bad byte that doesn't display, and the editor corrected it?
If your backups over the last few days succeeded in transferring the script file, you might be able to discover the corruption by comparison (using cmp or md5sum) of the script file to its backed-up versions over the last several days.
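That comparison takes only a few lines of shell; here is a self-contained sketch, with dummy files (hypothetical paths) standing in for the script and its backed-up snapshots:

```shell
#!/bin/sh
# Checksum the current script against the copy in each backup snapshot.
# Dummy files stand in for the real script and snapshot directories.
T=/tmp/lq-cmp-demo
rm -rf "$T"
mkdir -p "$T/backups/day1" "$T/backups/day2"
printf 'echo hello\n' > "$T/backup.sh"              # current script
printf 'echo hello\n' > "$T/backups/day1/backup.sh" # identical copy
printf 'echo hellp\n' > "$T/backups/day2/backup.sh" # one bad byte

cur=$(md5sum < "$T/backup.sh")
for f in "$T"/backups/*/backup.sh; do
    if [ "$(md5sum < "$f")" = "$cur" ]; then
        echo "match:   $f"
    else
        echo "DIFFERS: $f"
    fi
done
```

Any "DIFFERS" line pinpoints the snapshot where the file changed; cmp -l on that pair would then show exactly which byte differs.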
Actually I think I have an idea what the problem was (not exactly, but it makes more sense than a load-bearing strace).
I added the strace to the rsync command in the script, then ran the script in a screen session. It hung like the previous ones did, but unfortunately screen screwed up and wouldn't let me scroll back to see the relevant output, just the last couple of lines that said it had timed out. No worries, I just Ctrl+C'd the rsync like normal, then ran the backup script again in the foreground so I could actually see the output. This time it ran to completion (along with all subsequent runs).
I was thinking that adding the strace fixed it, but what I think happened is the first one ran and hung, and when I hit Ctrl+C I killed rsync and then strace finished normally. The script saw the 0 exit code from strace, assumed rsync finished correctly, and renamed the directory and updated the symlink as if the backup completed successfully. The next time I ran the script it used that one as --link-dest, rather than the backup from 11/7, and when it reached the point where the previous one had hung it just copied over everything from the laptop fresh (since it was hanging in /var/lib, it was basically at the end of the backup anyway, since rsync runs alphabetically).
So I think there was some file in the 11/7 backup that was corrupt, and rsync was crashing/hanging whenever it tried to access it. I guess the problem file was just a few files ahead of what it was printing...maybe rsync does some read-ahead to accelerate the transfer? So deleting the file it last printed before hanging didn't change anything, but maybe if I had just deleted everything after that file alphabetically it would have been fine.
If I get a bee in my bonnet I might switch the symlink back to the 11/7 backup and do more investigation into what file exactly was triggering the problem.
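One way to harden a script against exactly this failure mode (a wrapper such as strace, timeout, or a flaky screen session masking rsync's real exit status) is to key the rename on a sentinel file written only by a clean inner exit, rather than on whatever exit code reaches the script. A minimal sketch, with `true` standing in for the rsync call and hypothetical paths:

```shell
#!/bin/sh
DEST=/tmp/lq-sentinel-demo
rm -rf "$DEST" "$DEST.complete"
mkdir -p "$DEST"

# "true" stands in for the real rsync invocation. The sentinel is written
# only if the inner command itself succeeds, so the exit status of any
# wrapper around it can't fake a success.
true && touch "$DEST/.rsync-ok"

# Rename to .complete only if the sentinel proves a clean rsync exit.
if [ -e "$DEST/.rsync-ok" ]; then
    rm "$DEST/.rsync-ok"
    mv "$DEST" "$DEST.complete"
fi
```

With this, an interrupted run under strace would leave the directory without a sentinel, and it would never be renamed or used as --link-dest.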
Interesting. I guess if I were in your shoes, I'd run a SMART self-test on the hard drive in the NAS.
Also, you could probably cp -rv the 11/7 backup to /dev/null and hope to see a read error on the corrupt file.
This is a very interesting thread.
I do think this could be true - it makes sense - in fact I'd be surprised if it didn't.
Quote:
maybe rsync does some read-ahead to accelerate the transfer
Having read the entire thread, I do think that it's most likely corrupt file blocks in the (src) target files, not the script itself.
I would crank up the smartmon tool and take a look.
It is an old drive, but no reallocated sectors or anything else out of the ordinary.
Code:
# smartctl -a /dev/sdc
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-229.20.1.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Hitachi Deskstar 5K4000
Device Model: Hitachi HDS5C4040ALE630
Serial Number: PL2311LAG1G9PC
LU WWN Device Id: 5 000cca 22ec0aa19
Firmware Version: MPAOA3B0
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5700 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Thu Nov 12 08:57:04 2015 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART STATUS RETURN: incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (44188) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 737) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 134 134 054 Pre-fail Offline - 103
3 Spin_Up_Time 0x0007 169 169 024 Pre-fail Always - 424 (Average 401)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 38
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 113 113 020 Pre-fail Offline - 42
9 Power_On_Hours 0x0012 097 097 000 Old_age Always - 24341
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 35
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 719
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 719
194 Temperature_Celsius 0x0002 125 125 000 Old_age Always - 48 (Min/Max 18/63)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 10
I don't like the high temps (I always like to keep my spinning drives under 40 C), but it's an external drive in a sealed box with no fan, so there's not a lot I can do there.
Last edited by suicidaleggroll; 11-12-2015 at 10:01 AM.
I guess I'd add some extra code or write a simple monitor to keep an eye on the rsyncs and see if it starts occurring again.
It may(!) be that the disk is in the very early stages of dying, so basically ok but starting to lose integrity/reliability.
If it does happen (more than once), you could try backing up everything (manually, of course) and then re-running mkfs.blah on that disk.
I would then keep an eye on it, e.g. with a cron'd smartmon check or some such.
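For the recurring SMART check, one option is to let smartd do the scheduling itself; a hypothetical /etc/smartd.conf line for the drive above might look like the following (device name, schedule, and mail target are assumptions; see smartd.conf(5) for the directive syntax):

```
# Monitor /dev/sdc (-a = all attributes), run a long self-test every
# Sunday between 03:00 and 04:00, and mail root on failures.
/dev/sdc -a -s L/../../7/03 -m root
```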
Well it did it again, a few days and ~8 successful backups since I "fixed" it before. After I "fixed" it before I also moved the backups to a brand new drive, and it was on that drive that it failed this time, so I think drive failure can be safely eliminated.
I have a feeling it's related to ACLs on the files in /var/log/journal (just a few files after /var/lib, where it was stopping before). Since day 1 it's been printing out ACL warnings during the backup of those files (mostly harmless - "unable to set ACL permissions" or "transfer ACL permissions" or something along those lines); maybe something with the ACLs was locking up the read on the destination when it checked the --link-dest?
I don't need those logs backed up anyway, so I excluded them from the backup which stopped the warnings, we'll see if the hangups come back.
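For anyone following along, the anchored exclude semantics can be verified locally; a sketch with throwaway directories (hypothetical paths), mirroring the --exclude=/var/log/journal a script like this would use:

```shell
#!/bin/sh
# Demonstrate an anchored --exclude: /var/log/journal (relative to the
# transfer root, because of the leading slash) is skipped; the rest copies.
T=/tmp/lq-exclude-demo
rm -rf "$T"
mkdir -p "$T/src/var/log/journal" "$T/src/etc"
echo keep > "$T/src/etc/keep.txt"
echo skip > "$T/src/var/log/journal/skip.txt"

rsync -a --exclude=/var/log/journal "$T/src/" "$T/dst/"

ls "$T/dst/etc"        # keep.txt was copied
ls "$T/dst/var/log"    # empty: journal/ was excluded
```

Note the leading slash anchors the pattern at the transfer root, so an unrelated var/log/journal deeper in the tree would not be caught; a bare --exclude=journal would match anywhere.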
Last edited by suicidaleggroll; 12-03-2015 at 11:57 AM.