LinuxQuestions.org
Old 09-20-2011, 06:06 PM   #16
lpallard

Here we go... first major problem: the first trial having completed successfully, I continued experimenting with a second drive and a second rsnapshot config file... in other words, a totally different setup.

All went fine until rsync said there was no space left on the device... and then aborted.

What confuses me is the following:

Size of the partition to be backed up to the external drive (source partition):

Code:
df -H
Filesystem             Size   Used    Avail    Use%    Mounted on
/dev/sdd5              751G   162G    590G     22%     /mnt/mass-storage
And the size of the external drive's partition (destination partition):

Code:
df -H
Filesystem             Size   Used    Avail    Use%    Mounted on
/dev/sdg5              251G   251G    27M      100%    /mnt/backups
At first I thought the 162G of data would easily fit in the 251G partition, so I started the script and was not expecting rsync to run out of space. But it did, and when I summed the file sizes in the source partition I did not get the 162G reported by "df -H" but 236.6GB... Still, that *should* fit in the 251GB destination mount point.
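(For reference, a quick way to compare the two views; the mount point is the one from above:)

Code:
# Filesystem-level view: includes metadata and the reserved blocks
df -H /mnt/mass-storage
# Sum of apparent file sizes, roughly what a file manager adds up
du -sh --apparent-size /mnt/mass-storage
# Disk blocks actually allocated
du -sh /mnt/mass-storage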

So 2 questions:

Why is df -H not reporting the proper stats?

Why is rsync running out of space, i.e. why doesn't 236.6GB fit in 251GB?

What's going on?! I'm speechless...

Last edited by lpallard; 09-20-2011 at 06:10 PM.
 
Old 09-20-2011, 06:53 PM   #17
Woodsman
Original Poster
I forget the make and model number of the removable drive bay I use, but the green power LED is on all the time, even after I power down the drive. The lsscsi command says the drive is no longer available. I have been removing my drives this way for a few years now with no known adverse effects.

The difference between 236.6 GB and 251 GB is roughly the 5% that ext3 partitions reserve for emergency use by default.
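If you want to see or shrink that reservation (a sketch only, assuming the destination really is ext3; the partition is the one from your df output):

Code:
# Show the reserved block count
tune2fs -l /dev/sdg5 | grep -i 'reserved block count'
# Reduce the reservation to 1% on a partition that only holds backups
tune2fs -m 1 /dev/sdg5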

The partition might be using 162 GB, but the file system, which is not the same thing, can include other mount points, and those get copied too. For example, if you have a separate partition mounted at /home, and you tell rsnapshot, which uses rsync, to copy /, then /home gets copied too even though the / partition might only be 162 GB. You have to create an exclusion list if you want to exclude such mount points.
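A minimal sketch of such an exclusion setup (file names are only examples; note that rsnapshot.conf fields must be separated by tabs):

Code:
# /etc/rsnapshot.conf  (fields separated by tabs)
exclude_file	/etc/rsnapshot.exclude

# /etc/rsnapshot.exclude  (one rsync pattern per line)
/home/*
/proc/*
/sys/*
/mnt/*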
 
Old 09-20-2011, 07:53 PM   #18
lpallard
Yeah, I picked up on the 5% difference just after I wrote my last post here... At least I was on the right track!

I just ran reiserfsck manually and it said my tree was damaged and the check needed to be re-run with the --rebuild-tree argument, which I did. It's still running as we "speak"...

That might explain the problems... Once reiserfsck is done, I will retry the script and see if it works.

What I am wondering is how the check performed in your script did not pick up this problem?
 
Old 09-20-2011, 08:09 PM   #19
Woodsman
Original Poster
Which problem?
 
Old 09-20-2011, 09:10 PM   #20
lpallard
I forgot to mention that on the first run, reiserfsck checked the destination HDD and did not report anything unusual... Then I performed the backup, it failed because the HDD was full, and a manual check with reiserfsck picked up the corrupted inode tree... I repaired it with --rebuild-tree and am re-trying the backup right now...

What I meant by "problem" is the corrupted tree not picked up by reiserfsck at first. I wonder whether the corruption actually happened *during* the backup...? Maybe a bad drive?

Let's see what happens next!
 
Old 10-07-2011, 06:38 AM   #21
lpallard
Hey Woodsman,

I played around a bit more with your scripts and I am still having space problems on the destination drive. However, I tried to troubleshoot a bit and found some interesting facts:

If I "rsnapshot" from a partition, folder, mnount point, whatever you want to call this, that is simply a collection of folders & files, then if the source is smaller than the destination, all qill be fine.

On the other hand, if the source is composed of already "rsnapshot'ed" files, it will not work, even if df -H or the folder properties (right click) say the source is smaller than the destination.

Some examples to illustrate what I am saying:

Scenario #1:

Source:

/mnt/test/
------|-file1
------|-file2
------|-file3
------|-file4
--------------------
Total: 10GB

Destination:
/mnt/hotswaphdd/
--------------------
Total: 10+GB

Will work.

Scenario #2:

/mnt/test/
------|-weekly.0
----------|-file1
------|-weekly.1
----------|-file1
------|-weekly.2
----------|-file1
------|-weekly.3
----------|-file1
--------------------
Total: 10GB (as per the XFCE right click option on the folder /mnt/test/ & also from df -H)

Destination:
/mnt/hotswaphdd/
--------------------
Total: 10+GB

Will NOT work.

To me, it looks like even though the OS & file utilities report a certain folder size, when (as in the second scenario) the folders are rsnapshots of another FS, rsnapshot copying them to a different location (the hotswap HDD) does not understand that the source is mostly hard links and not separate files...

In my case, weekly.0, .1, .2... are all around 120GB each. The destination is 1 terabyte. So 120GB x 4 = 480GB (roughly), which *should* fit in a 1TB partition, but when I run rsnapshot to copy these 4 weekly.* folders to a hotswap HDD, the destination partition fills up completely and rsnapshot aborts...

In conclusion, there is something weird in the way rsnapshot backs up a backup...
 
Old 10-07-2011, 07:15 AM   #22
Alien Bob
Slackware Contributor
I think your copy action failed to preserve the "hard links" which rsnapshot uses to create incremental system backups that take up almost no space at all. I do not think that the "cp" command is able to preserve hard links at all. And the "rsync -a" command does not preserve hard links either, unless you add a "-H" parameter.
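A sketch of the difference, using the example paths from the previous post:

Code:
# Without -H, every hard-linked copy is transferred and stored as a separate file
rsync -a /mnt/test/ /mnt/hotswaphdd/
# With -H, the links are preserved and the data is stored only once
rsync -aH /mnt/test/ /mnt/hotswaphdd/
# Quick check: identical inode numbers mean the hard link survived the copy
ls -li /mnt/hotswaphdd/weekly.0/file1 /mnt/hotswaphdd/weekly.1/file1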

Eric
 
Old 10-07-2011, 11:48 AM   #23
Woodsman
Original Poster
Quote:
In my case, weekly.0, .1, .2... are all around 120GB each. The destination is 1 terabyte. So 120GB x 4 = 480GB (roughly), which *should* fit in a 1TB partition, but when I run rsnapshot to copy these 4 weekly.* folders to a hotswap HDD, the destination partition fills up completely and rsnapshot aborts...
In conclusion, there is something weird in the way rsnapshot backs up a backup...
Rsnapshot is widely used, and if it were faulty, many people far more attuned and smarter than me would have found those flaws by now.

I'll take a guess at what you are trying to do.

You mentioned copying 4 weekly.* folders to your hot swap drive. If you are trying to duplicate my layered backup process, then I presume you are talking about backing up, to your hot swap drive, the file system of backups that is created every three hours.

Those 3-hour backups are performed automatically through cron. I use rsnapshot to back up certain important user and system files to a second internal hard drive. The only purpose of those backups is to protect me from myself when I tinker and goof. I like to tinker, and yes, tinkering means needing backups to repair the goofs. I have used those backups many times to restore files.

My manual weekly backup with a hot swap drive is a traditional full system backup (minus files and directories in my exclusion list). Like the 3-hour backups, I use rsnapshot for those weekly backups.

I use a 750 GB drive to store my weekly backups. I have 26 weeks of weekly backups of my primary system, which has two 320 GB drives running at about 60% capacity, plus 6 months of monthly backups for my remaining 3 systems. That is a lot of backups. Of course, I am able to do this only through the magic of rsnapshot, which uses rsync and hard links.

With that background explanation, here is the key to where I think you are experiencing problems: in my weekly backup I never back up those 3-hour backups. The entire file system containing those 3-hour backups is part of my exclusion list for my manual weekly backups.

I never tried to copy my 3-hour backups to another drive. I don't want to because those 3-hour backups serve a unique purpose.

I don't know how rsnapshot works when trying to back up rsnapshot file systems, which is what I think you are trying to do. Rsnapshot is designed to back up normal file systems using rsync and hard links, not to back up or copy other rsnapshot file systems.

If I have guessed correctly about what you are trying to do, then the solution is simple: don't back up the 3-hour backups when you perform the manual weekly backup to the hot swap drive. Exclude those files.

I use text files to create my exclusion lists for all of my rsnapshot backups.
 
Old 10-08-2011, 05:00 PM   #24
lpallard
Quote:
I don't know how rsnapshot works when trying to back up rsnapshot file systems, which is what I think you are trying to do.
Couldn't be more accurate. This is *exactly* what I am trying to do. You see, I use rsnapshot to back up important files on a weekly basis, namely from my laptop, which automatically mounts an NFS share, rsnapshots to my server, and unmounts the NFS share. This is done via cron every Sunday night at 8 PM. I keep 4 weekly backups, which gives me enough "back in time" functionality if I realize I deleted something accidentally. So far (2 years) it has worked flawlessly.
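(For the record, the cron side is nothing fancy, something along these lines; the script name is made up for illustration:)

Code:
# crontab entry: every Sunday at 20:00, mount the share, snapshot, unmount
0 20 * * 0    /usr/local/sbin/weekly-rsnapshot.sh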

What I am trying to do with your hotswap script is to back up the NFS share (where my laptop's fs snapshots are stored) to a hotswap HDD. This is to protect against catastrophic loss of my data (the HDD where the NFS mount is located dies, FS corruption, my server blows up, aliens steal my server, etc.) while my laptop simultaneously dies. A very remote probability, but having an offline/unpowered copy of my data reassures me. Plus, if anything happens, I can un-dock the HDD from the hotswap enclosure and run with it.

I too have the feeling rsnapshot gets confused as hell backing up a backup... somehow it must re-copy the file every time it sees a hard link pointing to it... So the source might be

weekly.0
120GB
weekly.1
8.4GB
weekly.2
7.6GB
weekly.3
14GB

it will end up being

weekly.0
120GB
weekly.1
120+GB
weekly.2
120+GB
weekly.3
120+GB

...
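(A quick way to check whether that is really what happened, a rough sketch with GNU du; the path is a placeholder:)

Code:
# du counts hard-linked files only once across all the directories it is given
du -shc /path/to/snapshots/weekly.*
# -l (--count-links) counts every link separately, i.e. what a broken copy would occupy
du -shcl /path/to/snapshots/weekly.*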

Maybe I need to rethink my strategy? Maybe I'm paranoid?
 
Old 10-08-2011, 05:05 PM   #25
lpallard
On second thought, if I were to reduce the extent of the backup and drop 3 folders out of the 4 backups, should I drop everything except weekly.0, or the opposite? Or is there some black magic involving all 4 backups to get my files back from this kind of backup? I've never had to restore from rsnapshot (good for me!)
 
Old 10-08-2011, 07:11 PM   #26
Woodsman
Original Poster
Quote:
Maybe I need to rethink my strategy?
There is a link_dest option among the rsnapshot configuration options, which corresponds to rsync's --link-dest. Make sure that option is enabled in the configuration file.

Possible solution: Create a different rsnapshot configuration file for dumping the rsnapshot backups to your hot swap drive. (I use three different configuration files for my layered backup strategy.) Then use the rsync_short_args option, through which you can add the -H option to maintain hard links. The default for rsync_short_args is -a (-rlptgoD), which does not include -H.

If you are dumping the backups monthly, then set this new configuration file to only use intervals named monthly (retain monthly).
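A sketch of what such a second configuration file might contain (paths and values are only examples, fields separated by tabs):

Code:
# /etc/rsnapshot-hotswap.conf
snapshot_root	/mnt/hotswaphdd/
link_dest	1
rsync_short_args	-aH
retain	monthly	6
backup	/mnt/nfs-snapshots/	laptop/
# Run it as: rsnapshot -c /etc/rsnapshot-hotswap.conf monthly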

I have restored files from rsnapshot backups. Probably two or three times a year. I also use the backups to compare past versions of various files. All I do is select the desired rotation directory from which I want to restore or compare. Has always worked.
 
Old 10-09-2011, 10:15 AM   #27
lpallard
Woodsman,

link_dest seems to have fixed the problem. I also eliminated monthly.1, .2 & .3 to back up only .0.

Now I am having a strange issue, and I believe you might have had the same problem, since this issue appeared right at the beginning when I used your script. Maybe a bug or glitch in lsscsi?

After the backup is done, the script asks:

Code:
Do you want to remove the backup drive from the SCSI list? (y/n): y
If I answer Yes or No right away, no problems. If I am not around when the backup ends and the script waits for a long period at the question above, then when I answer Yes, it says:

Code:
/dev/sda5 does not seem to be connected!
But it is indeed there:

Code:
root@lhost2:~# lsscsi
[1:0:0:0]    disk    ATA      ST31500541AS     CC34  /dev/sda 
[4:0:0:0]    disk    ATA      ST3320620AS      3.AA  /dev/sdb 
[5:0:0:0]    disk    ATA      WDC WD3200AAKS-7 12.0  /dev/sdc 
[6:0:0:0]    disk    ATA      WDC WD10EADS-00L 01.0  /dev/sdd 
[7:0:0:0]    disk    ATA      WDC WD7500AYPS-0 02.0  /dev/sde 
[8:0:0:0]    disk    ATA      Hitachi HDS5C302 ML6O  /dev/sdf 
[9:0:0:0]    disk    ATA      ST32000542AS     CC34  /dev/sdg
To "disconnect" the drive from lsscsi, I need to run the rmbackup script manually (also note the sda is gone from the list):

Code:
Attempting to remove /dev/sda from the SCSI list.

/dev/sda

DEVICE=/dev/sda
HOSTS=1
CHANNELS=0
IDS=0
LUNS=0

Removing /dev/sda from SCSI list...
Continue? (y/n): y

Continuing.

Checking mount status...
Oops! Found /mnt/backup mounted at /dev/sda5. Unmounting...

This might take a few seconds...
Sun Oct  9 11:17:57 EDT 2011
Synchronizing disk cache.
Stopping (spinning-down) /dev/sda.
    Start stop unit command: 1b 01 00 00 00 00


[4:0:0:0]    disk    ATA      ST3320620AS      3.AA  /dev/sdb 
[5:0:0:0]    disk    ATA      WDC WD3200AAKS-7 12.0  /dev/sdc 
[6:0:0:0]    disk    ATA      WDC WD10EADS-00L 01.0  /dev/sdd 
[7:0:0:0]    disk    ATA      WDC WD7500AYPS-0 02.0  /dev/sde 
[8:0:0:0]    disk    ATA      Hitachi HDS5C302 ML6O  /dev/sdf 
[9:0:0:0]    disk    ATA      ST32000542AS     CC34  /dev/sdg 

The /dev/sda may be powered off.
The device cannot be restored to the scsi list without cycling power.

Done.
And I can safely power off the drive. Have you had this problem before?
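For reference, the removal step boils down to roughly this (a sketch only, not the actual rmbackup script; the device node is just an example):

Code:
sync                                    # flush pending writes
umount /mnt/backup                      # release the filesystem
sdparm --command=sync /dev/sda          # flush the drive's write cache
sdparm --command=stop /dev/sda          # spin the drive down (START STOP UNIT)
echo 1 > /sys/block/sda/device/delete   # drop the device from the kernel's SCSI list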

Last edited by lpallard; 10-09-2011 at 10:20 AM.
 
Old 10-09-2011, 12:35 PM   #28
Woodsman
Original Poster
Quote:
If I answer Yes or No right away, no problems. If I am not around when the backup ends and the script waits for a long period at the question above, then when I answer Yes, it says:
There are no timeout options in the script. The prompt should stay there forever. Many times I have not been physically at the computer when the backup completes. I never had the problem you describe.

Quote:
/dev/sda5 does not seem to be connected!
That message is from the rmbackup script, which is called from the backup script and removes the device from the scsi list.

The message you received is for a partition (sda5) and not a device (sda). Check the contents of the $BACKUPDRIVE variable and why you are testing for a partition rather than a device. The rmbackup script expects a drive model, not a partition; for example, WD7500AAKS. I suspect that when you modified the script for your needs you substituted a drive partition rather than a drive model.

The rmbackup script sources that variable from a separate script library (read the comments a few lines earlier). Your system is not set up exactly the same way. Modify the rmbackup script to use the correct drive model rather than a partition.

These scripts are designed for hot swapping drives. The device node of the disk changes depending upon when the drive is powered on. For example, on my system, when the drive is powered on during a reboot, the drive will be assigned the /dev/sdc device node. When I power on the drive otherwise, normally the device node is /dev/sde. Yet that will change when I power up the drive just after removing a different swappable drive or concurrently have USB flash drives inserted. Then the device node might be /dev/sdf or /dev/sdg. Hence I use a drive model rather than a hard-coded device and partition.
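As a sketch (the model string is only an example), the current device node can be looked up from the model at run time:

Code:
# lsscsi prints the device node in the last column
MODEL="WD7500AAKS"
DEVICE=$(lsscsi | grep "$MODEL" | awk '{print $NF}')
echo "Backup drive is currently at $DEVICE"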

As I mentioned previously, the script could be modified to compare the lsscsi list before and after powering on a drive to determine the new drive's device node. There are other ways the scripts could be made more robust and universal. Some day I might update the scripts to do that, but not any time soon (too many other things to do). For now the script expects a drive model and not a partition or device node.
 
Old 10-09-2011, 03:08 PM   #29
lpallard
Exactly: there is no timeout in the script.

Quote:
That message is from the rmbackup script, which is called from the backup script and removes the device from the scsi list.
I agree.

I also admit to having changed your scripts quite a bit, to use block IDs (for example 869858df-0cb5-4f7c-a36e-6628c7e27488) instead of drive models, because the moment I had two identical drives connected at the same time, the script panicked. I have 12 hard drives in that server. Most of them are Seagates & Hitachis (I hate WD), so it's bound to happen that two drives with identical model numbers will be online at the same time. The strategy of using model numbers did not work well with my current layout.
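For what it's worth, resolving the device node from a UUID is a one-liner (using the UUID quoted above):

Code:
# Either of these prints the device node that carries the given UUID
blkid -U 869858df-0cb5-4f7c-a36e-6628c7e27488
readlink -f /dev/disk/by-uuid/869858df-0cb5-4f7c-a36e-6628c7e27488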

Code:
/dev/sda5 does not seem to be connected!
I edited that output. In fact, since I am using blkid instead of model #'s, I assigned a label to every partition and use this label to represent the drive in the script (for operator interaction only, not in the actual logic of the code). When I edited the output to post it here, I screwed up and added a "5" which in reality is not there.

Should have been

Code:
Media Storage Backup does not seem to be connected!
Or for the machine

Code:
/dev/sda does not seem to be connected!
Quote:
I suspect when you modified the script for your needs you substituted a drive partition rather than drive model.
You guessed right Sir!

Quote:
The device node of the disk changes depending upon when the drive is powered on.
I think this is where the problem lies. Somehow the device node must disappear, or something else must happen, for the script to lose the drive... Not a big deal, since it occurs rarely and without great consequences. When it happens, I manually run the rmbackup script and this one picks up the drive and removes it.

Quote:
Hence I use a drive model rather than a hard-coded device and partition
Good logic if you don't have similar drives and also if you are using whole drives instead of partitions.

I'll keep playing with it and see what happens... Thanks Woodsman for your hard work!

Last edited by lpallard; 10-09-2011 at 03:10 PM.
 
Old 10-09-2011, 10:00 PM   #30
lpallard
I think I know what happened when the destination drives were getting full... I've found that directories and files with spaces in their names were copied strangely to the destination drive...

On the source drive, the directory was 100GB, but on the destination drive it was 179GB... Same number of files and folders.

Are you aware of any problem with rsync copying source items that have spaces in their filenames?

EDIT: Well, I have confirmed my suspicion above regarding the potential glitch or bug with rsync. I just tried to simply copy the data to the destination drive with "cp -r", nothing fancy. It worked as expected. So I conclude it's a problem or a bug with rsync...

Last edited by lpallard; 10-09-2011 at 11:18 PM.
 
  

