LinuxQuestions.org


Karderio 02-09-2010 04:44 AM

Files seem to take up more space in destination after rsync copy
 
I recently purchased an external hard drive to back up my home partition. My PC has a "1.5T" drive with several partitions on it, containing OSes and the home partition. The home partition is 1.3T according to df; the external drive contains one partition that spans the entire disk, which df reports as 1.4T. Both partitions are ext3.

When I use rsync to copy files from the home partition to the external partition, the external disk fills up, even though the destination is supposedly larger than the source. I don't understand why copying files from one partition to a slightly bigger one should need more space than they take on the source. Does anyone know what is happening?


Details :

I created the partition on the external drive with gparted; gparted reported that it already had several gigabytes of used space immediately after the partition was created. I thought at the time that this must be normal.

The home partition contains many files of all sorts, including lots of big audio and video files. If you are wondering, for all my important files this external disk is only secondary backup, as they are also backed up to the "internet".

These are the mount points :
/mnt/tmp/ : home partition, /dev/sdb6
/mnt/external/ : external partition, /dev/sdc1

I used rsync to copy the files. I know there are more efficient ways to do this, but I wanted to use the same command that I will subsequently run to sync the backup.
rsync -av --progress --stats --recursive --perms --links --delete /mnt/tmp/ /mnt/external/

Next I tried adding the --sparse switch, as I was wondering if the problem might come from sparse files. I don't know, however, whether rsync goes back and shrinks files that were already copied when the switch is simply added and the command re-run. I also added --one-file-system, for good measure. Here is what I ran next :
rsync -av --progress --stats --sparse --one-file-system --recursive --perms --links --delete /mnt/tmp/ /mnt/external/

I tried an fsck on the home partition :
fsck -f /dev/sdb6

This is the output from the last rsync :
rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
rsync: write failed on "abcd.avi": No space left on device (28)
rsync error: error in file IO (code 11) at receiver.c(302) [receiver=3.0.6]
rsync: connection unexpectedly closed (27886 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]

Looking at the destination after a partial copy seems to indicate that the problem is not symbolic links being "expanded". I have not checked the source filesystem for sparse files, nor the destination to see if these files could be larger there, as this does not seem trivial.
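
Checking a tree for sparse files can be sketched in a few lines of Python. This is only a sketch: it assumes that `st_blocks` counts 512-byte units (true on Linux), and the names `looks_sparse` and `find_sparse_files` are illustrative, not part of any tool. A file is reported when it occupies less space on disk than its apparent length, which is the usual signature of holes.

```python
import os
import stat

BLOCK_UNIT = 512  # assumption: st_blocks is counted in 512-byte units (Linux)


def looks_sparse(apparent_size, blocks):
    """True when fewer bytes are allocated on disk than the file's length."""
    return blocks * BLOCK_UNIT < apparent_size


def find_sparse_files(root):
    """Yield (path, apparent_bytes, allocated_bytes) for sparse-looking files."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and looks_sparse(st.st_size, st.st_blocks):
                yield path, st.st_size, st.st_blocks * BLOCK_UNIT


# Example use with the source mount point from this thread:
# for path, apparent, allocated in find_sparse_files("/mnt/tmp/"):
#     print(apparent, allocated, path)
```

Filesystems that compress data or store tiny files inline can also report fewer allocated bytes than the apparent size, so hits are candidates rather than proof.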

Here is some additional info :

$ df /mnt/tmp/
Filesystem      1K-blocks       Used Available Use% Mounted on
/dev/sdb6      1415342836 1414173740    369096 100% /mnt/tmp

$ df /mnt/external/
Filesystem      1K-blocks       Used Available Use% Mounted on
/dev/sdc1      1442145212 1441851736    293476 100% /mnt/external


Thank you !

Karderio 02-09-2010 05:06 AM

The sparse file hypothesis
 
I just explored the sparse file possibility, and this does not seem to be the issue.

To discover sparse files in the source, I used a script from here :
http://forums13.itrc.hp.com/service/...readId=1065891

The Wikipedia article on sparse files explains how to distinguish between apparent and actual file sizes :
http://en.wikipedia.org/wiki/Sparse_file

So having identified a sparse file on the source, I ran :
# du -s -B1 --apparent-size '/mnt/tmp/chris/.openoffice.org/3/user/registry/cache/org.openoffice.Office.UI.WriterCommands.dat'
63475 /mnt/tmp/chris/.openoffice.org/3/user/registry/cache/org.openoffice.Office.UI.WriterCommands.dat
# du -s -B1 '/mnt/tmp/chris/.openoffice.org/3/user/registry/cache/org.openoffice.Office.UI.WriterCommands.dat'
69632 /mnt/tmp/chris/.openoffice.org/3/user/registry/cache/org.openoffice.Office.UI.WriterCommands.dat

Compared with the same file on the destination :
# du -s -B1 --apparent-size '/mnt/external/chris/.openoffice.org/3/user/registry/cache/org.openoffice.Office.UI.WriterCommands.dat'
63475 /mnt/external/chris/.openoffice.org/3/user/registry/cache/org.openoffice.Office.UI.WriterCommands.dat
# du -s -B1 '/mnt/external/chris/.openoffice.org/3/user/registry/cache/org.openoffice.Office.UI.WriterCommands.dat'
69632 /mnt/external/chris/.openoffice.org/3/user/registry/cache/org.openoffice.Office.UI.WriterCommands.dat

Identical. So I would say that sparse files are preserved, so my problem does not arise from this.
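
A note on reading these numbers: here the on-disk size (69632 bytes, which is 17 × 4096, assuming the usual 4 KiB block size) is larger than the apparent size (63475 bytes). That pattern suggests an ordinary, fully allocated file rounded up to whole blocks. A file with holes shows the reverse: its on-disk size is smaller than its apparent size. The sketch below creates such a file for comparison; `make_sparse_file` and `sizes` are illustrative helper names, and whether the hole really stays unallocated depends on the filesystem.

```python
import os
import tempfile


def make_sparse_file(path, length):
    """Create a file of `length` bytes with only its last byte written."""
    with open(path, "wb") as f:
        f.seek(length - 1)  # skip ahead without writing: leaves a hole
        f.write(b"\0")      # one real byte at the very end


def sizes(path):
    """Return (apparent_bytes, allocated_bytes), like du with and without --apparent-size."""
    st = os.lstat(path)
    return st.st_size, st.st_blocks * 512  # assumption: 512-byte units (Linux)


with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "holes.bin")
    make_sparse_file(demo, 10 * 1024 * 1024)
    apparent, allocated = sizes(demo)
    print("apparent :", apparent)   # 10485760
    print("allocated:", allocated)  # filesystem-dependent; small where holes are supported
```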

Karderio 02-09-2010 05:07 PM

Solved...
 
In actual fact, it would seem that the problem was sparse files after all.

I had quite a bit of trouble determining that this was actually the case, though. I ended up hacking together two scripts, and without the second I don't think I could have solved the issue without erasing the entire destination disk and starting anew.

I first tried a diff, to see what differed from the source to the destination :
Code:

diff -rq /mnt/tmp/ /mnt/external/
Let it be said that a diff on more than a terabyte of data takes a very long time; I stopped it after about five hours.

Next, I made a script to determine if the backup files were of a different size from the source files (and to see what files were missing from the backup) :

Code:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# Report files that are missing from the backup or whose apparent size differs.

import os
from os.path import join, getsize, exists

path1 = "/mnt/tmp/"        # source; the trailing slash matters for the slicing below
path2 = "/mnt/external/"   # backup

for root, dirs, files in os.walk(path1):
    for file in files:
        # Map the source path onto the corresponding path in the backup.
        mirror_path = join(path2, root[len(path1):], file)
        file_path = join(root, file)

        # Note: exists() and getsize() follow symlinks.
        if not exists(mirror_path):
            print(file_path + " exists.")
            print(mirror_path + " absent.")
        elif getsize(file_path) != getsize(mirror_path):
            print(file_path + " size : " + str(getsize(file_path)))
            print(mirror_path + " size : " + str(getsize(mirror_path)))

It seemed that all files had the same size in the source and in the destination, and that just a few were missing because there was no space left for them. I next swapped path1 and path2 to check that there were no extra files in the backup; there weren't.

So, I made a new script to compare the number of filesystem blocks used in the source and destination partitions :

Code:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# Report files whose allocated block count differs between source and backup.

import os
from os.path import join, exists

path1 = "/mnt/tmp/"        # source; the trailing slash matters for the slicing below
path2 = "/mnt/external/"   # backup

for root, dirs, files in os.walk(path1):
    for file in files:
        mirror_path = join(path2, root[len(path1):], file)
        file_path = join(root, file)

        if exists(mirror_path):
            # st_blocks counts allocated blocks, so holes in sparse files are excluded.
            src_blocks = os.stat(file_path).st_blocks
            dst_blocks = os.stat(mirror_path).st_blocks
            if src_blocks != dst_blocks:
                print(file_path + " blocks : " + str(src_blocks))
                print(mirror_path + " blocks : " + str(dst_blocks))

It turns out that some files used up a lot more blocks in the backup! It seems some files were sparse in the source, but not in the destination :-/
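
To judge how much of the missing space this explains, the per-file differences can be totalled. A minimal sketch, assuming `st_blocks` is in 512-byte units (true on Linux); `block_pairs` and `extra_bytes` are illustrative names, and symlinks are skipped rather than followed.

```python
import os


def block_pairs(src_root, dst_root):
    """Yield (src_blocks, dst_blocks) for regular files present in both trees."""
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.normpath(os.path.join(dst_root, rel, name))
            if os.path.islink(src) or os.path.islink(dst):
                continue  # leave symlinks out of the comparison
            if not (os.path.isfile(src) and os.path.isfile(dst)):
                continue  # missing in the backup, or not a regular file
            yield os.lstat(src).st_blocks, os.lstat(dst).st_blocks


def extra_bytes(pairs):
    """Sum the space the destination uses beyond the source, in bytes."""
    return sum((dst - src) * 512 for src, dst in pairs if dst > src)


# Example use with the mount points from this thread:
# total = extra_bytes(block_pairs("/mnt/tmp/", "/mnt/external/"))
# print(total / 2.0 ** 30, "GiB used only in the backup")
```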

So I modified the last script to delete the offending files from the backup and ran rsync again, and presto, now the source and the backup are just about the same size !
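
The modified script was not posted. A hypothetical reconstruction might look like the sketch below. The function names are illustrative, it only flags destination files that use more blocks than their source (the script above flags any difference), it skips symlinks, and nothing is deleted unless `dry_run=False` is passed. After deleting, rsync would presumably have to run with `--sparse` so that the files are recreated with their holes.

```python
import os


def inflated_mirrors(src_root, dst_root, get_blocks=None):
    """Yield destination files that occupy more blocks than their source."""
    if get_blocks is None:
        get_blocks = lambda path: os.lstat(path).st_blocks
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.normpath(os.path.join(dst_root, rel, name))
            if os.path.islink(src) or os.path.islink(dst):
                continue  # symlinks are left alone
            if not (os.path.isfile(src) and os.path.isfile(dst)):
                continue  # only regular files present on both sides
            if get_blocks(dst) > get_blocks(src):
                yield dst


def delete_files(paths, dry_run=True):
    """Return the given paths; delete them too when dry_run is False."""
    handled = []
    for path in paths:
        if not dry_run:
            os.remove(path)
        handled.append(path)
    return handled


# Example use (review the dry-run list before deleting anything):
# print(delete_files(inflated_mirrors("/mnt/tmp/", "/mnt/external/")))
```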

Remarks :

1/ If you use the above code, beware that it seems to have a few issues with symlinks.

2/ I really feel that all this was overly complex. Shouldn't rsync handle sparse files by default, or shouldn't adding the "--sparse" switch replace "regular" files in the destination with sparse files? (This may not be trivial to implement, mind you.) At the very least, the rsync docs could mention sparse files and the woes they can cause...

3/ The script executes in under five minutes, a lot quicker than a full diff...

4/ I tend to ramble... maybe nobody is interested in my problems, maybe googling this thread could help someone one day.
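
On remark 2: as far as I understand rsync's behaviour, `--sparse` only affects files as they are written, so a destination file that already matches the source in size and modification time is skipped and stays dense. The general idea of writing a file sparsely can be sketched as follows: copy in chunks and seek over chunks that are entirely zero instead of writing them. `sparse_copy` is an illustrative name, not part of rsync, and whether holes actually result depends on the filesystem.

```python
CHUNK = 64 * 1024  # copy granularity; an arbitrary choice for this sketch


def sparse_copy(src_path, dst_path, chunk_size=CHUNK):
    """Copy src to dst, leaving holes where the source holds runs of zero bytes."""
    zeros = b"\0" * chunk_size
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            if chunk == zeros[:len(chunk)]:
                dst.seek(len(chunk), 1)  # skip forward: nothing is written
            else:
                dst.write(chunk)
        # A trailing hole needs truncate() to give the file its full length.
        dst.truncate(dst.tell())
```

The copy has the same contents and apparent size as the original; only the number of allocated blocks can differ.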

thund3rstruck 02-24-2012 04:00 PM

Quote:

Originally Posted by Karderio (Post 3858351)
I tend to ramble... maybe nobody is interested in my problems, maybe googling this thread could help someone one day.

Umm... this post is outstanding.

I just ran an rsync operation and somehow my source directory which contains 118GB of files bloats up to 220GB after rsync is complete yet all the files look the same. I'm just starting my journey into this and I appreciate this post.

suicidaleggroll 02-24-2012 04:11 PM

Quote:

Originally Posted by thund3rstruck (Post 4611325)
Umm... this post is outstanding.

I just ran an rsync operation and somehow my source directory which contains 118GB of files bloats up to 220GB after rsync is complete yet all the files look the same. I'm just starting my journey into this and I appreciate this post.

Any chance your source directory contains a bunch of sym or hard links? That's the most common reason a copy blows up like that for me.
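
Hard links are quick to check for. Note that rsync's `-a` does not include `-H` (`--hard-links`), so without it each link is copied as a separate file. A minimal sketch, with the illustrative helper `hardlink_overhead`, that estimates how many extra bytes such a copy would take:

```python
import os
import stat


def hardlink_overhead(root):
    """Estimate extra bytes used if hard links under root are copied as separate files."""
    seen = set()  # (device, inode) pairs already counted once
    extra = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode) or st.st_nlink < 2:
                continue  # not a regular file, or it has only one name
            key = (st.st_dev, st.st_ino)
            if key in seen:
                extra += st.st_size  # another name for data already counted
            else:
                seen.add(key)
    return extra


# Example use: print(hardlink_overhead("/path/to/source"))
```

The estimate uses apparent sizes and only sees links whose names both lie under `root`.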

thund3rstruck 02-24-2012 08:30 PM

Quote:

Originally Posted by suicidaleggroll (Post 4611338)
Any chance your source directory contains a bunch of sym or hard links? That's the most common reason a copy blows up like that for me.

Actually it's a long story. Windows server failed but we had backups of all the data on a USB drive (formatted in NTFS) so we restored all that data to an ext3 Linux/Samba server. Then we wanted to resume backups using the existing NTFS drive we restored from and that's the one where rsync is doubling the file sizes.

We just completed a quick test: deleting the existing backup data from the USB drive and re-running rsync from scratch fixes the problem.


All times are GMT -5.