Linux - Software
This forum is for Software issues. Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.
Both cp and rsync have options to make files sparse as they copy them. What I am finding is that they are not as effective at this as they could be. I have a file with a few non-zero data bytes in the first 512 bytes and binary zeros through the remainder of its 4194304-byte size. On a 4K-block ext4 filesystem it occupies 4K of allocated space. Copied with cp --sparse=always, the copy occupies 32K. Copied with rsync -S, it occupies 8K. Yet if I truncate the file to 512 bytes and then truncate it back to 4194304 bytes, it occupies only 4K, and the contents remain the same.
So I'm looking for something better than cp or rsync for making files sparse. I see no reason a tool can't go all the way, in this case down to 4K. Or do I need to implement this myself?
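The truncate trick described above can be reproduced along these lines (a sketch; demo.bin is a hypothetical file name, sizes chosen to match the example: 512 data bytes, 4194304 bytes total):

```shell
# Build a fully allocated (non-sparse) file: 512 non-zero bytes,
# then an explicit zero tail written with dd.
dd if=/dev/urandom of=demo.bin bs=512 count=1 2>/dev/null
dd if=/dev/zero of=demo.bin bs=512 seek=1 count=8191 conv=notrunc 2>/dev/null
du -k demo.bin            # allocated size before

# The truncate trick: cut off the all-zero tail, then restore the length.
# The first 512 bytes are untouched; the tail comes back as a hole.
truncate -s 512 demo.bin
truncate -s 4194304 demo.bin
du -k demo.bin            # allocated size after
```

Note this only works because everything past byte 512 is known to be zero; truncating discards whatever was in that range.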
Tinkster: try using it on a non-sparse file obtained by dd'ing 1M of zeros from /dev/zero. The original file is non-sparse, but most of the methods discussed here will produce sparse results from it. The cpio man page explicitly states that it simply searches for zero blocks.
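That test can be run as follows (a sketch; zeros.bin and zeros.sparse are hypothetical names):

```shell
# 1 MiB of zeros, fully allocated because dd really writes the bytes:
dd if=/dev/zero of=zeros.bin bs=1M count=1 2>/dev/null
# cp scans for runs of zeros and can turn them into holes in the copy:
cp --sparse=always zeros.bin zeros.sparse
# Same apparent length, different allocation:
ls -ls zeros.bin zeros.sparse
```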
I'm just wondering how/why the du utility doesn't get it right on the original.
To expand on what raskin said: if you have a sparse file and write some data somewhere in that file, blocks are allocated for that data. If you then overwrite the data with zeroes, the blocks do not get deallocated, at least on ext3.
So your original file used to have data in some areas, and that data has since been replaced with zeroes. When you do a sparse copy, allocated blocks that contain only zeroes are detected and no blocks are allocated for them in the new copy. Hence it is no wonder the copy can come out smaller: the tools can find zero blocks that were previously nonzero and drop them.
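The allocated-but-all-zero situation described in that reply can be demonstrated like this (a sketch; holey.bin is a hypothetical name):

```shell
# A sparse file with one 4 KiB data block 100 blocks in:
dd if=/dev/urandom of=holey.bin bs=4096 seek=100 count=1 2>/dev/null
du -k holey.bin    # small: only the written block is allocated
# Overwrite that block with zeros; on ext3/ext4 it stays allocated:
dd if=/dev/zero of=holey.bin bs=4096 seek=100 count=1 conv=notrunc 2>/dev/null
du -k holey.bin    # allocation unchanged, even though the content is all zeros
```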
Maybe there is an ioctl() that can tell the filesystem to deallocate a specific range. If not, it would be nice to add one. There is such an ioctl() for devices that support discarding blocks, generally used for solid-state devices with a wear-leveling layer, and perhaps also in virtual machine engines that operate on an underlying compacted device file. I wonder what this would do on a loopback block device backed by a file (ideally, a discard on the loopback device would be passed back to the backing file to deallocate the range).
Of course, this all depends on the underlying filesystem actually supporting sparse files and having added support for sparsifying blocks in existing files.
A function at the library layer could try the discard/deallocate call first, and if that fails because it is not implemented, just pwrite() zeros there instead.
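Such a call does exist on current kernels: fallocate(2) with FALLOC_FL_PUNCH_HOLE deallocates a byte range while keeping the file length. The util-linux fallocate(1) utility exposes it from the shell. A sketch, assuming a filesystem with punch-hole support (ext4, XFS, btrfs); padded.bin is a hypothetical name:

```shell
# Eight fully allocated 4 KiB blocks of zeros:
dd if=/dev/zero of=padded.bin bs=4096 count=8 2>/dev/null
du -k padded.bin       # all blocks allocated
# Punch a hole over the whole range; --punch-hole implies --keep-size,
# so the file length is unchanged. Fails with EOPNOTSUPP on filesystems
# that cannot punch holes.
fallocate --punch-hole --offset 0 --length 32768 padded.bin \
    || echo "punch-hole not supported on this filesystem"
du -k padded.bin       # blocks deallocated; apparent size unchanged
ls -l padded.bin
```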