How to virtually join files (i.e. without directing cat to a new file)?

chadwick · 08-21-2009, 05:43 PM

I have a few large files that together form an image of a disk partition. In other words, if combined they would be the image of a partition, but it has been split so that each individual file is smaller.

I want to join them together in order to mount and view the contents of the partition, but I don't have enough space on my drive to cat them together and write them as a new file on the drive. Another benefit could be that if the end result is very large then you wouldn't have to wait forever for it all to be catted together.

It seems like such a simple thing to just virtually join them together without having to actually go through the process of using cat, but if I search for how to do this I don't find anything about it. Is there a simple way to do this?

rjlee · 08-21-2009, 09:54 PM

There seems to have been some discussion on this on the kernel mailing lists (http://lkml.indiana.edu/hypermail/li...02.3/0464.html) but it didn't amount to much.

If the files are small enough to hold in memory, then the solution is easy: just create a big tmpfs partition and cat the files together onto that. But I guess for something as big as a filesystem image, that's not going to be the case.
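
To judge whether the tmpfs route is realistic, it helps to know the combined size of the parts first. This is only a rough sketch: total_bytes is a helper name made up for illustration, and the mount point, the 4G size and the /tmp/one, /tmp/two, /tmp/three paths in the comments are placeholders, not values from the original question.

```shell
#!/bin/sh
# Print the combined size in bytes of the given files.
# (total_bytes is a hypothetical helper name, not an existing tool.)
total_bytes() {
    total=0
    for f in "$@"; do
        size=$(wc -c < "$f") || return 1
        total=$((total + size))
    done
    echo "$total"
}

# If the total fits comfortably in RAM, the parts could be joined on a tmpfs.
# Run as root; the mount points and the size are placeholders:
#   mkdir -p /mnt/joined /mnt/partition
#   mount -t tmpfs -o size=4G tmpfs /mnt/joined
#   cat /tmp/one /tmp/two /tmp/three > /mnt/joined/image
#   mount -o loop,ro /mnt/joined/image /mnt/partition
```

For example, "total_bytes /tmp/one /tmp/two /tmp/three" prints the number of bytes the tmpfs would need to hold.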

It should be possible to do this using a simple program written on top of FUSE. I started trying to rig up a simple program using Perl's Fuse.pm module, but it turned out to be a bit more complicated than I thought. It seems to work for me, and it's largely based on the Fuse example class, but be warned: I haven't extensively tested it. I doubt that it will damage any data so long as you treat the virtual file as read-only, but it's not pretty, I make no promises about it being fast (it could be very slow for large files, or not), and there may be bugs in it that could result in bad data in the virtual file. Oh, and don't try to write to the file (i.e. no fsck).

To run this, you will need a directory named "temp" in your home directory (this is where the virtual file will go), and some software that you should be able to install through your package manager: the fuse kernel module (which you probably have already), the libfuse libraries, Perl, and the Fuse.pm module for Perl.

To install Fuse.pm, open a root shell and type "perl -MCPAN -e shell", then "install Fuse". You may be prompted for defaults, and you can exit when finished. On Ubuntu, that didn't work for me, but there's a simple package you can install instead: use "sudo apt-get install libfuse-perl".

The array at the top of the script (just under where it says "my @files = ") contains the names of the files that make up the virtual file, in order. You will probably want to change this.

To mount the filesystem, just run the script - but be warned that this will lock your terminal, so have another one ready to access the files in. To unmount and free up the first terminal, use "fusermount -u ~/temp" (you should also run this if the script crashes for any reason to remove the mount point so you can start again).
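
Since the script is untested, it may be worth checking the virtual file against the real parts once it is mounted. The following is a sketch of one way to do that without writing a joined copy to disk; same_content is a made-up helper name, and it simply streams the parts through cat and compares them byte for byte with cmp.

```shell
#!/bin/sh
# Succeed only if JOINED holds exactly the bytes of the parts concatenated.
# Usage: same_content JOINED PART...
# (same_content is a hypothetical helper name, not an existing tool.)
same_content() {
    joined=$1
    shift
    # cmp reads the concatenated stream from stdin ("-"); nothing is written to disk
    cat "$@" | cmp -s - "$joined"
}
```

With the filesystem mounted, something like "same_content ~/temp/catenated /tmp/one /tmp/two /tmp/three && echo identical" should print "identical" if the script is returning the right data. It reads every byte, so expect it to take a while on a large image.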

Finally, here's the script:

Code:

#!/usr/bin/perl -w

use warnings;
use strict;
use Fuse qw(:all);
use POSIX qw(ENOENT EISDIR EINVAL); # errno constants returned by the callbacks below

my @files = (
    "/tmp/one",
    "/tmp/two",
    "/tmp/three",
    );

# get size of files
my %filesize = map { $_ => sizeof($_) } @files;

my $totalfilesize = 0;
$totalfilesize += $filesize{$_} for @files;

warn "Starting with total file size $totalfilesize; do not modify files while filesystem mounted!";

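sub sizeof {
    my $file = shift;
    # -s gives the size in bytes directly, without passing the name through a shell
    my $size = -s $file;
    die "Cannot get size of $file: $!\n" unless defined $size;
    return $size;
}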
my (%files) = (
        '.' => {
                type => 0040,
                mode => 0755,
                ctime => time()-1000
        },
        catenated => {
                cont => "This is file 'b'.\n",
                type => 0100,
                mode => 0644,
                ctime => time()-1000
        },
);

sub filename_fixup {
        my ($file) = shift;
        $file =~ s,^/,,;
        $file = '.' unless length($file);
        return $file;
}

sub e_getattr {
        my ($file) = filename_fixup(shift);
        return -ENOENT() unless exists($files{$file});
        # only the virtual file reports the combined size; the directory reports 0
        my ($size) = exists($files{$file}{cont}) ? $totalfilesize : 0;
        my ($modes) = ($files{$file}{type}<<9) + $files{$file}{mode};
        my ($dev, $ino, $rdev, $blocks, $gid, $uid, $nlink, $blksize) = (0,0,0,1,0,0,1,1024);
        my ($atime, $ctime, $mtime);
        $atime = $ctime = $mtime = $files{$file}{ctime};
        # 2 possible types of return values:
        #return -ENOENT(); # or any other error you care to
        #print(join(",",($dev,$ino,$modes,$nlink,$uid,$gid,$rdev,$size,$atime,$mtime,$ctime,$blksize,$blocks)),"\n");
        return ($dev,$ino,$modes,$nlink,$uid,$gid,$rdev,$size,$atime,$mtime,$ctime,$blksize,$blocks);
}
sub e_getdir {
        # return as many text filenames as you like, followed by the retval.
        print((scalar keys %files)."\n");
        return (keys %files),0;
}

sub e_open {
        # VFS sanity check; it keeps all the necessary state, not much to do here.
        my ($file) = filename_fixup(shift);
        print("open called\n");
        return -ENOENT() unless exists($files{$file});
        return -EISDIR() if $files{$file}{type} & 0040;
        print("open ok\n");
        return 0;
}

sub e_read {
        # return an error numeric, or binary/text string.  (note: 0 means EOF, "0" will
        # give a byte (ascii "0") to the reading program)
        my ($file) = filename_fixup(shift);
        my ($buflen,$off) = @_;
        return -ENOENT() unless exists($files{$file});
        if(!exists($files{$file}{cont})) {
                return -EINVAL() if $off > 0;
                my $context = fuse_get_context();
                return sprintf("pid=0x%08x uid=0x%08x gid=0x%08x\n",@$context{'pid','uid','gid'});
        }
        return -EINVAL() if $off > $totalfilesize; #length($files{$file}{cont});
        return 0 if $off == $totalfilesize; #length($files{$file}{cont});
        # Find the source file that contains offset $off ($i) and the
        # position of that offset within it ($o)
        my ($o, $i) = ($off, 0);
        while ($i < @files && $o >= $filesize{$files[$i]}) {
            $o -= $filesize{$files[$i]};
            $i++;
        }
        my $read = 0;
        my $ret = "";
        # Read up to min($buflen,$totalfilesize-$off) bytes, moving on to the
        # next source file whenever the current one runs out
        while ($read < $buflen && $i < @files) {
            open(my $in, '<', $files[$i]) or return -EINVAL();
            binmode($in); # the image is binary data
            seek($in, $o, 0);
            my $chunk = '';
            my $r = read($in, $chunk, $buflen - $read);
            close($in);
            return -EINVAL() unless defined $r;
            $read += $r;
            $ret .= $chunk;
            $i++;
            $o = 0; # any further file is read from its start
        }
        return $ret;
}

sub e_statfs { return 255, 1, 1, 1, 1, 2 }

Fuse::main(
    "mountpoint" => "$ENV{HOME}/temp",
    "getattr"=>"main::e_getattr",
    "getdir" =>"main::e_getdir",
    "open"   =>"main::e_open",
    "statfs" =>"main::e_statfs",
    "read"   =>"main::e_read",
    );

TimothyEBaldwin · 08-23-2009, 03:12 PM

If the joins are reasonably aligned (512 bytes?), use losetup to map the files to loop devices, then use dmsetup to join the block devices.
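
As a sketch of what that could look like: device-mapper's "linear" target takes one table line per piece, with positions counted in 512-byte sectors, which is why the alignment matters. The helper below only prints such a table. build_linear_table is a name invented for this example, and the loop device names, file paths, sizes and the "joined" mapping name are all assumptions for illustration.

```shell
#!/bin/sh
# Print a device-mapper "linear" table that concatenates block devices.
# Arguments come in pairs: DEVICE SIZE_IN_BYTES (each size a multiple of 512).
# (build_linear_table is a hypothetical helper name, not an existing tool.)
build_linear_table() {
    start=0
    while [ "$#" -ge 2 ]; do
        dev=$1
        bytes=$2
        shift 2
        if [ $((bytes % 512)) -ne 0 ]; then
            echo "size of $dev is not a multiple of 512 bytes" >&2
            return 1
        fi
        sectors=$((bytes / 512))
        # table line: <start sector> <length in sectors> linear <device> <offset>
        echo "$start $sectors linear $dev 0"
        start=$((start + sectors))
    done
}

# Example with two hypothetical loop devices of 1 MiB and 512 KiB:
build_linear_table /dev/loop0 1048576 /dev/loop1 524288

# Possible use as root (not run here; names are placeholders):
#   losetup -r /dev/loop0 /tmp/one
#   losetup -r /dev/loop1 /tmp/two
#   build_linear_table /dev/loop0 "$(wc -c < /tmp/one)" \
#                      /dev/loop1 "$(wc -c < /tmp/two)" | dmsetup create joined
#   mount -o ro /dev/mapper/joined /mnt
#   ...
#   umount /mnt; dmsetup remove joined; losetup -d /dev/loop0 /dev/loop1
```

The example call prints "0 2048 linear /dev/loop0 0" followed by "2048 1024 linear /dev/loop1 0". Nothing is copied, so this needs no extra disk space, but if a part is not a whole number of sectors the helper refuses rather than building a table that would drop or shift data.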