06-24-2009, 08:25 PM   #1
ta0kira
extracting tar archive over a network


I have a tar archive. I need to be able to extract a subset of the files in a way that allows them to be sent over ssh to another machine to be saved there. As an example, I created the archive with something like this:
Code:
tar -c files... | ssh user@server save-tar.sh #<-- determines the archive name and location
What I need is the reverse of that process, without having to temporarily extract the files where the archive is. Unfortunately I can't just cat the archive across the network. First, I'm not extracting all of it (as little as a few MB from a 300GB+ archive). Second, the decision as to which archive a file will be extracted from is made on the same machine that stores the archive.

Essentially, I need to extract files from the archive into another archive that is sent to standard output (-xc is pretty much what I'm looking for). Thanks.
Kevin Barry
 
06-25-2009, 03:22 AM   #2
Hko
Since GNU tar (at least my version) does not accept the -x and -c options at the same time, it is simply not possible (with GNU tar) to do this without temporarily extracting the files.

What you would need is some kind of tar-file converter, or an enhanced tar program that supports -x and -c together.

With some luck such a program may already exist. Or it might even be doable to code one yourself.
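Such a program does in fact exist: bsdtar, the command-line front end of libarchive (which comes up later in this thread), can copy entries straight out of an existing archive while creating a new one. A sketch, assuming a bsdtar recent enough to support the @archive syntax and --include filtering (file names here are hypothetical):
Code:
bsdtar -cf - --include='dir/myfile' @huge.tar | ssh user@client tar -x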

The simplest solution would be to accept that you just need to temporarily extract the files from the tar archive. Like you said, it is just a few MB.
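As a sketch, that route might look like this (hypothetical names; it needs enough scratch space next to the archive, and ownership survives only if the extraction runs as root):
Code:
tmp=$(mktemp -d)
tar -xf huge.tar -C "$tmp" dir/myfile                 # temporary extraction
tar -C "$tmp" -c dir/myfile | ssh user@client tar -x  # re-pack and send
rm -rf "$tmp"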
 
06-25-2009, 05:18 AM   #3
ta0kira (Original Poster)
Quote:
Originally Posted by Hko
The simplest solution would be to accept that you just need to temporarily extract the files from the tar archive. Like you said, it is just a few MB.
I might actually write the code myself. The problem with extracting the files temporarily is that I'd need to run the server-side script as root, which would require either passwordless sudo or RSA login for the root account; otherwise I'd lose all of the ownership attributes of the files. "Just a few MB" is the minimum (to illustrate how absurd it would be to transmit the whole archive); it could also be the entire archive at once.
Kevin Barry

Last edited by ta0kira; 06-25-2009 at 05:20 AM.
 
06-25-2009, 06:38 AM   #4
Hko
Quote:
Originally Posted by ta0kira
I might actually write the code myself.
Not easy, I suppose. It would be pretty cool IMHO if you managed to do that, especially if the GNU project accepts your patch to have GNU tar support -xc :-)

Quote:
Originally Posted by ta0kira
The problem with extracting the files temporarily is that I'd need to run the server-side script as root, which would require either passwordless sudo or RSA login for the root account; otherwise I'd lose all of the ownership attributes of the files.
I don't know the details, but I believe a program like "fakeroot" can fix that. BTW, IIRC "fakeroot" is a Debian-specific tool, but there are others...
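A sketch of how fakeroot might apply here, using its -s/-i state-file options so the ownership recorded during extraction survives into the re-pack (file names are hypothetical):
Code:
fakeroot -s fk.state tar -xpf huge.tar dir/myfile                   # ownership is recorded in fk.state
fakeroot -i fk.state tar -cf - dir/myfile | ssh user@client tar -x  # re-pack with the recorded ownership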

Quote:
Originally Posted by ta0kira
"Just a few MB" is the minimum (to illustrate how absurd it would be to transmit the whole archive); it could also be the entire archive at once.
Hmm yes, that is another issue. It will be quite an overhead to un-tar and re-tar those files, especially in those cases where, in the end, the entire file will cross the network anyway. But that will be the case either way, whether you extract temporarily or not.

Good luck

Last edited by Hko; 06-25-2009 at 06:39 AM.
 
06-25-2009, 07:07 AM   #5
ta0kira (Original Poster)
Quote:
Originally Posted by Hko
Not easy, I suppose. It would be pretty cool IMHO if you managed to do that, especially if the GNU project accepts your patch to have GNU tar support -xc :-)
I don't think there's much to it, since a tar archive is essentially a concatenation of single-file archives. I think I just need to jump from one header to the next and write the header plus data section to standard output for each file I want to extract. In fact, <tar.h> contains the header structure already. I think the hardest part might be selecting the files using a list, but I'm considering a ternary search tree for that. Or, depending on how convoluted the tar code is, I might just be able to have it print the header as a modification of -xO. I'll let you know. Thanks.
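As a rough illustration of that header-to-header jump, a shell sketch (assuming a plain ustar archive named huge.tar with no GNU long-name extensions; the name field is 100 bytes at offset 0 of each header, the size field is 12 octal bytes at offset 124, and file data is padded to 512-byte blocks):
Code:
offset=0                                  # current position, in 512-byte blocks
while :; do
    # name field: 100 bytes at the start of the header block
    name=$(dd if=huge.tar bs=1 skip=$((offset * 512)) count=100 2>/dev/null | tr -d '\0')
    [ -z "$name" ] && break               # an all-zero header block ends the archive
    # size field: 12 octal bytes at offset 124 of the header block
    size=$(dd if=huge.tar bs=1 skip=$((offset * 512 + 124)) count=12 2>/dev/null | tr -d '\0 ')
    size=$((8#$size))                     # the size field is octal
    blocks=$(( (size + 511) / 512 ))      # data is padded to whole 512-byte blocks
    echo "$name  ($size bytes, $blocks data block(s))"
    offset=$((offset + 1 + blocks))       # jump past the header block plus data blocks
done
To extract a member into a new archive, you would write out its header block and data blocks verbatim instead of just printing the name.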
Kevin Barry
 
06-25-2009, 11:52 AM   #6
Hko
Quote:
Originally Posted by ta0kira
In fact, <tar.h> contains the header structure already.
Ah, I wasn't aware that tar.h exists. On my system (Ubuntu 9.04) it only contains a bunch of macros defining constants and bitmasks, though.

You may already be aware of it, but I also found a "man 5 tar", which has quite a lot of info about all the different tar formats. On Ubuntu (so probably Debian as well) it is in the package "libarchive1".

When I checked which other files this package contains, I found "man 5 libarchive-formats", which I think may be useful to you.
 
06-25-2009, 03:08 PM   #7
ta0kira (Original Poster)
Thanks for the help. It was pretty easy once I got a handle on the format; most of the work was realizing that headers and file data are padded to 512-byte blocks. I submitted the patch just now. I've also attached the src portion of the patch below (against the most recent git revision). I'll be adding the hacked version to my system under a suffixed name.
Kevin Barry
Attached Files
File Type: txt tar-plus-export-headers-src3.diff.txt (7.0 KB, 10 views)

Last edited by ta0kira; 06-28-2009 at 03:25 AM.
 
06-25-2009, 06:12 PM   #8
Hko
Wow. You familiarized yourself with the tar sources pretty quickly. I'm impressed.

Quote:
Originally Posted by Your patch submission message
Code:
 tar -xOQf huge.tar smallfile | ssh me@client tar -x
But what I don't get is: what is "smallfile" doing in this command line? Wasn't the idea that "smallfile" would be serialized by tar to send it over the wire, and never exist on the filesystem?

Am I missing something?
 
06-25-2009, 08:47 PM   #9
ta0kira (Original Poster)
Quote:
Originally Posted by Hko
But what I don't get is: what is "smallfile" doing in this command line? Wasn't the idea that "smallfile" would be serialized by tar to send it over the wire, and never exist on the filesystem?
smallfile is contained within the archive; without -OQ it would be saved to the file system, and with -O alone only the file's contents would be sent to standard output. If you were to list several files for extraction without -Q, you'd be unable to distinguish between them. The option I added outputs the file's archive header first, then the file's contents (both padded to a multiple of 512 bytes with 0x00). This gives you a new archive that's a subset of the original archive. For example:
Code:
bash> tar -c directory | ssh user@server "cat > huge.tar"          #<-- archive to server
bash> ssh user@server tar -xOQf huge.tar directory/myfile | tar -x #<-- restore from server
Kevin Barry

PS As another example:
Code:
bash> tar -c /usr > usr.tar
bash> tar -xOQf usr.tar usr/local > usr-local.tar
PPS This actually doesn't work for anything other than regular files, so I'll have to take another look. I don't think it will be much of a problem, though.

P3S It works for all file types now. See the revised attachment above, or go to the patch request.

Last edited by ta0kira; 06-25-2009 at 11:57 PM.
 
06-26-2009, 03:19 AM   #10
Hko
Thanks for the explanation. I get it; I tried it on a small test.tar and it worked. Nice!

It sounds like you're using tar archives as an inefficient remote filesystem, though... But I assume you know what you're doing (for now).

Learned something about tar archives here. Interesting. Thanks.

PS It had been a while since I compiled software like this. What the heck is the "bootstrap" script doing, I wonder? After git-ing the sources, bootstrap starts git-ing some more, and on top of that requires "cvs"...

Last edited by Hko; 06-26-2009 at 03:23 AM.
 
06-26-2009, 09:40 AM   #11
ta0kira (Original Poster)
I'm pretty sure bootstrap grabs other GNU sources and "borrows" them; however, you can run "make dist" to get a self-contained package once you've run ./configure. I also had to install texinfo, flex, and bison (I had autotools installed already).
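For reference, the sequence is roughly this (a sketch; the clone URL is the usual Savannah location for GNU tar and may have changed):
Code:
git clone git://git.savannah.gnu.org/tar.git
cd tar
./bootstrap          # pulls in gnulib and other shared GNU sources
./configure
make dist            # produces a self-contained release tarball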

Since you brought it up, here's what I'm doing. I wrote a backup system in bash that performs an mtree-like traversal of a specified directory to create a table of md5 sums (or device type, major/minor numbers, etc.). It compares that table with the most recent backup's table and backs up only the changed files, which are simply tarred over ssh. There's no compression because speed is more important (I might run bzip2 as a cron job on the server), and the backups stay tarred to retain all file attributes. The backup part works amazingly well; however, I was at a loss when trying to figure out how to restore just a single file, or how to restore everything without transmitting the original file and all of the changes.
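The change-detection step might look roughly like this as a sketch (hypothetical names, /data as the backed-up directory, and it assumes paths without whitespace):
Code:
# build a checksum table, diff it against the previous run's table,
# and tar only files whose sums changed or are new
find /data -type f -exec md5sum {} + | sort -k 2 > table.new
diff table.old table.new | awk '/^> / { print $3 }' > changed.list
tar -c --files-from=changed.list | ssh user@server "cat > backup-$(date +%F).tar"
mv table.new table.old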

So here's what I plan to do for recovery:
  1. Pass a list of files to a script (run by the server)
  2. Have the script locate the most-recent partial backup
  3. Extract all listed files contained in the archive (sent to the client)
  4. Remove the names of the extracted files from the list
  5. Locate the next-most recent backup and go back to 3
  6. End when the list is empty or the oldest backup has been processed
I like having the data stored in an extremely common format, and I also like the simplicity of running it with bash; however, the approach does have some weaknesses. Those shouldn't be a problem if access is controlled and the scripts aren't intentionally misused.
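A sketch of that loop (hypothetical layout: backups stored as /backups/*.tar with names sorting oldest-to-newest, no whitespace in file names; -Q is the patched option from this thread, and -i lets the client tar read the concatenated subset archives):
Code:
want=$(cat files.list)                      # names still to recover
for archive in $(ls -r /backups/*.tar); do  # most recent backup first
    [ -z "$want" ] && break                 # everything has been recovered
    have=$(tar -tf "$archive" | grep -Fx -f <(echo "$want"))
    if [ -n "$have" ]; then
        tar -xOQf "$archive" $have          # emit the subset as an archive on stdout
        want=$(grep -Fxv -f <(echo "$have") <<< "$want")
    fi
done | ssh user@client tar -xi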
Kevin Barry
 
  



LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 03:09 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration