Programming
This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
I have a tar archive. I need to be able to extract a subset of the files in a way that allows them to be sent over ssh to another machine to be saved there. As an example, I created the archive with something like this:
Code:
tar -c files... | ssh user@server save-tar.sh # <-- determines the archive name and location
What I need is the reverse of that process, without having to temporarily extract the files where the archive is stored. Unfortunately I can't just cat the archive across the network: first, because I'm not extracting all of it (as little as a few MB from a 300GB+ archive), and second, because the decision about which archive a file will be extracted from is made on the same machine that stores the archive.
Essentially, I need to extract files from the archive into another archive (-xc is pretty much what I'm looking for), which is sent to standard output. Thanks.
Kevin Barry
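For reference, here is a minimal sketch of the round trip being discussed, with the ssh hop replaced by a local pipe so it runs anywhere; all file and directory names are made up for the demo. This is the "temporarily extract, then re-tar" workaround the question is trying to avoid:

```shell
# Build a miniature stand-in for the "300GB+" archive
mkdir -p demo/sub restore
echo "hello" > demo/file1
echo "world" > demo/sub/file2
tar -cf huge.tar demo

# Server side: temporarily extract the wanted subset, re-tar it to stdout.
# Client side: the final "tar -x -C restore" stands in for "| ssh user@client tar -x".
tmp=$(mktemp -d)
tar -xf huge.tar -C "$tmp" demo/sub/file2
tar -C "$tmp" -c . | tar -x -C restore
rm -rf "$tmp"

ls restore/demo/sub
```

Only the selected file makes the trip; the drawback, as discussed below, is the temporary extraction on the server and the loss of ownership attributes unless the intermediate step runs as root.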
Since GNU tar (at least my version) does not accept the -x and -c options at the same time, it is simply not possible (with GNU tar) to do this without temporarily extracting the files.
What you would need is some kind of tar-file converter, or an enhanced tar program that supports -x and -c together.
The simplest solution would be to accept that you just need to temporarily extract the files from the tar archive. Like you said, it is just a few MB.
I might actually write the code myself. The problem with extracting the files temporarily is I'd need to run the server-side script as root, which would either require passwordless sudo or RSA login for the root account; otherwise I'd lose all of the ownership attributes of the files. "Just a few MB" is the minimum (to illustrate how absurd it would be to transmit the archive); however, it could be the entire archive at once.
Kevin Barry
Not easy I suppose. Would be pretty cool IMHO if you managed to do that, especially if the GNU project accepts your patch to have GNU tar support -xc :-)
Quote:
Originally Posted by ta0kira
The problem with extracting the files temporarily is I'd need to run the server-side script as root, which would either require passwordless sudo or RSA login for the root account; otherwise I'd lose all of the ownership attributes of the files.
I don't know the details, but I believe a program like "fakeroot" can fix that. BTW IIRC "fakeroot" is a debian specific tool, but there are others...
Quote:
Originally Posted by ta0kira
"Just a few MB" is the minimum (to illustrate how absurd it would be to transmit the archive); however, it could be the entire archive at once.
Hmm yes, that is another issue. It will be quite an overhead to un-tar and re-tar those files, especially in those cases where, in the end, the entire archive will cross the network anyway. But that will be the case whether you temporarily extract or not.
Quote:
Not easy I suppose. Would be pretty cool IMHO if you managed to do that, especially if the GNU project accepts your patch to have GNU tar support -xc :-)
I don't think there's much to it since an archive is merely a collection of archives. I think I just need to jump from one header to the next and write the header + data section to standard out for each file I want to extract. In fact <tar.h> contains the header structure already. I think the hardest part might be selecting the files using a list, but I'm considering a ternary search tree for that. Or, depending on how convoluted the tar code is, I might just be able to have it print the header as a modification of -xO. I'll let you know. Thanks.
Kevin Barry
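The header-to-header jump described above can be sketched in plain bash: each ustar entry is a 512-byte header (the name at offset 0, the size as octal text at offset 124), followed by the file data rounded up to whole 512-byte blocks. The fixture files and names below are invented for the demo:

```shell
# Build a small archive to walk
mkdir -p walkdemo
printf 'abc'   > walkdemo/a.txt
printf '12345' > walkdemo/b.txt
tar -cf walk.tar walkdemo

offset=0
total=$(wc -c < walk.tar)
while [ "$offset" -lt "$total" ]; do
    # Name field: offset 0, 100 bytes, NUL-padded
    name=$(dd if=walk.tar bs=1 skip="$offset" count=100 2>/dev/null | tr -d '\0')
    [ -z "$name" ] && break                      # an all-zero block ends the archive
    # Size field: offset 124, 12 bytes of octal text
    octsize=$(dd if=walk.tar bs=1 skip=$((offset + 124)) count=12 2>/dev/null | tr -d '\0 ')
    size=$((8#$octsize))
    echo "$name $size"
    blocks=$(( (size + 511) / 512 ))             # data is padded to full blocks
    offset=$(( offset + 512 + blocks * 512 ))    # jump to the next header
done > listing.txt
cat listing.txt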
Quote:
Originally Posted by ta0kira
In fact <tar.h> contains the header structure already.
Ah. I wasn't aware that tar.h exists. On my system (ubuntu 9.04) it only contains a bunch of macros to define some constants and bitmasks though.
You may be already aware of it, but now I also found a "man 5 tar" which has quite a lot of info about all the different tar formats. On ubuntu (so probably debian as well) it is in the package "libarchive1".
When I checked which other files this package contains, I found "man 5 libarchive-formats", which I think may be useful to you.
Thanks for the help. It was pretty easy once I got a handle on the format; most of the work was realizing that headers and file data are padded to 512-byte blocks. I submitted the patch just now. I've also attached the src portion of the patch below (against the most recent git revision). I'll be adding the hacked version to my system under a suffixed name.
Kevin Barry
Wow. You got familiar with the tar sources pretty quickly. I'm impressed.
Quote:
Originally Posted by Your patch submission message
Code:
tar -xOQf huge.tar smallfile | ssh me@client tar -x
But what I don't get is: what is "smallfile" doing in this command line? Wasn't it the idea that the "smallfile" would be serialized by tar to send it over the wire and never exist on the filesystem?
Quote:
But what I don't get is: what is "smallfile" doing in this command line? Wasn't it the idea that the "smallfile" would be serialized by tar to send it over the wire and never exist on the filesystem?
smallfile is contained within the archive; without -OQ it would be saved to the filesystem, and without -Q only the file's contents would be sent to standard output. If you were to list several files for extraction without -Q, you'd be unable to distinguish between them. The option I added outputs each file's archive header first, then the file's contents (both padded to 512-byte blocks with 0x00). This gives you a new archive that's a subset of the original archive. For example:
Code:
bash> tar -c directory | ssh user@server "cat > huge.tar" # <-- archive to server
bash> ssh user@server tar -xOQf huge.tar directory/myfile | tar -x # <-- restore from server
Kevin Barry
PS As another example:
Code:
bash> tar -c /usr > usr.tar
bash> tar -xOQf usr.tar usr/local > usr-local.tar
PPS This actually doesn't work for anything other than regular files, so I'll have to take another look. I don't think it will be much of a problem, though.
P3S It works for all file types now. See the revised attachment above, or go to the patch request.
Thanks for the explanation. I get it, and tried it on a small test.tar and it worked. Nice!
It sounds like you're using tar archives as an inefficient remote filesystem, though... But I assume you know what you're doing (for now).
Learned something about tar archives here. Interesting. Thanks.
PS It had been a while since I compiled software like this. What the heck is the "bootstrap" script doing, I wonder. After git-ing the sources, bootstrap starts git-ing some more, and on top of that it requires "cvs"...
I'm pretty sure bootstrap grabs other GNU sources and "borrows" them; however, you can use make dist to produce a self-contained package once you ./configure. I also had to install texinfo, flex, and bison (I had autotools installed already).
Since you brought it up, here's what I'm doing. I wrote a backup system in bash that performs an mtree-like traversal of a specified directory to create a table of md5 sums (or device type, major/minor numbers, etc.). It compares the table with the most recent backup's table and backs up only the changed files, simply by tarring them over ssh. No compression, because speed is more important (I might run bzip2 as a cron job on the server), and backups are tarred to retain all file attributes. The backup part works amazingly well; however, I was at a loss when trying to figure out how to restore just a single file, or how to restore everything without transmitting the original file plus all of the changes.
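A toy version of the table-compare step described above can be written with standard tools: hash everything, diff the table against the previous run, and tar only what changed. All file and directory names here are invented for the demo:

```shell
# "Previous backup" state
mkdir -p data
echo one > data/a
echo two > data/b
find data -type f -exec md5sum {} + | sort > old.md5

# Things change between backups
echo CHANGED > data/b                                  # one file is modified
echo three   > data/c                                  # one file is new
find data -type f -exec md5sum {} + | sort > new.md5

# Lines present only in new.md5 are new or modified files
comm -13 old.md5 new.md5 | awk '{print $2}' > changed.list
tar -cf incremental.tar -T changed.list
tar -tf incremental.tar
```

Since comm compares whole "hash  path" lines, a changed file shows up as a line unique to the new table, the same as a brand-new file, so one pass catches both.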
So here's what I plan to do for recovery:
1. Pass a list of files to a script (run by the server)
2. Have the script locate the most recent partial backup
3. Extract all listed files contained in the archive (sent to the client)
4. Remove the names of the extracted files from the list
5. Locate the next-most-recent backup and go back to 3
6. End when the list is empty or the oldest backup has been processed
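The steps above can be sketched as a loop over the backups, newest first, pulling each wanted file from the most recent backup that contains it; the backup layout and file names below are made up for the demo, and the local tar -xf stands in for the tar -xOQf ... | ssh client tar -x hop:

```shell
# Two partial backups: the older one has a and b, the newer one only a (changed)
mkdir -p backups snap restoredir
echo old-a > snap/a
echo old-b > snap/b
tar -cf backups/backup-001.tar -C snap a b
echo new-a > snap/a
tar -cf backups/backup-002.tar -C snap a

want="a b"
for archive in $(ls -r backups/backup-*.tar); do   # newest first
    [ -z "$want" ] && break                        # stop once everything is restored
    remaining=""
    for f in $want; do
        if tar -tf "$archive" | grep -qx "$f"; then
            tar -xf "$archive" -C restoredir "$f"  # real use: tar -xOQf ... | ssh client tar -x
        else
            remaining="$remaining $f"              # try this one in an older backup
        fi
    done
    want=$remaining
done
```

Each file comes from the newest backup that has it: a from the new backup, b from the old one.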
I like having the data stored in an extremely common format, and I also like the simplicity of running it with bash; however, it does have some weaknesses to it. Those shouldn't be a problem if access is controlled and the scripts aren't intentionally misused, however.
Kevin Barry