06-24-2009, 08:25 PM   #1
ta0kira
extracting tar archive over a network


I have a tar archive. I need to be able to extract a subset of the files in a way that allows them to be sent over ssh to another machine to be saved there. As an example, I created the archive with something like this:
Code:
tar -c files... | ssh user@server save-tar.sh #<-- determines the archive name and location
What I need is the reverse of that process, without having to temporarily extract the files where the archive is. Unfortunately I can't just cat the archive across the network. First, I'm not extracting all of it (as little as a few MB from a 300GB+ archive). Second, the decision as to which archive a file will be extracted from is made on the same machine that stores the archive.

Essentially, I need to extract files from the archive into another archive that is sent to standard output (-xc is pretty much what I'm looking for). Thanks.
Kevin Barry
 
06-25-2009, 03:22 AM   #2
Hko
Since GNU tar (at least my version) does not accept the -x and -c options at the same time, it is simply not possible (with GNU tar) to do this without temporarily extracting the files.

What you would need is some kind of tar-file converter, or an enhanced tar program that supports -x and -c together.

With some luck such a program may already exist. Or it might even be doable to code one yourself.
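Such a program does in fact exist: bsdtar, the command-line front end of libarchive (which comes up later in this thread), can copy entries straight out of an existing archive while creating a new one. A sketch, assuming a bsdtar recent enough to support the @archive syntax and --include filtering (file names here are hypothetical):
Code:
bsdtar -cf - --include='dir/myfile' @huge.tar | ssh user@client tar -x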

The simplest solution would be to accept that you just need to temporarily extract the files from the tar archive. Like you said, it is just a few MB.
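As a sketch, that route might look like this (hypothetical names; it needs enough scratch space next to the archive, and ownership survives only if the extraction runs as root):
Code:
tmp=$(mktemp -d)
tar -xf huge.tar -C "$tmp" dir/myfile                 # temporary extraction
tar -C "$tmp" -c dir/myfile | ssh user@client tar -x  # re-pack and send
rm -rf "$tmp"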
 
06-25-2009, 05:18 AM   #3
ta0kira (Original Poster)
Quote:
Originally Posted by Hko
The simplest solution would be to accept that you just need to temporarily extract the files from the tar archive. Like you said, it is just a few MB.
I might actually write the code myself. The problem with extracting the files temporarily is that I'd need to run the server-side script as root, which would require either passwordless sudo or RSA login for the root account; otherwise I'd lose all of the ownership attributes of the files. "Just a few MB" is the minimum (to illustrate how absurd it would be to transmit the whole archive); it could also be the entire archive at once.
Kevin Barry

Last edited by ta0kira; 06-25-2009 at 05:20 AM.
 
06-25-2009, 06:38 AM   #4
Hko
Quote:
Originally Posted by ta0kira
I might actually write the code myself.
Not easy, I suppose. It would be pretty cool IMHO if you managed to do that, especially if the GNU project accepts your patch to have GNU tar support -xc :-)

Quote:
Originally Posted by ta0kira
The problem with extracting the files temporarily is that I'd need to run the server-side script as root, which would require either passwordless sudo or RSA login for the root account; otherwise I'd lose all of the ownership attributes of the files.
I don't know the details, but I believe a program like "fakeroot" can fix that. BTW, IIRC "fakeroot" is a Debian-specific tool, but there are others...
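A sketch of how fakeroot might apply here, using its -s/-i state-file options so the ownership recorded during extraction survives into the re-pack (file names are hypothetical):
Code:
fakeroot -s fk.state tar -xpf huge.tar dir/myfile                   # ownership is recorded in fk.state
fakeroot -i fk.state tar -cf - dir/myfile | ssh user@client tar -x  # re-pack with the recorded ownership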

Quote:
Originally Posted by ta0kira
"Just a few MB" is the minimum (to illustrate how absurd it would be to transmit the whole archive); it could also be the entire archive at once.
Hmm yes, that is another issue. It will be quite an overhead to un-tar and re-tar those files, especially in those cases where, in the end, the entire file will cross the network anyway. But that will be the case either way, whether you extract temporarily or not.

Good luck

Last edited by Hko; 06-25-2009 at 06:39 AM.
 
06-25-2009, 07:07 AM   #5
ta0kira (Original Poster)
Quote:
Originally Posted by Hko
Not easy, I suppose. It would be pretty cool IMHO if you managed to do that, especially if the GNU project accepts your patch to have GNU tar support -xc :-)
I don't think there's much to it, since a tar archive is essentially a concatenation of single-file archives. I think I just need to jump from one header to the next and write the header plus data section to standard output for each file I want to extract. In fact, <tar.h> contains the header structure already. I think the hardest part might be selecting the files using a list, but I'm considering a ternary search tree for that. Or, depending on how convoluted the tar code is, I might just be able to have it print the header as a modification of -xO. I'll let you know. Thanks.
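As a rough illustration of that header-to-header jump, a shell sketch (assuming a plain ustar archive named huge.tar with no GNU long-name extensions; the name field is 100 bytes at offset 0 of each header, the size field is 12 octal bytes at offset 124, and file data is padded to 512-byte blocks):
Code:
offset=0                                  # current position, in 512-byte blocks
while :; do
    # name field: 100 bytes at the start of the header block
    name=$(dd if=huge.tar bs=1 skip=$((offset * 512)) count=100 2>/dev/null | tr -d '\0')
    [ -z "$name" ] && break               # an all-zero header block ends the archive
    # size field: 12 octal bytes at offset 124 of the header block
    size=$(dd if=huge.tar bs=1 skip=$((offset * 512 + 124)) count=12 2>/dev/null | tr -d '\0 ')
    size=$((8#$size))                     # the size field is octal
    blocks=$(( (size + 511) / 512 ))      # data is padded to whole 512-byte blocks
    echo "$name  ($size bytes, $blocks data block(s))"
    offset=$((offset + 1 + blocks))       # jump past the header block plus data blocks
done
To extract a member into a new archive, you would write out its header block and data blocks verbatim instead of just printing the name.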
Kevin Barry
 
06-25-2009, 11:52 AM   #6
Hko
Quote:
Originally Posted by ta0kira
In fact, <tar.h> contains the header structure already.
Ah, I wasn't aware that tar.h exists. On my system (Ubuntu 9.04) it only contains a bunch of macros defining constants and bitmasks, though.

You may already be aware of it, but I also found a "man 5 tar", which has quite a lot of info about all the different tar formats. On Ubuntu (so probably Debian as well) it is in the package "libarchive1".

When I checked which other files this package contains, I found "man 5 libarchive-formats", which I think may be useful to you.
 
06-25-2009, 03:08 PM   #7
ta0kira (Original Poster)
Thanks for the help. It was pretty easy once I got a handle on the format; most of the work was realizing that headers and file data are padded to 512-byte blocks. I submitted the patch just now. I've also attached the src portion of the patch below (against the most recent git revision). I'll be adding the hacked version to my system under a suffixed name.
Kevin Barry
Attached Files
File Type: txt tar-plus-export-headers-src3.diff.txt (7.0 KB, 10 views)

Last edited by ta0kira; 06-28-2009 at 03:25 AM.
 
06-25-2009, 06:12 PM   #8
Hko
Wow. You familiarized yourself with the tar sources pretty quickly. I'm impressed.

Quote:
Originally Posted by Your patch submission message
Code:
 tar -xOQf huge.tar smallfile | ssh me@client tar -x
But what I don't get is: what is "smallfile" doing in this command line? Wasn't the idea that "smallfile" would be serialized by tar to send it over the wire, and never exist on the filesystem?

Am I missing something?
 
06-25-2009, 08:47 PM   #9
ta0kira (Original Poster)
Quote:
Originally Posted by Hko
But what I don't get is: what is "smallfile" doing in this command line? Wasn't the idea that "smallfile" would be serialized by tar to send it over the wire, and never exist on the filesystem?
smallfile is contained within the archive; without -OQ it would be saved to the file system, and with -O alone only the file's contents would be sent to standard output. If you were to list several files for extraction without -Q, you'd be unable to distinguish between them. The option I added outputs the file's archive header first, then the file's contents (both padded to a multiple of 512 bytes with 0x00). This gives you a new archive that's a subset of the original archive. For example:
Code:
bash> tar -c directory | ssh user@server "cat > huge.tar"          #<-- archive to server
bash> ssh user@server tar -xOQf huge.tar directory/myfile | tar -x #<-- restore from server
Kevin Barry

PS As another example:
Code:
bash> tar -c /usr > usr.tar
bash> tar -xOQf usr.tar usr/local > usr-local.tar
PPS This actually doesn't work for anything other than regular files, so I'll have to take another look. I don't think it will be much of a problem, though.

P3S It works for all file types now. See the revised attachment above, or go to the patch request.

Last edited by ta0kira; 06-25-2009 at 11:57 PM.
 
06-26-2009, 03:19 AM   #10
Hko
Thanks for the explanation. I get it; I tried it on a small test.tar and it worked. Nice!

It sounds like you're using tar archives as an inefficient remote filesystem, though... But I assume you know what you're doing (for now).

Learned something about tar archives here. Interesting. Thanks.

PS It had been a while since I compiled software like this. What the heck is the "bootstrap" script doing, I wonder? After git-ing the sources, bootstrap starts git-ing some more, and on top of that requires "cvs"...

Last edited by Hko; 06-26-2009 at 03:23 AM.
 
06-26-2009, 09:40 AM   #11
ta0kira (Original Poster)
I'm pretty sure bootstrap grabs other GNU sources and "borrows" them; however, you can run "make dist" to get a self-contained package once you've run ./configure. I also had to install texinfo, flex, and bison (I had autotools installed already).
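For reference, the sequence is roughly this (a sketch; the clone URL is the usual Savannah location for GNU tar and may have changed):
Code:
git clone git://git.savannah.gnu.org/tar.git
cd tar
./bootstrap          # pulls in gnulib and other shared GNU sources
./configure
make dist            # produces a self-contained release tarball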

Since you brought it up, here's what I'm doing. I wrote a backup system in bash that performs an mtree-like traversal of a specified directory to create a table of md5 sums (or device type, major/minor numbers, etc.). It compares that table with the most recent backup's table and backs up only the changed files, which are simply tarred over ssh. There's no compression because speed is more important (I might run bzip2 as a cron job on the server), and the backups stay tarred to retain all file attributes. The backup part works amazingly well; however, I was at a loss when trying to figure out how to restore just a single file, or how to restore everything without transmitting the original file and all of the changes.
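The change-detection step might look roughly like this as a sketch (hypothetical names, /data as the backed-up directory, and it assumes paths without whitespace):
Code:
# build a checksum table, diff it against the previous run's table,
# and tar only files whose sums changed or are new
find /data -type f -exec md5sum {} + | sort -k 2 > table.new
diff table.old table.new | awk '/^> / { print $3 }' > changed.list
tar -c --files-from=changed.list | ssh user@server "cat > backup-$(date +%F).tar"
mv table.new table.old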

So here's what I plan to do for recovery:
  1. Pass a list of files to a script (run by the server)
  2. Have the script locate the most-recent partial backup
  3. Extract all listed files contained in the archive (sent to the client)
  4. Remove the names of the extracted files from the list
  5. Locate the next-most recent backup and go back to 3
  6. End when the list is empty or the oldest backup has been processed
I like having the data stored in an extremely common format, and I also like the simplicity of running it with bash; however, the approach does have some weaknesses. Those shouldn't be a problem if access is controlled and the scripts aren't intentionally misused.
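A sketch of that loop (hypothetical layout: backups stored as /backups/*.tar with names sorting oldest-to-newest, no whitespace in file names; -Q is the patched option from this thread, and -i lets the client tar read the concatenated subset archives):
Code:
want=$(cat files.list)                      # names still to recover
for archive in $(ls -r /backups/*.tar); do  # most recent backup first
    [ -z "$want" ] && break                 # everything has been recovered
    have=$(tar -tf "$archive" | grep -Fx -f <(echo "$want"))
    if [ -n "$have" ]; then
        tar -xOQf "$archive" $have          # emit the subset as an archive on stdout
        want=$(grep -Fxv -f <(echo "$have") <<< "$want")
    fi
done | ssh user@client tar -xi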
Kevin Barry
 
  



LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 03:09 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration