LinuxQuestions.org
Linux - Software This forum is for Software issues.
Old 06-19-2018, 04:18 AM   #1
qrange
Senior Member
 
Registered: Jul 2006
Location: Belgrade, Yugoslavia
Distribution: Debian stable/testing, amd64
Posts: 1,061

Rep: Reputation: 47
re-tar


I need to re-pack a tar archive to change the file order in the tar.gz, because I forgot to use '--sort=name'.

Is there a way to do it without unpacking the whole archive first?

thanks.
 
Old 06-19-2018, 05:35 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,838

Rep: Reputation: 7308
is there any reason to do that?
 
Old 06-19-2018, 05:53 AM   #3
qrange
Senior Member
 
Registered: Jul 2006
Location: Belgrade, Yugoslavia
Distribution: Debian stable/testing, amd64
Posts: 1,061

Original Poster
Rep: Reputation: 47
thanks for reply.
well, yes. given that tar concatenates files, and gzip compresses whole thing, it sounds to me like 'solid' compression.
in that case it is useful to have similar files closer together, therefore sorting by filenames might give slightly better ratio.
 
Old 06-19-2018, 06:47 AM   #4
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,307
Blog Entries: 3

Rep: Reputation: 3721
I've looked at a few combinations but it looks like the only way is to extract to a directory and then build a new tarball from that. If you have GNU tar, then you might look at the --sort option or else use find and sort to feed tar. The main obstacle is that tar has its origins with magnetic tapes which are pretty much only sequential and don't really do random access well.
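
For anyone who wants to avoid the temporary directory, here is a minimal sketch in Python rather than tar itself. It assumes Python's standard `tarfile` module is available; the function name `repack_sorted` is made up for illustration. Member headers and file data are streamed from the old archive into a new one in name order, so nothing lands on disk except the new tarball. It has not been checked against sparse files or other unusual member types, and plain name sorting may not match GNU tar's `--sort=name` ordering exactly.

```python
import tarfile


def repack_sorted(src_path: str, dst_path: str) -> list[str]:
    """Copy src_path into a new gzip-compressed tar with members ordered by name.

    Nothing is extracted to disk: each member's data is read from the old
    archive and written straight into the new one.
    """
    with tarfile.open(src_path, "r:*") as src, tarfile.open(dst_path, "w:gz") as dst:
        # getmembers() scans the whole archive once to collect the headers.
        members = sorted(src.getmembers(), key=lambda m: m.name)
        for member in members:
            # Only regular files carry data here; directories, symlinks and
            # similar entries are copied as header-only members.
            data = src.extractfile(member) if member.isreg() else None
            dst.addfile(member, data)
        return [m.name for m in members]
```

Note that the source still has to be decompressed while reading, and jumping around inside a gzip stream is slow, so on huge archives this may take noticeably longer than a plain extract and re-tar.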

Last edited by Turbocapitalist; 06-19-2018 at 06:48 AM.
 
Old 06-19-2018, 08:15 AM   #5
jlinkels
LQ Guru
 
Registered: Oct 2003
Location: Bonaire, Leeuwarden
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,195

Rep: Reputation: 1043
Sorting is a relatively (in the sense of limited) useless operation. In tape times you might have wanted some files at the start of the tape so you could retrieve them sooner.

If you untar a file to disk, the question is whether time is relevant at all. And secondly, an untarred archive is not really written to disk sorted in any way. Files appear sorted in alphabetical order because ls sorts them alphabetically by default.

jlinkels
 
Old 06-19-2018, 08:55 AM   #6
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
Quote:
Originally Posted by qrange
thanks for reply.
well, yes. given that tar concatenates files, and gzip compresses whole thing, it sounds to me like 'solid' compression.
in that case it is useful to have similar files closer together, therefore sorting by filenames might give slightly better ratio.
I know you marked the thread Solved, and I concur with the advice that tar is just a Tape ARchive, because it is: you can look at a non-compressed tar file and just see your files concatenated.


You can also individually add or extract files to/from a tar file.


I've highlighted your comment because of my question: What do you mean by a slightly better ratio? Of compression?


If so, the answer should technically be "no". Each compression utility should be agnostic to the data.


Anyways, just wondering what the intention of that was, if any.
 
Old 06-19-2018, 11:42 AM   #7
qrange
Senior Member
 
Registered: Jul 2006
Location: Belgrade, Yugoslavia
Distribution: Debian stable/testing, amd64
Posts: 1,061

Original Poster
Rep: Reputation: 47
yes, it should get better compression, and no, compression is not agnostic to data.
a bunch of zeroes compresses a lot better than random numbers.
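
A quick way to see this, sketched in Python with the standard `zlib` module (the same DEFLATE method gzip uses). The one-million-byte size is an arbitrary choice for illustration.

```python
import os
import zlib

SIZE = 1_000_000  # arbitrary sample size, in bytes

zeros = bytes(SIZE)       # SIZE zero bytes
noise = os.urandom(SIZE)  # SIZE random bytes

zeros_packed = len(zlib.compress(zeros, 6))
noise_packed = len(zlib.compress(noise, 6))

print(f"zeros : {SIZE} -> {zeros_packed} bytes")
print(f"random: {SIZE} -> {noise_packed} bytes")
```

The zeros should shrink to a tiny fraction of their size, while the random bytes should not shrink at all.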
 
Old 06-19-2018, 11:54 AM   #8
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,307
Blog Entries: 3

Rep: Reputation: 3721
I realize it would be a small sample size, but it would be interesting if you would check the size of the final tarballs and see if the order of the data has much effect on the compression. When I ran tar over my own source files, adding --sort=name actually increased the size a little.

Last edited by Turbocapitalist; 06-19-2018 at 11:57 AM.
 
Old 06-19-2018, 11:58 AM   #9
qrange
Senior Member
 
Registered: Jul 2006
Location: Belgrade, Yugoslavia
Distribution: Debian stable/testing, amd64
Posts: 1,061

Original Poster
Rep: Reputation: 47
er, i've used gzip with a low level (-2) and the difference was almost negligible.
but 7z with a large dictionary size (1 GB) might benefit more. will test later.


edit: iirc, 7z sorts files by extension for the same reason, that could be better option.
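
A rough way to test that idea without real data, sketched in Python. Everything here is synthetic and assumed for illustration: the file names, the 20 KB and 100 KB sizes, and the use of Python's `lzma` module as a stand-in for 7z (same algorithm family, default settings, not a 1 GB dictionary). Two identical files are placed either next to each other or on opposite sides of a large unrelated file, and the resulting tar is compressed with DEFLATE and with LZMA.

```python
import io
import lzma
import os
import tarfile
import zlib


def tar_bytes(files: dict[str, bytes], order: list[str]) -> bytes:
    """Build an uncompressed tar in memory with members in the given order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in order:
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def packed_sizes(raw: bytes) -> tuple[int, int]:
    """Return (DEFLATE size, LZMA size) for the same input."""
    return len(zlib.compress(raw, 9)), len(lzma.compress(raw))


# Hypothetical test data: two identical 20 KB files and 100 KB of unrelated data.
shared = os.urandom(20 * 1024)
files = {
    "a.dat": shared,
    "b.dat": shared,
    "filler.dat": os.urandom(100 * 1024),
}

adjacent = tar_bytes(files, ["a.dat", "b.dat", "filler.dat"])
apart = tar_bytes(files, ["a.dat", "filler.dat", "b.dat"])

print("adjacent (deflate, lzma):", packed_sizes(adjacent))
print("apart    (deflate, lzma):", packed_sizes(apart))
```

With data like this, DEFLATE should only profit from the duplicate when the two copies sit close together, while LZMA's larger dictionary should find it in either order. Real files are rarely exact duplicates, so the effect on an actual archive may be much smaller.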

Last edited by qrange; 06-19-2018 at 12:07 PM.
 
Old 06-19-2018, 01:27 PM   #10
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,838

Rep: Reputation: 7308
Quote:
Originally Posted by qrange
thanks for reply.
well, yes. given that tar concatenates files, and gzip compresses whole thing, it sounds to me like 'solid' compression.
in that case it is useful to have similar files closer together, therefore sorting by filenames might give slightly better ratio.
based on what I know about gzip compression, it is completely irrelevant. But if you could measure it and prove it, I will accept it...
 
Old 06-19-2018, 01:37 PM   #11
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
I would say that the same compression utility on the same data, just ordered differently, will get similar results.

Yes some programs do better than others. If that's important for you, then it is.
 
Old 06-19-2018, 02:34 PM   #12
qrange
Senior Member
 
Registered: Jul 2006
Location: Belgrade, Yugoslavia
Distribution: Debian stable/testing, amd64
Posts: 1,061

Original Poster
Rep: Reputation: 47
@pan64

gzip in this case sees a single huge 'file'.
data order matters, otherwise, in any given file, we could simply group ones and zeroes

anyway, i was going to use de-duplicating software/compressors on several huge .tars and later 7zip, and still think this re-ordering would help.


files are huge, and every (solid) compressor has a 'window' in which it searches for 'similarities'. so it's important to keep similar data close together.

Last edited by qrange; 06-19-2018 at 02:37 PM.
 
Old 06-20-2018, 12:34 AM   #13
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,838

Rep: Reputation: 7308
http://blog.servergrove.com/2014/04/...ression-works/
 
Old 06-20-2018, 05:25 AM   #14
qrange
Senior Member
 
Registered: Jul 2006
Location: Belgrade, Yugoslavia
Distribution: Debian stable/testing, amd64
Posts: 1,061

Original Poster
Rep: Reputation: 47
thanks.
so it uses this: https://en.wikipedia.org/wiki/DEFLATE

and the 'window' is:
Quote:
Within compressed blocks, if a duplicate series of bytes is spotted (a repeated string), then a back-reference is inserted, linking to the previous location of that identical string instead. An encoded match to an earlier string consists of an 8-bit length (3–258 bytes) and a 15-bit distance (1–32,768 bytes) to the beginning of the duplicate. Relative back-references can be made across any number of blocks, as long as the distance appears within the last 32 KB of uncompressed data decoded (termed the sliding window).
anyhow, I've created a perl script that tries to find optimal file order:
https://encode.ru/threads/2969-files-reordering

so if you got CPU cycles to burn...
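
To make the back-reference idea from the quote concrete, here is a toy sketch in Python. It is not how gzip or zlib actually search (they use hash chains and lazy matching, and then Huffman-code the result); it is a brute-force greedy matcher that only borrows the limits from the quote: matches of 3 to 258 bytes, at most 32 KB back. The function names are made up for illustration.

```python
MIN_MATCH, MAX_MATCH, WINDOW = 3, 258, 32 * 1024


def lz77_tokens(data: bytes, window: int = WINDOW) -> list:
    """Greedy toy LZ77: return literal byte values and (length, distance) pairs."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Only positions inside the sliding window are candidates.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < MAX_MATCH and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= MIN_MATCH:
            tokens.append((best_len, best_dist))  # back-reference
            i += best_len
        else:
            tokens.append(data[i])  # literal byte
            i += 1
    return tokens


def lz77_expand(tokens: list) -> bytes:
    """Rebuild the original bytes from the token list."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            length, dist = tok
            for _ in range(length):
                out.append(out[-dist])  # byte by byte, so overlapping copies work
        else:
            out.append(tok)
    return bytes(out)


print(lz77_tokens(b"abcabcabcabc"))
print(lz77_tokens(b"abcdXYZWabcd", window=4))
```

The first call should yield three literals followed by a single back-reference. The second call uses an artificially small 4-byte window, so the repeated "abcd" is out of reach and everything stays a literal, which is the same effect as similar files sitting more than 32 KB apart in a tar.gz.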
 
Old 06-20-2018, 07:09 AM   #15
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,838

Rep: Reputation: 7308
so your files are huge, which probably means megabytes or gigabytes. The window size is 32 KB, which is most probably less than 1% of your files. reordering your files may mean gzip will be able to optimize that 1% [a bit] better, but it will do exactly the same job within your files (excluding the first and last 32 KB).
 
  

