Thanks for the reply.
Well, yes. Given that tar concatenates files and gzip compresses the whole thing, it sounds to me like 'solid' compression.
In that case it is useful to have similar files close together, so sorting by filename might give a slightly better ratio.
I've looked at a few combinations, but it looks like the only way is to extract to a directory and then build a new tarball from that. If you have GNU tar, you might look at the --sort option, or else use find and sort to feed tar. The main obstacle is that tar has its origins with magnetic tapes, which are strictly sequential and don't do random access well.
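For example, with GNU tar 1.28 or newer, something along these lines should work (the directory name is just a placeholder):

    # Sort archive members by name while creating the tarball (GNU tar >= 1.28):
    tar --sort=name -czf sorted.tar.gz mydir/

    # Portable alternative: let find and sort decide the member order.
    find mydir/ -print0 | sort -z | tar czf sorted.tar.gz --null --no-recursion -T -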
Sorting is of limited usefulness. In the tape era you might have wanted certain files at the start of the tape so that they could be retrieved sooner.
If you untar an archive to disk, it is questionable whether that timing matters at all. And secondly, an untarred archive is not written to disk in any sorted order; the files only appear sorted because ls sorts its output alphabetically by default.
I know you've marked the thread solved, and I concur with the advice that tar is just a Tape ARchive, because it is: you can look at an uncompressed tar file and simply see your files concatenated.
You can also add or extract individual files to/from a tar file.
What caught my eye, and prompts my question, is your mention of a "slightly better ratio". Do you mean the compression ratio?
If so, the answer should technically be "no": each compression utility should be agnostic to the order of the data.
Anyway, I'm just wondering what the intention of that was, if any.
I realize it would be a small sample size, but it would be interesting if you would check the size of the final tarballs and see whether the order of the data has much effect on the compression. When I ran tar over my own source files, adding --sort=name actually increased the size a little.
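A quick way to check, assuming GNU tar and some sample directory src/ (both placeholders), is to compare the compressed byte counts directly:

    # Compare compressed sizes with and without name sorting.
    tar --sort=none -cf - src/ | gzip -9 | wc -c
    tar --sort=name -cf - src/ | gzip -9 | wc -c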
Er, I've used gzip with a low compression level (-2) and the difference was almost negligible.
But 7z with a large dictionary size (1 GB) might benefit more. I will test later.
Edit: IIRC, 7z sorts files by extension for the same reason, so that could be a better option.
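As a rough sketch, assuming a 64-bit 7-Zip build with enough RAM for a 1 GB LZMA2 dictionary (the archive and directory names are placeholders):

    # Solid archive with a large dictionary, so matches can span whole files:
    7z a -t7z -m0=lzma2 -md=1g -ms=on big.7z hugedir/

    # Newer 7-Zip builds can also sort files by type/extension first:
    7z a -t7z -mqs=on sorted.7z hugedir/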
Based on what I know about gzip compression, the file order is completely irrelevant. But if you can measure it and prove otherwise, I will accept that...
gzip in this case sees a single huge 'file'.
Data order matters; otherwise, in any given file, we could simply group all the ones and zeroes together.
Anyway, I was going to run de-duplicating software/compressors on several huge .tars, and later 7-Zip, and I still think this reordering would help.
The files are huge, and every (solid) compressor has a 'window' within which it searches for similarities, so it's important to keep similar data close together.
Within compressed blocks, if a duplicate series of bytes is spotted (a repeated string), then a back-reference is inserted, linking to the previous location of that identical string instead. An encoded match to an earlier string consists of an 8-bit length (3–258 bytes) and a 15-bit distance (1–32,768 bytes) to the beginning of the duplicate. Relative back-references can be made across any number of blocks, as long as the distance appears within the last 32 KB of uncompressed data decoded (termed the sliding window).
So your files are huge, which probably means megabytes or gigabytes. The window size is 32 KB, which is most probably less than 1% of each file. Reordering your files means gzip may optimize that 1% at the file boundaries a bit better, but it will do exactly the same job within your files (excluding the first and last 32 KB of each).
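You can see that window limit directly with a small experiment (the file names and sizes here are arbitrary): an identical chunk compresses to almost nothing when it reappears within 32 KB, but not when random filler pushes it outside the window.

    # 16 KB of random data, repeated immediately vs. after 64 KB of filler.
    head -c 16384 /dev/urandom > chunk
    head -c 65536 /dev/urandom > filler
    cat chunk chunk        | gzip -9 | wc -c   # second copy is back-referenced
    cat chunk filler chunk | gzip -9 | wc -c   # second copy is out of the window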