Thanks for the reply.
Well, yes. Given that tar concatenates files and gzip compresses the whole thing, it sounds to me like 'solid' compression.
In that case it is useful to have similar files close together, so sorting by filename might give a slightly better ratio.
I've looked at a few combinations, but it looks like the only way is to extract to a directory and then build a new tarball from that. If you have GNU tar, you might look at the --sort option, or else use find and sort to feed tar. The main obstacle is that tar has its origins with magnetic tapes, which are strictly sequential and don't do random access well.
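For example, with GNU tar 1.28 or newer, something along these lines should work (the directory name is just a placeholder):

    # Sort archive members by name while creating the tarball (GNU tar >= 1.28):
    tar --sort=name -czf sorted.tar.gz mydir/

    # Portable alternative: let find and sort decide the member order.
    find mydir/ -print0 | sort -z | tar czf sorted.tar.gz --null --no-recursion -T -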
Sorting is of limited usefulness. In the tape era you might have wanted certain files at the start of the tape so that they could be retrieved sooner.
If you untar an archive to disk, it is questionable whether that timing matters at all. And secondly, an untarred archive is not written to disk in any sorted order; the files only appear sorted because ls sorts its output alphabetically by default.
I know you've marked the thread solved, and I concur with the advice that tar is just a Tape ARchive, because it is: you can look at an uncompressed tar file and simply see your files concatenated.
You can also add or extract individual files to/from a tar file.
What caught my eye, and prompts my question, is your mention of a "slightly better ratio". Do you mean the compression ratio?
If so, the answer should technically be "no": each compression utility should be agnostic to the order of the data.
Anyway, I'm just wondering what the intention of that was, if any.
I realize it would be a small sample size, but it would be interesting if you would check the size of the final tarballs and see whether the order of the data has much effect on the compression. When I ran tar over my own source files, adding --sort=name actually increased the size a little.
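A quick way to check, assuming GNU tar and some sample directory src/ (both placeholders), is to compare the compressed byte counts directly:

    # Compare compressed sizes with and without name sorting.
    tar --sort=none -cf - src/ | gzip -9 | wc -c
    tar --sort=name -cf - src/ | gzip -9 | wc -c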
Er, I've used gzip with a low compression level (-2) and the difference was almost negligible.
But 7z with a large dictionary size (1 GB) might benefit more. I will test later.
Edit: IIRC, 7z sorts files by extension for the same reason, so that could be a better option.
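As a rough sketch, assuming a 64-bit 7-Zip build with enough RAM for a 1 GB LZMA2 dictionary (the archive and directory names are placeholders):

    # Solid archive with a large dictionary, so matches can span whole files:
    7z a -t7z -m0=lzma2 -md=1g -ms=on big.7z hugedir/

    # Newer 7-Zip builds can also sort files by type/extension first:
    7z a -t7z -mqs=on sorted.7z hugedir/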
Based on what I know about gzip compression, the file order is completely irrelevant. But if you can measure it and prove otherwise, I will accept that...
gzip in this case sees a single huge 'file'.
Data order matters; otherwise, in any given file, we could simply group all the ones and zeroes together.
Anyway, I was going to run de-duplicating software/compressors on several huge .tars, and later 7-Zip, and I still think this reordering would help.
The files are huge, and every (solid) compressor has a 'window' within which it searches for similarities, so it's important to keep similar data close together.
Within compressed blocks, if a duplicate series of bytes is spotted (a repeated string), then a back-reference is inserted, linking to the previous location of that identical string instead. An encoded match to an earlier string consists of an 8-bit length (3–258 bytes) and a 15-bit distance (1–32,768 bytes) to the beginning of the duplicate. Relative back-references can be made across any number of blocks, as long as the distance appears within the last 32 KB of uncompressed data decoded (termed the sliding window).
So your files are huge, which probably means megabytes or gigabytes. The window size is 32 KB, which is most probably less than 1% of each file. Reordering your files means gzip may optimize that 1% at the file boundaries a bit better, but it will do exactly the same job within your files (excluding the first and last 32 KB of each).
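You can see that window limit directly with a small experiment (the file names and sizes here are arbitrary): an identical chunk compresses to almost nothing when it reappears within 32 KB, but not when random filler pushes it outside the window.

    # 16 KB of random data, repeated immediately vs. after 64 KB of filler.
    head -c 16384 /dev/urandom > chunk
    head -c 65536 /dev/urandom > filler
    cat chunk chunk        | gzip -9 | wc -c   # second copy is back-referenced
    cat chunk filler chunk | gzip -9 | wc -c   # second copy is out of the window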