re-tar
I need to re-pack a tar archive to change the file order in a tar.gz, because I forgot to use '--sort=name'.
Is there a way to do it without unpacking the whole archive first? thanks. |
is there any reason to do that?
|
thanks for reply.
well, yes. given that tar concatenates files, and gzip compresses whole thing, it sounds to me like 'solid' compression. in that case it is useful to have similar files closer together, therefore sorting by filenames might give slightly better ratio. |
I've looked at a few combinations but it looks like the only way is to extract to a directory and then build a new tarball from that. If you have GNU tar, then you might look at the --sort option or else use find and sort to feed tar. The main obstacle is that tar has its origins with magnetic tapes which are pretty much only sequential and don't really do random access well.
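That said, a re-pack can be streamed member by member, so the files never land on disk as loose files. Below is a minimal sketch, assuming Python 3 and its standard tarfile module; `repack_sorted` and its arguments are hypothetical names made up for illustration. Reading a .tar.gz out of order makes the gzip layer re-read the stream on every backward seek, so this can be slow on big archives, and hard links (which could end up before their target) or sparse files would need extra handling.

```python
import tarfile


def repack_sorted(src_path, dst_path):
    """Copy every member of src_path into a new .tar.gz, ordered by name.

    Hypothetical helper: file data goes straight from one archive to the
    other, nothing is extracted to a directory.
    """
    with tarfile.open(src_path, "r:*") as src, \
            tarfile.open(dst_path, "w:gz") as dst:
        # getmembers() scans the whole archive once to collect the headers.
        members = sorted(src.getmembers(), key=lambda m: m.name)
        for member in members:
            # Only regular files carry data; directories and symlinks
            # are written as header-only entries.
            data = src.extractfile(member) if member.isreg() else None
            dst.addfile(member, data)
```

Called as `repack_sorted("old.tar.gz", "new.tar.gz")` (both paths are placeholders), it leaves the original archive untouched.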
|
Sorting is a relatively useless operation (relative in the sense of limited). In tape times you might have wanted some files at the start of the tape so you could retrieve them earlier.
If you untar a file to disk, the question is whether time is relevant at all. And secondly, an untarred archive is not really written to disk sorted in any way. Files appear sorted in alphabetical order because ls sorts them alphabetically by default. jlinkels |
Quote:
You can also individually add or extract files to/from a tar file. The reason I've highlighted your comment is my question: what do you mean by a slightly better ratio? Of compression? If so, the answer should technically be "no". Each compression utility should be agnostic to the data. Anyways, just wondering what the intention of that was, if any. |
yes, it should get better compression, and no, compression is not agnostic to data.
a bunch of zeroes compresses a lot better than random numbers. |
I realize it would be a small sample size, but it would be interesting if you would check the size of the final tarballs and see if the order of the data has much effect on the compression. When I ran tar over my own source files, adding --sort=name actually increased the size a little.
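For anyone who wants to run the same check, here is a rough sketch, assuming Python 3 and its standard tarfile module. `tar_gz_size` and the sample `files` dict are made-up stand-ins for real data. It builds the tarball in memory in a given member order and reports the compressed size, so two orderings can be compared directly.

```python
import io
import tarfile


def tar_gz_size(files, order):
    """Return the size in bytes of a .tar.gz holding `files` in `order`.

    `files` maps member name -> bytes; `order` is the sequence of names.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in order:
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


# Placeholder data; swap in the contents of real files to get a real answer.
files = {
    "notes_b.txt": b"lorem ipsum dolor sit amet " * 2000,
    "image.bin": bytes(range(256)) * 200,
    "notes_a.txt": b"lorem ipsum dolor sit amet " * 2000,
}

as_added = tar_gz_size(files, list(files))
by_name = tar_gz_size(files, sorted(files))
print("as added:", as_added, "sorted by name:", by_name)
```

Which order wins depends entirely on the data, so the numbers only mean something when run against the actual files.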
|
er, i've used gzip with small level (-2) and difference was almost negligible.
but 7z with a large dictionary size (1 GB) might benefit more. will test later. edit: iirc, 7z sorts files by extension for the same reason, that could be a better option. |
Quote:
|
I would say that the same compression utility on the same data, just if the data is ordered differently, will get similar results.
Yes some programs do better than others. If that's important for you, then it is. |
@pan64
gzip in this case sees a single huge 'file'. data order matters, otherwise, in any given file, we could simply group ones and zeroes ;) anyway, i was going to use de-duplicating software/compressors on several huge .tars and later 7zip, and still think this re-ordering would help. files are huge, and every (solid) compressor has a 'window' in which it searches for 'similarities'. so it's important to keep similar data close together. |
|
thanks.
so it uses this: https://en.wikipedia.org/wiki/DEFLATE and the 'window' is: Quote:
https://encode.ru/threads/2969-files-reordering so if you got CPU cycles to burn... |
so your files are huge, that probably means megabytes or gigabytes. The window size is 32 KB, which is most probably less than 1% of your files. Reordering your files may let gzip optimize that 1% [a bit] better, but it will do exactly the same job within your files (excluding the first and last 32 KB).
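The window effect is easy to see with a small experiment. This is only an illustrative sketch, assuming Python 3 and its standard zlib module (the same DEFLATE algorithm gzip uses); the sizes and variable names are arbitrary choices. The same 20,000-byte block is placed twice in the input, once right after its first copy and once 100,000 bytes later, which is beyond the 32 KB window.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so repeated runs use the same data


def random_bytes(n):
    """Return n pseudo-random bytes (practically incompressible)."""
    return bytes(rng.getrandbits(8) for _ in range(n))


block = random_bytes(20_000)    # the data that occurs twice
filler = random_bytes(100_000)  # unrelated data, much larger than 32 KB

near = block + block + filler   # second copy starts 20,000 bytes after the first
far = block + filler + block    # second copy starts 120,000 bytes after the first

near_size = len(zlib.compress(near, 6))
far_size = len(zlib.compress(far, 6))
print("copies adjacent:", near_size, "copies far apart:", far_size)
```

With the copies adjacent, DEFLATE can encode the second one as back-references; with the filler in between it has to store the block again. A file of megabytes or more sitting between two similar files has the same effect as the filler here.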
|