LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
Old 01-27-2011, 03:38 AM   #1
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,038

Rep: Reputation: 3203
Preference on highest compression


So I am really looking for the shortest / easiest method to make sure I download the smallest file (i.e. the one with the highest compression).

So here are my criteria:

Compression in preferred order (most desired at the top)-
Code:
xz
lzma
bz2
gz
tgz
zip
Note: I realise that all of these have options to improve on their default compression settings,
but for the sake of argument we will assume the defaults; if two results are the same size, the above order wins out.

Input file (example):

The original is from a wget of http://www.python.org/ftp/python/3.2/ saved into a file called python.
The following command is then run over that file:
Code:
egrep -io "python-[0-9][[:alnum:].-]+[bglx]z(2|ma)?" python | sort -V | uniq > output
So now our output file looks like:
Code:
Python-3.2a1.tar.bz2
Python-3.2a1.tgz
Python-3.2a2.tar.bz2
Python-3.2a2.tgz
Python-3.2a3.tar.bz2
Python-3.2a3.tgz
Python-3.2a4.tar.bz2
Python-3.2a4.tgz
Python-3.2b1.tar.bz2
Python-3.2b1.tar.xz
Python-3.2b1.tgz
Python-3.2b2.tar.bz2
Python-3.2b2.tar.xz
Python-3.2b2.tgz
Python-3.2rc1.tar.bz2
Python-3.2rc1.tar.xz
Python-3.2rc1.tgz
So previously I was adding a 'tail -n1' to the above command, but this of course retrieves
the 'tgz' file when in this case I would prefer the highest version ending in 'xz'.

So if anyone would like to advise on a better / shorter method than a bunch of else / ifs, it
would be appreciated.

Script is currently written in bash.
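For what it's worth, one way this could be sketched in bash, without a chain of if / elses: take the highest version via sort -V, then walk the preference list. The pick_best name is purely illustrative, and the sketch assumes file names shaped like the list above.

```shell
#!/usr/bin/env bash
# Sketch only (assumes GNU sort -V and names like Name-version[.tar].ext).
# Extension preference, most desired first - the order from the post.
prefs=(xz lzma bz2 gz tgz zip)

# Print the preferred file of the highest version found in a list file.
pick_best() {
    local list=$1 last stem ext match
    last=$(sort -V "$list" | tail -n1)   # highest version sorts last
    stem=${last%.tar.*}                  # strip .tar.xz / .tar.bz2 / ...
    stem=${stem%.tgz}; stem=${stem%.zip}; stem=${stem%.gz}
    for ext in "${prefs[@]}"; do
        match=$(grep -F "$stem." "$list" | grep -E "\.${ext}\$" | tail -n1)
        if [[ -n $match ]]; then
            printf '%s\n' "$match"
            return 0
        fi
    done
    return 1
}
```

Run against the output file in this post, pick_best would print Python-3.2rc1.tar.xz.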

Last edited by grail; 01-29-2011 at 12:35 AM.
 
Old 01-27-2011, 08:31 AM   #2
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,372

Rep: Reputation: 5389
Uhm, that page lists file sizes. If you want to download the smallest files, then you should look at the size of the file, not its extension.

A "worse" algorithm can, in some cases, give better results than a "better" algorithm, depending on the source data.
 
Old 01-27-2011, 09:11 AM   #3
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
Remember that you can also work with the size and date parameters in the HTML listing, so you can sort by date and then use bash:
http://www.python.org/ftp/python/3.2/?C=M;O=D
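If the server does produce such a sorted listing, the newest link can be pulled out of a saved copy of it; a rough sketch (the href pattern and the newest_link name are just illustrative):

```shell
#!/usr/bin/env bash
# Sketch: given a saved Apache-style index already sorted newest-first
# (e.g. fetched with the ?C=M;O=D query above), print the first tarball
# link. The regex only covers .tar.* and .tgz names.
newest_link() {
    grep -oE 'href="[^"]+\.(tar\.[a-z0-9]+|tgz)"' "$1" |
        head -n1 |
        sed -E 's/^href="(.*)"$/\1/'
}
# Usage sketch:
#   wget -qO index.html 'http://www.python.org/ftp/python/3.2/?C=M;O=D'
#   newest_link index.html
```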
 
Old 01-27-2011, 10:01 AM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,038

Original Poster
Rep: Reputation: 3203
Thanks for the feedback folks. My issue is that this was mainly an example to show the data I am dealing with. As an alternative, the following site has no sizes, and others have dates
in places not easily identified as related to the file I might be looking for:

http://procps.sourceforge.net/

The upside here is that there is only one reference on the entire page to the file I need, but the script is required to work for all sites used as a download source.
 
Old 01-27-2011, 10:20 AM   #5
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
You're right, actually; I too have found this to be a problem. I think what is needed is a bulletproof algorithm. I'm not sure if it exists, because how exactly do you sort version numbers correctly? This is the biggest problem; the others can be solved. You want the latest version, right?

The date may not be provided, and even if it is, they may update older packages.

I'll try some things and see if I can find a way, maybe I will make a script and use it as well.
 
Old 01-27-2011, 10:22 AM   #6
TobiSGD
Moderator
 
Registered: Dec 2009
Location: Germany
Distribution: Whatever fits the task best
Posts: 17,148
Blog Entries: 2

Rep: Reputation: 4886
Just a sidenote: in your list you declare .gz a better format than .tgz, but actually they are the same.
 
Old 01-27-2011, 10:53 AM   #7
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
Thoughts so far:

Use:
Code:
grep -o '<a href="[-._a-zA-Z0-9]*">' index.html | grep -o 'Python-[-._a-zA-Z0-9]*' | sort -V | grep 'gz$'
'Python-' is going to be variable, and you can run this with grep 'gz$', bz2$, xz$, etc. to make separate lists.
Next use rev and sed or cut to clip the extensions from each list.
Compare versions between lists to determine the latest version.
Download the best type as long as it has the latest version.

Am tired now, so will implement this later next week.
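A rough bash sketch of the steps above, with a small demo input baked in (the file names and names.txt are only illustrative):

```shell
#!/usr/bin/env bash
# Sketch of the separate-lists approach: one version-sorted list per
# extension, plus the newest stem with known extensions clipped off.
# Demo input: the tail end of the list from post #1.
printf '%s\n' Python-3.2b2.tgz Python-3.2rc1.tar.bz2 \
              Python-3.2rc1.tar.xz Python-3.2rc1.tgz > names.txt

# One list per extension (empty lists are harmless).
for ext in xz lzma bz2 gz tgz zip; do
    grep -E "\.${ext}\$" names.txt | sort -V > "list.$ext"
done

# Newest stem overall, extensions clipped (the rev/sed/cut step).
newest=$(sed -E 's/\.(tar\.)?(xz|lzma|bz2|gz|tgz|zip)$//' names.txt |
         sort -V | tail -n1)
echo "$newest"
```

From here, downloading "the best type as long as it has the latest version" is a matter of checking list.xz, then list.lzma, and so on for a line starting with $newest.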
 
Old 01-27-2011, 07:22 PM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,038

Original Poster
Rep: Reputation: 3203
Quote:
Originally Posted by TobiSGD
Just a sidenote: in your list you declare .gz a better format than .tgz, but actually they are the same.
You are of course correct, but I need an order, so I am favouring tar.compression over the combined tgz; zip was last only due to low popularity (if used at all).

@H_TeXMeX_H - I am not sure this is more efficient than my current method, as mine already sorts the versions correctly and provides the list shown in post #1.

I thought about it yesterday and realised it may not make as much sense on its own, so see my other question that contains most of the script
here

As an addendum to this question, does anyone have a quick way of getting only the files that share the same version, as a group (read: array)?
 
Old 01-27-2011, 08:18 PM   #9
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by grail
Note: I realise that all of these have options to improve on their default compression settings,
but for the sake of argument we will assume the defaults; if two results are the same size, the above order wins out.
Just remember that for every lossless compression algorithm there is some set of data it instead makes larger; it's mathematics. Even if you had a program that selected the "best" algorithm for the data and added a byte to indicate which algorithm, the same would be true.

bzip2 is supposed to be better than gzip for text. I'm not sure about the others, though. Also remember that zip is on a per-file basis, whereas bzip2 and gzip compress the whole archive, theoretically allowing for more compression. You should research the intended purpose of each algorithm and prioritize them based on what you know about what you're downloading.
Kevin Barry
 
1 member found this post helpful.
Old 01-28-2011, 10:25 AM   #10
orgcandman
Member
 
Registered: May 2002
Location: new hampshire
Distribution: Fedora, RHEL
Posts: 600

Rep: Reputation: 110
Quote:
Originally Posted by ta0kira
Just remember that for every lossless compression algorithm there is some set of data it instead makes larger; it's mathematics. Even if you had a program that selected the "best" algorithm for the data and added a byte to indicate which algorithm, the same would be true.
Kevin,
This is not exactly true. Let's say you take a whole byte to indicate which algorithm, plus an additional x bytes (say 4, for simplicity) that indicate this is indeed a file for your 'magic' compression. If you cannot compress, there's no need to actually write out a file that's 5+ bytes larger. And if your compressor would only save 5 or fewer bytes, in this example, you would fail the compression, stating 'not able to compress'. It's something of a cheat, since you don't actually end up with a compressed file - but when every byte is critical (say, in something like stream compression) then you need to know when you should and shouldn't waste the space resource.
As a matter of nit, the Wikipedia article only states that there is no way to write an algorithm which always maps data[X] => data[X-1] and data[X-1] => data[X]. Clearly, your algorithm (with my improvement) is better than your algorithm on its own, but it will never compress all possible data. However, it will also never expand the data, meaning the worst case is no worse than not using compression; i.e. a lossless compression algorithm which never makes the data set larger.

Pedantic regards,
-Aaron
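As an aside, the fallback could be sketched like this in bash, using gzip purely as an example (the function name is made up; gzip -k needs GNU gzip 1.6+ and stat -c is the GNU form):

```shell
#!/usr/bin/env bash
# Sketch of the "never expand" idea: compress, but fall back to storing
# the original whenever the compressed copy comes out no smaller.
compress_if_smaller() {
    local f=$1
    gzip -kf -- "$f" || return 1         # keep the original alongside
    if [ "$(stat -c%s "$f.gz")" -ge "$(stat -c%s "$f")" ]; then
        rm -- "$f.gz"                    # compression lost: store as-is
        echo "stored"
    else
        rm -- "$f"                       # compressed copy wins
        echo "compressed"
    fi
}
```

Already-random data ends up "stored" because gzip's header and trailer outweigh any savings, which is exactly the per-file overhead being discussed.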

Last edited by orgcandman; 01-28-2011 at 10:26 AM.
 
Old 01-28-2011, 11:42 AM   #11
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
I was going to mention an "exception" in which some of the information about the file is retained somewhere else and not transmitted with the file. In the case of a non-compressed file, the "magic" system retains information about the type of file and would deduce that it wasn't compressed. You're just out of luck if the file happens to have the same magic signature that a compression program uses.

You must consider the "magic" as part of the compression system; therefore, this "workaround" introduces a flaw in the system. Suppose you intuit that a file is how it's supposed to be if it can't be expanded despite having a certain magic signature. You can also have completely random data that appears to be an entirely valid compressed file.

In conclusion, the math is right; the only way around it is to accept certain error in some rare cases. Mathematically they aren't that rare; they just seem that way because of the extreme orderliness of the data we use.
Kevin Barry

PS Don't forget that appending .bz2 and .gz to the file name increases the data size!

Last edited by ta0kira; 01-28-2011 at 11:45 AM.
 
Old 01-28-2011, 01:42 PM   #12
orgcandman
Member
 
Registered: May 2002
Location: new hampshire
Distribution: Fedora, RHEL
Posts: 600

Rep: Reputation: 110Reputation: 110
So that we don't derail the OP's thread, I'll take my responses to PM (or a separate thread, if you want to open one).
 
Old 01-28-2011, 02:47 PM   #13
gnashley
Amigo developer
 
Registered: Dec 2003
Location: Germany
Distribution: Slackware
Posts: 4,928

Rep: Reputation: 614
I would just always prefer the xz compressed tarballs (or lzma, as it is nearly exactly the same), then bzip2, then gzip, then zip - exactly as you outlined. I wouldn't worry about 5 bytes or some other ridiculous difference. Anywhere that size would make a real difference, xz *will* be the smallest. There is a penalty on decompression time with xz, but it is still faster than bunzip2. However, xz takes *longer* to compress archives than bzip2.
lzma and xz use (I think) exactly the same compression algorithm. lzma was abandoned because the file format was inferior - it didn't provide any handy means of identifying the files; they only show up as 'data' when examined with 'file', whereas xz fixed this so the files can be properly identified. 7zip is nearly the same as xz, but uses a slightly different container - there is *some* tool or library out there which can deal with them both.

Since I have some idea of what you are doing with your code, I'd point out that sometimes you may want or need to deal with rpm source archives -some progs/libs are not easily found in other format.

As for retrieving the original list of archives: if all you have is http access, then you have to parse the list from the HTML output, but if ftp is supported, you may find lsftp useful:
http://sourceforge.net/projects/lsftp/files/
 
Old 01-28-2011, 08:23 PM   #14
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by orgcandman
So that we don't derail the OP's thread, I'll take my responses to PM (or a separate thread, if you want to open one).
Sorry, I gave too much argument and not enough conclusion. Here's a summary, to-the-point and on topic.

"bzip2 is better than gzip" (etc.) is only an effective heuristic if you constrain the source data. You cannot say that either is better in absolutely all cases; you can't even say that either is always better than the uncompressed data. If you compare original, bzip2ed, and gzipped versions of all possible files, there will inevitably be a significant number of files where each respective version comes out smaller than the other two. If you assume that the maintainer will skip compression if it will make the file larger, you should just go ahead and assume that he or she will post only the smallest version of the file.
Kevin Barry
 
Old 01-28-2011, 11:55 PM   #15
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,038

Original Poster
Rep: Reputation: 3203
@Kevin & Aaron - whilst your debate is interesting, it is a little heavier than the intent of this question. My reason for setting a somewhat arbitrary order is simply due to the fact that when downloading the source files for an application the order I have presented is generally the order in which you would find the files ordered were the size of each to also be included.

@gnashley - Thanks as always for replying. Yes, I am aware of the other types which I may eventually need to download, like rpms, but that is a little way off at the moment.
As it appears there is no real uniformity around always having ftp, http sites or even svn or git repositories (also trying to avoid these at the moment <sheesh>), I am using wget
to retrieve the page, weed out the file names with my handy little egrep line, and then work out a way to get the file with the highest compression (i.e. smallest file) based on the order provided
above.

So to consolidate the question again, there are now two things I am looking to nut out:

1. Retrieve only those files with the highest version number, from example above this would be:
Code:
Python-3.2rc1.tar.bz2
Python-3.2rc1.tar.xz
Python-3.2rc1.tgz
Here you could have used a simple tail to retrieve the last 3 lines; however, there is no guarantee of 3 files (see post #4 above, where there is only 1 file).

2. Once the above has been handled, retrieve the one with the best compression based on the list order being:
Code:
xz
lzma
bz2
gz
tgz
zip
Looking forward to all your feedback.

I will post any solutions I come up with so they can be reviewed and / or improved.
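As a sketch of how the two steps might hang together in bash (the helper names are invented; assumes a list file shaped like the output file in post #1):

```shell
#!/usr/bin/env bash
# Step 1: collect every file of the highest version into an array.
# Step 2: pick from that array using the extension order above.
prefs=(xz lzma bz2 gz tgz zip)

latest_group() {
    local list=$1 stem
    # Newest stem = version-sort the names with extensions clipped off.
    stem=$(sed -E 's/\.(tar\.)?(xz|lzma|bz2|gz|tgz|zip)$//' "$list" |
           sort -V | tail -n1)
    # Fill the global array "group" with every file of that version.
    mapfile -t group < <(grep -F "$stem." "$list")
}

best_of_group() {
    local ext f
    for ext in "${prefs[@]}"; do
        for f in "${group[@]}"; do
            [[ $f == *."$ext" ]] && { printf '%s\n' "$f"; return 0; }
        done
    done
    return 1
}
```

Note that the *.gz glob does not accidentally match .tgz (there is no literal dot before the gz), so the gz / tgz distinction in the list survives.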
 
  

