[SOLVED] Preference on highest compression

grail · 01-29-2011, 12:36 AM

So something I was working on I thought might help get the ball rolling a little:

Code:

#!/bin/bash

# this is using the output file from original post
UPDATE=($(awk -F"[.]t" '{if(x == $1)arr = arr" "$0;else{ x = $1;arr=$0}}END{printf "%s", arr}' output))

for ext in xz lzma bz2 gz tgz
do
    for file in ${UPDATE[*]}
    do
        [[ ${file##*.} == $ext ]] && break 2
    done
done

echo $file

ta0kira · 01-29-2011, 01:17 PM

Have you considered sorting? You might try something like this:

Code:

#!/bin/bash


PRIORITY=( xz lzma tar.bz2 tar.gz tgz ) #<-- you only need to change this to change priorities
COUNT=$(( ${#PRIORITY[*]} - 1 ))

#read a newline-separated list of file names
if [ $# -gt 0 ]; then
  UPDATE="$( cat "$1" )"
else
  UPDATE="$( cat )"
fi

#translate extensions into numerical priority (separated with "'"; assume it's not part of a file name)
for I in `eval echo {0..$COUNT}`; do
  UPDATE="$( echo "$UPDATE" | sed "s/\.${PRIORITY[$I]}$/'$I/" )"
done

#make a missing extension last priority
UPDATE="$( echo "$UPDATE" | sed "/'/! s/$/'$( echo $(($COUNT+1)) )/" )"

#sort, then keep only the highest priority for each file
DOWNLOAD="$( echo "$UPDATE" | sort | sort -t\' -u -k1,1 )"

#translate priority back to extension
for I in `eval echo {0..$COUNT}`; do
  DOWNLOAD="$( echo "$DOWNLOAD" | sed "s/'$I$/\.${PRIORITY[$I]}/" )"
done

DOWNLOAD="$( echo "$DOWNLOAD" | sed "s/'$( echo $(($COUNT+1)) )//" )"

#output results
echo "$DOWNLOAD"

Of course, you'd have to deal differently with extensions other than those listed, e.g. remove those files from the list before processing it.
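A minimal sketch of that pre-filtering step, assuming GNU grep and the same PRIORITY array as the script above. The function name filter_known is hypothetical. Note that this also drops names with no extension at all, which the script above would otherwise keep as last priority.

```shell
#!/bin/bash
# Hypothetical helper: keep only names ending in one of the listed extensions.
PRIORITY=( xz lzma tar.bz2 tar.gz tgz )

filter_known() {
  local pattern
  # Join the extensions with "|", then escape the dots for an extended regex
  pattern="$( IFS='|'; echo "${PRIORITY[*]}" )"
  pattern="${pattern//./\\.}"
  # Reads names from stdin, prints only those with a listed extension
  grep -E "\.(${pattern})$"
}

# Example: file2.zip and file4 are dropped
printf '%s\n' file1.xz file2.zip file3.tar.gz file4 | filter_known
```

The filtered list can then be piped into the script in place of the raw one.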

You can create a test list fairly easily:

Code:

echo file{1..5}{.{xz,lzma,tar.bz2,tar.gz,tgz},} | tr ' ' '\n' | sort -R | head -n20 > files.txt

Kevin Barry

PS Add INDEX="$( printf '%.2i\n' $I )" before the sed lines and use '$INDEX instead of '$I if you need more than 9 extensions in your list.

grail · 01-31-2011, 12:11 AM

Thanks for the feedback Kevin

You got me thinking and I have a new solution which combines both consolidation of the names from the website along with setting the compression to our order.
Let me know what you think:

Code:

#!/usr/bin/awk -f

BEGIN{
    IGNORECASE = 1
    PRIORITY = "xz lzma bz2 gz" # have left off tgz as it will be found by gz anyway
}

{ match($0, page"-[0-9][[:alnum:].-]+[bglx]z(2|ma)?", file) } # page is a passed in variable

length(file) > 0{
    version = gensub(/^.*-|[.]t.*$/,"","g",file[0])
    if(version in temp_arr){
        if(temp_arr[version] !~ file[0])
            temp_arr[version] = temp_arr[version]" "file[0]
    }
    else
        temp_arr[version] = file[0]
}

END{
    split(PRIORITY, extensions)
    n = asorti(temp_arr, sorted_temp_arr)
    split(temp_arr[sorted_temp_arr[n]], files)

    for(i = 1;i <= length(extensions); i++)
        for(j = 1;j <= length(files); j++)
            if(files[j] ~ extensions[i]"$"){
                print files[j]
                exit 0
            }

    print "Extension is not part of list"
}

To test, you can simply download a listing for any software into a file and then run the script against it.
To test based on our python site I have used the following:

Code:

$ wget -O python http://www.python.org/ftp/python/3.2/
$ ./script.awk -vpage=python python
Python-3.2rc1.tar.xz

EDIT: Turns out asorti sorts the versions as strings, one character at a time, hence 10 is less than 9 and the wrong version is returned

Back to the drawing board.
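The ordering problem can be reproduced outside awk. This is only an illustration using the sort command in place of asorti, with made-up version strings, and it assumes GNU sort for the -V option:

```shell
#!/bin/bash
# Illustrative version strings, not taken from any real listing
versions=( 3.9 3.10 3.2 )

# Plain string sort compares character by character, so "3.10" lands before "3.2"
latest_by_string() { printf '%s\n' "$@" | LC_ALL=C sort | tail -n 1; }

# GNU sort -V compares runs of digits as numbers
latest_by_version() { printf '%s\n' "$@" | sort -V | tail -n 1; }

latest_by_string "${versions[@]}"    # prints 3.9
latest_by_version "${versions[@]}"   # prints 3.10
```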

grail · 01-31-2011, 03:10 AM

So here is take 2 ... and as always it is simpler .. funny how that works:

Code:

#!/usr/bin/awk -f

BEGIN{
    IGNORECASE = 1
    PRIORITY = "xz lzma bz2 gz" # have left off tgz as it will be found by gz anyway
}

{ match($0, page"-[0-9][[:alnum:].-]+[bglx]z(2|ma)?", file) } # page is a passed in variable

length(file) > 0 && file[0] !~ remove{
    ext = gensub(/^.*[.]/,"","1",file[0])
    if(ext in temp_arr){
        if(temp_arr[ext] !~ file[0])
            temp_arr[ext] = temp_arr[ext]"\n"file[0]
    }
    else
        temp_arr[ext] = file[0]
}

END{
    split(PRIORITY, extensions)

    for(i = 1;i <= length(extensions);i++)
        if(temp_arr[extensions[i]]){
            print temp_arr[extensions[i]] | "sort -V | tail -n 1"
            break
        }
}

Happy as usual for anyone to point out any obvious gotchas (if I don't find them first)

I guess the comparison now is to see if a bash only solution (after egrep and sort -V) is any easier or more sustainable?

grail · 02-09-2011, 10:39 PM

So it didn't take too long to find that if the best compression, based on my order, is only available for a lower version and not for the highest one, then I will actually roll back the version
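One way to avoid the rollback is to fix the version first and only then look at compression. A rough sketch, assuming GNU sort -V and an intermediate list of "version extension" pairs (that input format and the name pick_best are assumptions for illustration):

```shell
#!/bin/bash
# Sketch: choose the newest version first, then the preferred extension for it.
# Input on stdin: "version extension" pairs, one per line (assumed format).
PRIORITY=( xz lzma bz2 gz tgz )

pick_best() {
  local pairs latest ext
  pairs="$( cat )"
  # Newest version across all pairs
  latest="$( awk '{ print $1 }' <<< "$pairs" | sort -V | tail -n 1 )"
  # First extension in priority order that exists for that version
  for ext in "${PRIORITY[@]}"; do
    if grep -qxF "$latest $ext" <<< "$pairs"; then
      echo "$latest $ext"
      return 0
    fi
  done
  return 1
}

# xz only exists for the older rc1, so the answer stays on rc2
printf '%s\n' '3.2rc1 xz' '3.2rc1 gz' '3.2rc2 bz2' '3.2rc2 gz' | pick_best   # prints 3.2rc2 bz2
```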

So my new line of thinking here is to get the egrepped and version sorted data and try to manipulate this.
This has now raised a new question based on regular expressions which has me stuck.
Let's say we have reduced the list to only the highest version so that the input looks like:

Code:

Python-3.2rc1.src.rpm    # I know this does not really exist but I am trying to allow for it
Python-3.2rc1.tar.bz2
Python-3.2rc1.tar.gz
Python-3.2rc1.tar.lzma
Python-3.2rc1.tar.xz
Python-3.2rc1.tgz

So my regex question is this - based on the above input, return only the version and extension with some kind of separator:

Code:

3.2rc1|rpm
3.2rc1|bz2
3.2rc1|gz
3.2rc1|lzma
3.2rc1|xz
3.2rc1|tgz

We can assume the only known information will be the application name, in this case Python.

I will post should I work it out, but happy to see others.

Note: I will be using match command from awk, but regex using sed or other is fine.

gnashley · 02-10-2011, 02:26 AM

I have a nice two-hundred-liner in src2pkg which works out such names, LOL, including the 'src', 'source', 'git', etc. Your assumption that you can be sure of the name is a biig assumption... Then there are names like 'xterm224', names with '_' in them, and sources with no version number at all. When you come up with a one or two-liner that can truly solve that, I'll be peeking over your shoulder...

grail · 02-10-2011, 03:17 AM

Quote:

When you come up with a one or two-liner that can truly solve that, I'll be peeking over your shoulder...

And you will be more than welcome

I am not really looking for a single line, as shown by previous awk scripts and the like, but as my distribution is still in an infancy state I am building as I go to the more complicated issues (dependencies being one that is kicking the crap out of me still).

My current system consists of all applications included in CLFS pure64 and a few extras from CBLFS (like python and upstart). Luckily for me, all files currently being dealt with use the above source formats (except the rpm one, which I threw in because it does not contain the word 'tar'). A close rendition of the rest of the code can be found here

I can tell you that whilst aware of things like svn, git and so on, all files are currently retrieved by wget (have looked at curl but still working on that too).

I will definitely let you know what things I come up with though

crts · 02-10-2011, 05:46 AM

Hi grail,

does it really have to be ONE RegEx? How about something like

Code:

sed -r 's/Python[[:punct:]]*([[:digit:]]+.*)\.(src\.|tar\.|)([bglxtr][gpz][z2m]?a?)$/\1|\3/;s/\.src|\.tar// '

Based on your example, it does return what you want. However, maybe some more sample data would be helpful to further test it.

Well, if you really need ONE RegEx then this does also return the same results as the above:

Code:

sed -r 's/Python[[:punct:]]*([[:digit:]]+.*)\.((src\.|tar\.)([bglxr][pz][2m]?a?)|(tgz))$/\1|\4\5/'

On a sidenote: While a version like "Release Candidate 1" is probably pretty stable there still *might* be some issues with it. So maybe a revision of the algorithm that determines the latest version is something to keep in mind. Another issue is that the latest version might be marked alpha or beta.

[EDIT]

Quote:

We can assume the only known information will be the application name, in this case Python.

Does this mean we can't assume that 'src' and/or 'tar' is known?
In this case

Code:

sed -r 's/Python[[:punct:]]*([[:digit:]]+.*)\.(([^\.]+\.)([bglxr][pz][2m]?a?)|(tgz))$/\1|\4\5/'

grail · 02-10-2011, 07:45 AM

Hey crts ... thanks for chiming in and I will get back to you on all that you have shown above once tested further

Quote:

Does this mean we can't assume that 'src' and/or 'tar' is known?

This is correct if we start to expand further as gnashley has said. It is also why I am trying to steer away from putting in things like '|(tgz)'.
Obviously that assumes tgz is the only extension that appears directly after the version, which is not the case (zip, for example).

Also, whilst I take on board what you have said with regards to 'Release Candidates', we cannot ignore that a version could contain alphabetic characters (for example texinfo is texinfo-4.13a.tar.gz).
Hopefully this will give you some ideas, as it is what I have been playing with:

Code:

$ awk 'match($0, /Python-([0-9][[:alnum:].-]+)([.][^.]+)?[.]([^.]+)$/,f){ print f[1]"|"f[3]}' file
3.2rc1.src|rpm
3.2rc1.tar|bz2
3.2rc1.tar|gz
3.2rc1.tar|lzma
3.2rc1.tar|xz
3.2rc1|tgz

$ awk 'match($0, /Python-([0-9][[:alnum:].-]+)([.][^.]+)[.]([^.]+)$/,f){ print f[1]"|"f[3]}' file
3.2rc1|rpm
3.2rc1|bz2
3.2rc1|gz
3.2rc1|lzma
3.2rc1|xz

As you can see, my dilemma is that if I say there 'could' be zero or one then I get the version plus the extra part if it is there (ie tar or src in these examples) but, of course, once the '?' is removed I lose items that have the extension straight after the version (like tgz)
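One possible way around this, offered only as a heuristic and not something from the thread: assume every dot-separated part of the version starts with a digit, while an optional middle part such as tar or src starts with a letter. The greedy version group then cannot swallow the middle part. A bash sketch using BASH_REMATCH (split_name is a hypothetical name):

```shell
#!/bin/bash
# Heuristic (an assumption): version parts start with a digit,
# an optional middle part such as "tar" or "src" starts with a letter.
split_name() {
  local app="$1" file="$2"
  local re="^${app}-([0-9][[:alnum:]-]*([.][0-9][[:alnum:]-]*)*)([.][[:alpha:]][^.]*)?[.]([^.]+)\$"
  if [[ $file =~ $re ]]; then
    # Group 1 is the version, group 4 is the final extension
    echo "${BASH_REMATCH[1]}|${BASH_REMATCH[4]}"
  else
    return 1
  fi
}

for f in Python-3.2rc1.src.rpm Python-3.2rc1.tar.bz2 Python-3.2rc1.tar.xz Python-3.2rc1.tgz; do
  split_name Python "$f"
done
```

For the sample input this gives 3.2rc1|rpm, 3.2rc1|bz2, 3.2rc1|xz and 3.2rc1|tgz. It would not match a name whose version has a part starting with a letter followed by a middle part (something like foo-1.0.beta.tar.gz), so it narrows the problem rather than solving it.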

grail · 02-26-2011, 09:31 AM

So I just thought I would throw an update in on this one. For the time being I am focusing on only compressed files and will tackle things like src.rpm at a later date, mainly because the distro is still new and none of the current applications need non-compressed files.

So with the above in mind, my current awk script looks like:

Code:

#!/usr/bin/awk -f

BEGIN{
    IGNORECASE = 1
    PRIORITY = "xz lzma bz2 gz tgz"
    VSORT = "sort -V | tail -n1"
}

match($0,page".([[:alnum:]._-]+)[.](t(ar[.])?([[:alnum:]]+))",f) && f[0] !~ remove{
    # count each captured group under this version, stopping at the first empty one
    for(i = 0; f[i] != ""; i++)
        arr[f[1],f[i]]++
    if(list !~ f[1])
        list = (list)?list"\n"f[1]:f[1]
}


END{
    n = split(PRIORITY, extensions)
    print list |& VSORT
    close(VSORT, "to")
    VSORT |& getline last
    close(VSORT)

    for(j = 1;j <= n;j++){
        if(arr[last,extensions[j]]){
            print last,extensions[j]
            break
        }
    }
}

If anyone has any thoughts I am happy to listen

grail · 03-01-2011, 06:33 PM

Had to make a few small changes to the match line:

Code:

#previous
match($0,page".([[:alnum:]._-]+)[.](t(ar[.])?([[:alnum:]]+))",f) && f[0] !~ remove{

#new
match($0,page"-([0-9][[:alnum:]._-]+)[.](t(ar[.])?([[:alnum:]]+))",f) && $0 !~ remove{

Change information (in order):

1. - , found when I was wgetting from an ftp site that the . in previous version was being taken up by part of the path (namely /). Will need to see what separators come to light as I go along.

2. [0-9] , some sites have other tarred files, ie not just the source, hence limiting to starting with a digit seems to cover versions of software

3. $0 , the match stops at the compression extension, so testing remove against the matched portion, f[0], never sees anything after it, eg. blah-version.tar.gz.sig
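Point 3 can be seen with a small bash stand-in for the awk match line. The name blah and the remove value are just examples, and check is a hypothetical helper:

```shell
#!/bin/bash
# Simplified stand-in for the awk match line, with page fixed to "blah"
re='blah-([0-9][[:alnum:]._-]+)[.](t(ar[.])?([[:alnum:]]+))'
remove='latest|sig'

check() {
  local line="$1" matched
  [[ $line =~ $re ]] || { echo "no match"; return; }
  matched="${BASH_REMATCH[0]}"
  # Filter tested against the matched text only, as in the previous version
  if [[ $matched =~ $remove ]]; then
    echo "f[0] test: skipped"
  else
    echo "f[0] test: kept ($matched)"
  fi
  # Filter tested against the whole line, as in the new version
  if [[ $line =~ $remove ]]; then
    echo "\$0 test: skipped"
  else
    echo "\$0 test: kept"
  fi
}

check 'blah-1.0.tar.gz.sig'
```

The matched text ends at gz, so only the whole-line test notices the .sig suffix and skips the line.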

If anyone is interested in testing (I will close the thread now), the process is as follows:

Code:

$ wget -O <file_name> <url_to_all_source_for_software>
$ ./script.awk -vpage="<source_name>" -vremove="latest" <file_name>
version compression

Both page and remove are required. remove can also be pipe separated if other items need to be ignored. Using the example from 3. above: -vremove="latest|sig"

Live example:

Code:

$ wget -O python http://www.python.org/ftp/python/3.2/
$ ./script.awk -vpage="python" -vremove="latest" python
3.2rc3 xz