LinuxQuestions.org
01-29-2011, 12:36 AM   #16
grail
Original Poster

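Here is something I was working on that I thought might help get the ball rolling a little:
Code:
#!/bin/bash

# this is using the output file from original post
# awk keeps the last run of consecutive lines that share the same name before ".t"
UPDATE=($(awk -F"[.]t" '{if(x == $1)arr = arr" "$0;else{ x = $1;arr=$0}}END{printf "%s", arr}' output))

# walk the extensions in order of preference and stop at the first file that matches
for ext in xz lzma bz2 gz tgz
do
    for file in "${UPDATE[@]}"
    do
        [[ ${file##*.} == "$ext" ]] && break 2
    done
done

# note: if nothing matched, this is simply the last entry of UPDATE
echo "$file"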
 
01-29-2011, 01:17 PM   #17
ta0kira
Have you considered sorting? You might try something like this:
Code:
#!/bin/bash

PRIORITY=( xz lzma tar.bz2 tar.gz tgz ) #<-- you only need to change this to change priorities
COUNT=$(( ${#PRIORITY[@]} - 1 ))

#read a newline-separated list of file names
if [ $# -gt 0 ]; then
  UPDATE="$( cat "$1" )"
else
  UPDATE="$( cat )"
fi

#translate extensions into numerical priority (separated with "'"; assume it's not part of a file name)
for (( I = 0; I <= COUNT; I++ )); do
  UPDATE="$( echo "$UPDATE" | sed "s/\.${PRIORITY[$I]}$/'$I/" )"
done

#make a missing extension last priority
UPDATE="$( echo "$UPDATE" | sed "/'/! s/$/'$(( COUNT + 1 ))/" )"

#sort, then keep only the highest priority for each file
DOWNLOAD="$( echo "$UPDATE" | sort | sort -t\' -u -k1,1 )"

#translate priority back to extension
for (( I = 0; I <= COUNT; I++ )); do
  DOWNLOAD="$( echo "$DOWNLOAD" | sed "s/'$I$/\.${PRIORITY[$I]}/" )"
done

#strip the "missing extension" marker again
DOWNLOAD="$( echo "$DOWNLOAD" | sed "s/'$(( COUNT + 1 ))$//" )"

#output results
echo "$DOWNLOAD"
Of course, you'd have to deal separately with extensions other than those listed, e.g. by removing those files from the list before processing it.

You can create a test list fairly easily:
Code:
echo file{1..5}{.{xz,lzma,tar.bz2,tar.gz,tgz},} | tr ' ' '\n' | sort -R | head -n20 > files.txt
Kevin Barry

PS Add INDEX="$( printf '%.2i\n' $I )" before the sed lines and use '$INDEX instead of '$I if you need more than 9 extensions in your list.
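
The reason for the padding in the PS is that sort compares the priority markers as text, so '10 would land ahead of '2. A minimal sketch of the effect, assuming nothing beyond bash's printf and sort (pad_index is a made-up helper name, not part of the script above):

```shell
#!/bin/bash
# pad_index is a hypothetical helper: zero-pad a priority to two digits
pad_index() {
    printf '%.2i\n' "$1"
}

# Unpadded markers: a plain text sort puts priority 10 ahead of priority 2
unpadded_first=$(printf "file'%s\n" 2 10 | LC_ALL=C sort | head -n1)

# Padded markers: priority 02 sorts ahead of priority 10, as intended
padded_first=$(for i in 2 10; do
                   printf "file'%s\n" "$(pad_index "$i")"
               done | LC_ALL=C sort | head -n1)

echo "$unpadded_first"   # file'10
echo "$padded_first"     # file'02
```

With two digits the approach holds up to 99 extensions, which should be more than enough for a priority list.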

Last edited by ta0kira; 01-29-2011 at 02:01 PM. Reason: added "tar." to "bz2" and "gz", updated for missing extension, then got a little carried away
 
01-31-2011, 12:11 AM   #18
grail
Original Poster
Thanks for the feedback Kevin

You got me thinking, and I have a new solution that both consolidates the names from the website and applies our compression order.
Let me know what you think:
Code:
#!/usr/bin/awk -f

BEGIN{
    IGNORECASE = 1
    PRIORITY = "xz lzma bz2 gz" # have left off tgz as it will be found by gz anyway
}

{ match($0, page"-[0-9][[:alnum:].-]+[bglx]z(2|ma)?", file) } # page is a passed in variable

length(file) > 0{
    version = gensub(/^.*-|[.]t.*$/,"","g",file[0])
    if(version in temp_arr){
        if(temp_arr[version] !~ file[0])
            temp_arr[version] = temp_arr[version]" "file[0]
    }
    else
        temp_arr[version] = file[0]
}

END{
    split(PRIORITY, extensions)
    n = asorti(temp_arr, sorted_temp_arr)
    split(temp_arr[sorted_temp_arr[n]], files)

    for(i = 1;i <= length(extensions); i++)
        for(j = 1;j <= length(files); j++)
            if(files[j] ~ extensions[i]"$"){
                print files[j]
                exit 0
            }

    print "Extension is not part of list"
}
To test, simply download a listing for any software into a file and run the script against it.
For a test based on our Python site I used the following:
Code:
$ wget -O python http://www.python.org/ftp/python/3.2/
$ ./script.awk -vpage=python python
Python-3.2rc1.tar.xz
EDIT: It turns out that asorti compares the version strings one character at a time, so 10 sorts before 9 and the wrong version is returned.
Back to the drawing board.
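
The problem is easy to reproduce outside awk. A small sketch, assuming GNU sort (its -V option does version ordering); the function names and the made-up file names are only for illustration:

```shell
#!/bin/bash
# Both helpers read one name per line on stdin and print the "highest" one.
latest_by_text()    { LC_ALL=C sort    | tail -n1; }  # plain string comparison
latest_by_version() { LC_ALL=C sort -V | tail -n1; }  # GNU version comparison

names='app-3.9
app-3.10'

printf '%s\n' "$names" | latest_by_text      # app-3.9  (the "1" in 10 sorts before "9")
printf '%s\n' "$names" | latest_by_version   # app-3.10
```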

Last edited by grail; 01-31-2011 at 01:47 AM. Reason: failure
 
01-31-2011, 03:10 AM   #19
grail
Original Poster
So here is take 2 ... and as always it is simpler .. funny how that works:
Code:
#!/usr/bin/awk -f

BEGIN{
    IGNORECASE = 1
    PRIORITY = "xz lzma bz2 gz tgz" # tgz must be listed here: files are keyed by their exact extension
}

{ match($0, page"-[0-9][[:alnum:].-]+[bglx]z(2|ma)?", file) } # page is a passed in variable

length(file) > 0 && file[0] !~ remove{
    ext = gensub(/^.*[.]/,"","1",file[0])
    if(ext in temp_arr){
        if(temp_arr[ext] !~ file[0])
            temp_arr[ext] = temp_arr[ext]"\n"file[0]
    }
    else
        temp_arr[ext] = file[0]
}

END{
    split(PRIORITY, extensions)

    for(i = 1;i <= length(extensions);i++)
        if(temp_arr[extensions[i]]){
            print temp_arr[extensions[i]] | "sort -V | tail -n 1"
            break
        }
}
Happy as usual for anyone to point out any obvious gotchas (if I don't find them first).

I guess the comparison now is to see if a bash only solution (after egrep and sort -V) is any easier or more sustainable?
 
02-09-2011, 10:39 PM   #20
grail
Original Poster
It didn't take long to find a flaw: if the best compression in my order exists only for a lower version and not for the highest one, the script actually rolls the version back.

So my new line of thinking here is to get the egrepped and version sorted data and try to manipulate this.
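
One possible shape for that idea, as a rough bash sketch rather than a finished solution: find the newest version first and only then walk the compression order, so a better compression on an older version can never win. It assumes names of the form name-version.tar.ext or name-version.tgz and GNU sort -V; pick_download is a hypothetical helper name.

```shell
#!/bin/bash
PRIORITY=(xz lzma bz2 gz tgz)

# pick_download NAME FILE... -> prints the preferred archive of the newest version
pick_download() {
    local name=$1; shift
    local f v latest ext

    # Step 1: reduce every file name to its version and keep the highest one
    latest=$(for f in "$@"; do
                 v=${f#"$name"-}   # drop the name prefix
                 v=${v%.tar.*}     # drop ".tar.<compression>"
                 v=${v%.tgz}       # or a bare ".tgz"
                 printf '%s\n' "$v"
             done | sort -V | tail -n1)

    # Step 2: among files of that version only, take the first extension in PRIORITY
    for ext in "${PRIORITY[@]}"; do
        for f in "$@"; do
            case $f in
                "$name-$latest.tar.$ext"|"$name-$latest.$ext")
                    printf '%s\n' "$f"
                    return 0 ;;
            esac
        done
    done
    return 1   # newest version has no archive in a listed format
}

# 3.1 offers xz, but 3.2 is newer, so the bz2 of 3.2 is chosen
pick_download Python Python-3.1.tar.xz Python-3.1.tar.gz Python-3.2.tar.bz2 Python-3.2.tgz
```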
This has now raised a new question based on regular expressions which has me stuck.
Let's say we have reduced the list to only the highest version, so that the input looks like:
Code:
Python-3.2rc1.src.rpm    # I know this does not really exist but I am trying to allow for it
Python-3.2rc1.tar.bz2
Python-3.2rc1.tar.gz
Python-3.2rc1.tar.lzma
Python-3.2rc1.tar.xz
Python-3.2rc1.tgz
So my regex question is this - based on the above input, return only the version and extension with some kind of separator:
Code:
3.2rc1|rpm
3.2rc1|bz2
3.2rc1|gz
3.2rc1|lzma
3.2rc1|xz
3.2rc1|tgz
We can assume the only known information will be the application name, in this case Python.

I will post should I work it out, but happy to see others.

Note: I will be using the match function from awk, but a regex for sed or anything else is fine.
 
02-10-2011, 02:26 AM   #21
gnashley
I have a nice two-hundred-liner in src2pkg which works out such names, LOL, including the 'src', 'source', 'git', etc. Your assumption that you can be sure of the name is a biig assumption... Then there are names like 'xterm224', names with '_' in them, and sources with no version number at all. When you come up with a one or two-liner that can truly solve that, I'll be peeking over your shoulder...
 
02-10-2011, 03:17 AM   #22
grail
Original Poster
Quote:
When you come up with a one or two-liner that can truly solve that, I'll be peeking over your shoulder...
And you will be more than welcome. I am not really looking for a single line, as the previous awk scripts show, but as my distribution is still in its infancy I am building as I go and working up to the more complicated issues (dependencies being one that is still kicking the crap out of me).

My current system consists of all the applications included in CLFS pure64 plus a few extras from CBLFS (like python and upstart). Luckily for me, all files currently being dealt with use the above source formats (except the rpm one, which I threw in because it does not contain the word 'tar'). A close rendition of the rest of the code can be found here

I can tell you that whilst aware of things like svn, git and so on, all files are currently retrieved by wget (have looked at curl but still working on that too).

I will definitely let you know what I come up with, though.
 
02-10-2011, 05:46 AM   #23
crts
Hi grail,

Does it really have to be ONE RegEx? How about something like
Code:
sed -r 's/Python[[:punct:]]*([[:digit:]]+.*)\.(src\.|tar\.|)([bglxtr][gpz][z2m]?a?)$/\1|\3/;s/\.src|\.tar// '
Based on your example, it does return what you want. However, maybe some more sample data would be helpful to further test it.

Well, if you really need ONE RegEx then this does also return the same results as the above:
Code:
sed -r 's/Python[[:punct:]]*([[:digit:]]+.*)\.((src\.|tar\.)([bglxr][pz][2m]?a?)|(tgz))$/\1|\4\5/'
On a sidenote: While a version like "Release Candidate 1" is probably pretty stable there still *might* be some issues with it. So maybe a revision of the algorithm that determines the latest version is something to keep in mind. Another issue is that the latest version might be marked alpha or beta.

[EDIT]
Quote:
We can assume the only known information will be the application name, in this case Python.
Does this mean we can't assume that 'src' and/or 'tar' is known?
In this case
Code:
sed -r 's/Python[[:punct:]]*([[:digit:]]+.*)\.(([^\.]+\.)([bglxr][pz][2m]?a?)|(tgz))$/\1|\4\5/'

Last edited by crts; 02-10-2011 at 06:07 AM.
 
02-10-2011, 07:45 AM   #24
grail
Original Poster
Hey crts ... thanks for chiming in. I will get back to you on all that you have shown above once I have tested further.
Quote:
Does this mean we can't assume that 'src' and/or 'tar' is known?
This is correct if we start to expand further as gnashley has said. It is also why I am trying to steer away from putting in things like '|(tgz)'.
Obviously that alternative assumes tgz is the only extension that can appear directly after the version, which is not so (zip, for example).

Also, whilst I take on board what you have said about release candidates, we cannot ignore that a version could contain letters (for example texinfo is texinfo-4.13a.tar.gz).
Hopefully this will give you some ideas, as it is what I have been playing with:
Code:
$ awk 'match($0, /Python-([0-9][[:alnum:].-]+)([.][^.]+)?[.]([^.]+)$/,f){ print f[1]"|"f[3]}' file
3.2rc1.src|rpm
3.2rc1.tar|bz2
3.2rc1.tar|gz
3.2rc1.tar|lzma
3.2rc1.tar|xz
3.2rc1|tgz

$ awk 'match($0, /Python-([0-9][[:alnum:].-]+)([.][^.]+)[.]([^.]+)$/,f){ print f[1]"|"f[3]}' file
3.2rc1|rpm
3.2rc1|bz2
3.2rc1|gz
3.2rc1|lzma
3.2rc1|xz
As you can see, my dilemma is this: if I say there could be zero or one extra component, I get the version plus the extra piece when it is there (i.e. tar or src in these examples), but once the '?' is removed I lose the items whose extension comes straight after the version (like tgz).
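
One way around that dilemma, sketched in bash rather than as a single regex: take the text after the last dot as the extension, then drop one trailing dot-component from the version only if it is purely alphabetic (tar, src). This relies on the assumption, not guaranteed in general, that every real version component contains a digit; split_name is a hypothetical helper.

```shell
#!/bin/bash
# split_name NAME FILE -> prints "version|extension"
split_name() {
    local name=$1 file=$2
    local letters_only='^(.*)\.[[:alpha:]]+$'
    local rest=${file#"$name"-}      # drop the "Python-" style prefix
    local ext=${rest##*.}            # text after the last dot
    local version=${rest%."$ext"}    # everything before that dot

    # Assumption: a trailing all-letters component (tar, src) is packaging, not version
    if [[ $version =~ $letters_only ]]; then
        version=${BASH_REMATCH[1]}
    fi
    printf '%s|%s\n' "$version" "$ext"
}

for f in Python-3.2rc1.src.rpm Python-3.2rc1.tar.bz2 Python-3.2rc1.tgz; do
    split_name Python "$f"
done
# prints:
# 3.2rc1|rpm
# 3.2rc1|bz2
# 3.2rc1|tgz
```

A version such as texinfo's 4.13a survives because its last component still contains digits; a name like foo-1.0.beta.tar.gz would lose ".beta" and shows where the assumption breaks.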
 
02-26-2011, 09:31 AM   #25
grail
Original Poster
So I just thought I would throw in an update on this one. For the time being I am focusing only on compressed files and will tackle things like src.rpm at a later date, mainly because the distro is still new and none of its current applications need non-compressed files.

So with the above in mind, my current awk script looks like:
Code:
#!/usr/bin/awk -f

BEGIN{
    IGNORECASE = 1
    PRIORITY = "xz lzma bz2 gz tgz"
    VSORT = "sort -V | tail -n1"
}

match($0,page".([[:alnum:]._-]+)[.](t(ar[.])?([[:alnum:]]+))",f) && f[0] !~ remove{
    # record every captured group (which includes the extension) against this version
    for(i = 0; f[i] != ""; i++)
        arr[f[1],f[i]]++

    if(list !~ f[1])
        list = (list)?list"\n"f[1]:f[1]
}


END{
    n = split(PRIORITY, extensions)
    print list |& VSORT
    close(VSORT, "to")
    VSORT |& getline last
    close(VSORT)

    for(j = 1;j <= n;j++){
        if(arr[last,extensions[j]]){
            print last,extensions[j]
            break
        }
    }
}
If anyone has any thoughts I am happy to listen
 
03-01-2011, 06:33 PM   #26
grail
Original Poster
Had to make a few small changes to the match line:
Code:
#previous
match($0,page".([[:alnum:]._-]+)[.](t(ar[.])?([[:alnum:]]+))",f) && f[0] !~ remove{

#new
match($0,page"-([0-9][[:alnum:]._-]+)[.](t(ar[.])?([[:alnum:]]+))",f) && $0 !~ remove{
Change information (in order):

1. - : when wgetting from an FTP site I found that the . in the previous version was matching part of the path (namely /). I will need to see what other separators come to light as I go along.

2. [0-9] : some sites carry other tarred files besides the source, so requiring the version to start with a digit seems to cover software versions.

3. $0 : testing only the matched portion, f[0], missed anything after the compression extension, e.g. blah-version.tar.gz.sig, because the match stops there.

If anyone is interested in testing (I will close the thread now), the process is as follows:
Code:
$ wget -O <file_name> <url_to_all_source_for_software>
$ ./script.awk -vpage="<source_name>" -vremove="latest" <file_name>
version compression
Both page and remove are required. remove can also be a pipe-separated list if other items need to be ignored; using the example from 3. above: -vremove="latest|sig"
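
To illustrate how a pipe-separated remove value behaves, here is a rough shell equivalent of the $0 !~ remove test using grep. The helper name filter_listing and the sample names are invented for the example; -i stands in for the script's IGNORECASE = 1.

```shell
#!/bin/bash
# filter_listing PATTERN: drop every input line that matches the extended regex PATTERN
filter_listing() {
    grep -Eiv -- "$1"
}

printf '%s\n' \
    'Python-3.2rc3.tar.xz' \
    'Python-3.2rc3.tar.xz.sig' \
    'python-latest.tar.gz' |
    filter_listing 'latest|sig'
# prints: Python-3.2rc3.tar.xz
```

Because the whole line is tested, the .sig entry is dropped even though its name begins with a perfectly valid archive name.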

Live example:
Code:
$ wget -O python http://www.python.org/ftp/python/3.2/
$ ./script.awk -vpage="python" -vremove="latest" python
3.2rc3 xz
 
  


All times are GMT -5.