So something I was working on I thought might help get the ball rolling a little:
Code:
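#!/bin/bash
# This uses the output file from the original post.
# awk keeps only the last run of names sharing the same base (the text before ".t").
UPDATE=($(awk -F"[.]t" '{if(x == $1) arr = arr" "$0; else{ x = $1; arr = $0 }} END{ print arr }' output))
# Walk the extensions in priority order and stop at the first file that matches.
for ext in xz lzma bz2 gz tgz
do
    for file in "${UPDATE[@]}"
    do
        [[ ${file##*.} == "$ext" ]] && break 2
    done
done
echo "$file"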
Have you considered sorting? You might try something like this:
Code:
#!/bin/bash
PRIORITY=( xz lzma tar.bz2 tar.gz tgz ) #<-- you only need to change this to change priorities
COUNT=$(( ${#PRIORITY[@]} - 1 ))
#read a newline-separated list of file names
if [ $# -gt 0 ]; then
    UPDATE="$( cat "$1" )"
else
    UPDATE="$( cat )"
fi
#translate extensions into numerical priority (separated with "'"; assume it's not part of a file name)
for (( I = 0; I <= COUNT; I++ )); do
    UPDATE="$( echo "$UPDATE" | sed "s/\.${PRIORITY[$I]}$/'$I/" )"
done
#make a missing extension last priority
UPDATE="$( echo "$UPDATE" | sed "/'/! s/$/'$((COUNT+1))/" )"
#sort, then keep only the highest priority for each file
DOWNLOAD="$( echo "$UPDATE" | sort | sort -t\' -u -k1,1 )"
#translate priority back to extension
for (( I = 0; I <= COUNT; I++ )); do
    DOWNLOAD="$( echo "$DOWNLOAD" | sed "s/'$I$/\.${PRIORITY[$I]}/" )"
done
#strip the placeholder priority from names that had no listed extension
DOWNLOAD="$( echo "$DOWNLOAD" | sed "s/'$((COUNT+1))\$//" )"
#output results
echo "$DOWNLOAD"
Of course, you'd have to deal with extensions other than those listed differently, e.g. remove those files from the list before processing it.
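As a rough sketch of that pre-filtering step (the function name filter_known is made up for illustration, and the extension list simply mirrors PRIORITY above), something like this could be run on the list first:

Code:
```shell
#!/bin/sh
# Keep only file names ending in one of the known extensions.
# filter_known is a hypothetical helper; the list mirrors PRIORITY above.
filter_known() {
    grep -E '\.(xz|lzma|tar\.bz2|tar\.gz|tgz)$'
}

# Example: the .sig and .zip entries are dropped before any sorting happens.
printf '%s\n' foo-1.0.tar.gz foo-1.0.tar.gz.sig foo-1.0.zip foo-1.0.tgz | filter_known
```

The output of filter_known can then be piped straight into the script above, since it reads standard input when no file argument is given.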
You got me thinking and I have a new solution which combines both consolidation of the names from the website along with setting the compression to our order.
Let me know what you think:
Code:
#!/usr/bin/awk -f
# Needs gawk: uses IGNORECASE, gensub(), asorti() and the three-argument match().
BEGIN{
    IGNORECASE = 1
    PRIORITY = "xz lzma bz2 gz" # have left off tgz as it will be found by gz anyway
}
{ match($0, page"-[0-9][[:alnum:].-]+[bglx]z(2|ma)?", file) } # page is a passed in variable
length(file) > 0{
    # version is whatever sits between the last "-" and the first ".t"
    version = gensub(/^.*-|[.]t.*$/,"","g",file[0])
    if(version in temp_arr){
        if(temp_arr[version] !~ file[0])
            temp_arr[version] = temp_arr[version]" "file[0]
    }
    else
        temp_arr[version] = file[0]
}
END{
    split(PRIORITY, extensions) # extensions in order of preference
    n = asorti(temp_arr, sorted_temp_arr)
    split(temp_arr[sorted_temp_arr[n]], files)
    for(i = 1;i <= length(extensions); i++)
        for(j = 1;j <= length(files); j++)
            if(files[j] ~ extensions[i]"$"){
                print files[j]
                exit 0
            }
    print "Extension is not part of list"
}
To test, you can simply download a listing for any software into a file and then run the script against it.
To test against the Python site I have used the following:
So it didn't take too long to find that if the best compression, based on my order, is only available for a lower version and not for the highest one, then I will actually roll back the version.
So my new line of thinking here is to get the egrepped and version sorted data and try to manipulate this.
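A minimal sketch of that idea, picking the highest version first and only then choosing the compression. It assumes GNU sort (for -V), a list that has already been reduced to names ending in one of the listed archive suffixes, and the made-up function name pick_latest:

Code:
```shell
#!/bin/sh
# Pick the highest version first, then the preferred compression for it.
# pick_latest is a hypothetical name; assumes GNU sort -V and a pre-filtered list.
pick_latest() {
    list=$(cat)
    # Strip the archive suffix to get "name-version", then take the highest.
    latest=$(printf '%s\n' "$list" |
        sed -E 's/\.(tar\.(xz|lzma|bz2|gz)|tgz)$//' |
        sort -V | tail -n 1)
    # Walk the extensions in priority order and print the first one that exists.
    for ext in tar.xz tar.lzma tar.bz2 tar.gz tgz; do
        if printf '%s\n' "$list" | grep -Fqx "$latest.$ext"; then
            printf '%s\n' "$latest.$ext"
            return 0
        fi
    done
    return 1
}

# 3.1 has an xz tarball but 3.2 does not; the newer version still wins.
printf '%s\n' Python-3.1.tar.xz Python-3.1.tar.bz2 Python-3.2.tar.bz2 Python-3.2.tar.gz | pick_latest
# prints Python-3.2.tar.bz2
```

Because the version is fixed before any extension is looked at, a better compression on an older release can no longer cause a rollback.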
This has now raised a new question based on regular expressions which has me stuck.
Let's say we have reduced the list to only the highest version so that the input looks like:
Code:
Python-3.2rc1.src.rpm # I know this does not really exist but I am trying to allow for it
Python-3.2rc1.tar.bz2
Python-3.2rc1.tar.gz
Python-3.2rc1.tar.lzma
Python-3.2rc1.tar.xz
Python-3.2rc1.tgz
So my regex question is this - based on the above input, return only the version and extension with some kind of separator:
I have a nice two-hundred-liner in src2pkg which works out such names, LOL, including the 'src', 'source', 'git', etc. Your assumption that you can be sure of the name is a biig assumption... Then there are names like 'xterm224', names with '_' in them, and sources with no version number at all. When you come up with a one or two-liner that can truly solve that, I'll be peeking over your shoulder...
Quote:
When you come up with a one or two-liner that can truly solve that, I'll be peeking over your shoulder...
And you will be more than welcome. I am not really looking for a single line, as shown by the previous awk scripts and the like, but as my distribution is still in its infancy I am building up as I go towards the more complicated issues (dependencies being one that is still kicking the crap out of me).
My current system consists of all applications included in CLFS pure64 and a few extras from CBLFS (like python and upstart). Luckily for me, all files currently being dealt with use the above source formats (except the rpm one, which I threw in because it does not contain the word 'tar'). A close rendition of the rest of the code can be found here.
I can tell you that whilst I am aware of things like svn, git and so on, all files are currently retrieved by wget (I have looked at curl but am still working on that too).
I will definitely let you know what I come up with though.
Does it really have to be ONE RegEx? How about something like:
Code:
sed -r 's/Python[[:punct:]]*([[:digit:]]+.*)\.(src\.|tar\.|)([bglxtr][gpz][z2m]?a?)$/\1|\3/;s/\.src|\.tar// '
Based on your example, it does return what you want. However, maybe some more sample data would be helpful to further test it.
Well, if you really need ONE RegEx then this does also return the same results as the above:
Code:
sed -r 's/Python[[:punct:]]*([[:digit:]]+.*)\.((src\.|tar\.)([bglxr][pz][2m]?a?)|(tgz))$/\1|\4\5/'
On a sidenote: While a version like "Release Candidate 1" is probably pretty stable there still *might* be some issues with it. So maybe a revision of the algorithm that determines the latest version is something to keep in mind. Another issue is that the latest version might be marked alpha or beta.
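If pre-releases should be skipped altogether, one possible pre-filter looks like the sketch below. The function name drop_prerelease and the tag list (rc, alpha, beta) are assumptions for illustration and will not fit every project's naming scheme:

Code:
```shell
#!/bin/sh
# Drop names whose version carries a pre-release tag (rc, alpha, beta),
# optionally preceded by - or _ and optionally followed by a number.
# drop_prerelease is a hypothetical helper; the tag list is an assumption.
drop_prerelease() {
    grep -Eiv -- '-[0-9][0-9.]*[-_]?(rc|alpha|beta)[0-9]*\.'
}

# A bare letter suffix such as 4.13a is kept; only the listed tags are dropped.
printf '%s\n' Python-3.2rc1.tar.gz Python-3.1.3.tar.gz texinfo-4.13a.tar.gz | drop_prerelease
```

Running this before the "latest version" step means the highest remaining version is a regular release, at the cost of never seeing release candidates at all.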
[EDIT]
Quote:
We can assume the only known information will be the application name, in this case Python.
Does this mean we can't assume that 'src' and/or 'tar' is known?
In this case
Code:
sed -r 's/Python[[:punct:]]*([[:digit:]]+.*)\.(([^\.]+\.)([bglxr][pz][2m]?a?)|(tgz))$/\1|\4\5/'
Hey crts ... thanks for chiming in and I will get back to you on all that you have shown above once tested further
Quote:
Does this mean we can't assume that 'src' and/or 'tar' is known?
This is correct if we start to expand further as gnashley has said. It is also why I am trying to steer away from putting in things like '|(tgz)'.
Obviously this assumes that tgz is the only case where an extension appears directly after the version (zip, for example, would be another).
Also, whilst I take on board what you have said with regards to 'Release Candidates', we cannot ignore that a version could have alpha items (for example texinfo is texinfo-4.13a.tar.gz).
Hopefully this will give you some ideas, as it is what I have been playing with:
As you can see, my dilemma is that if I say there 'could' be zero or one then I get the version plus the extra part if it is there (i.e. tar or src in these examples) but, of course, once the '?' is removed then I lose items that have the extension straight after the version (like tgz).
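One way to sidestep the zero-or-one dilemma is to drop the optional group and work in two passes: first split at the last dot, then trim a trailing .tar or .src from the version part. This is only a sketch; the function name version_and_ext is made up, the application name is assumed to be known and free of regex metacharacters, and the version is assumed to start with a digit:

Code:
```shell
#!/bin/sh
# Turn "Name-version[.tar|.src].ext" into "version|ext" in two passes.
# version_and_ext is a hypothetical helper; $1 is the known application name.
version_and_ext() {
    name=$1
    sed -E \
        -e "s/^${name}[-_]?([0-9].*)\.([^.]+)\$/\1|\2/" \
        -e 's/\.(tar|src)\|/|/'
}

printf '%s\n' Python-3.2rc1.tar.bz2 Python-3.2rc1.tgz Python-3.2rc1.src.rpm |
    version_and_ext Python
# prints:
# 3.2rc1|bz2
# 3.2rc1|tgz
# 3.2rc1|rpm
```

The first expression is greedy, so the version initially swallows ".tar" or ".src"; the second expression removes that only when it sits directly before the separator, which leaves tgz-style names untouched.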
So I just thought I would throw in an update on this one. For the time being I am focusing only on compressed files and will tackle things like src.rpm at a later date, mainly because the distro is still new and none of the current applications require non-compressed files.
So with the above in mind, my current awk script looks like:
1. - : when wgetting from an ftp site I found that the . used in the previous version was being taken up by part of the path (namely /). I will need to see what other separators come to light as I go along.
2. [0-9] : some sites have other tarred files, i.e. not just the source, so requiring the match to start with a digit seems to cover versions of software.
3. $0 : using the matched portion, f[0], did not allow for the fact that the match does not look past the compression extension, e.g. blah-version.tar.gz.sig
If anyone is interested in testing (I will close the thread now), the process is as follows:
Both page and remove are required. remove can also be pipe-separated if other items need to be ignored. Using the example from 3. above: -vremove="latest|sig"
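To illustrate how the two variables interact, here is a hypothetical stand-in (not the script from this thread): it passes page and remove to awk with -v, tests remove against the whole line as in point 3, and requires the version to start with a digit as in point 2. The function name list_candidates is made up:

Code:
```shell
#!/bin/sh
# Hypothetical stand-in, not the original script: print lines that contain
# "<page>-<digit>" and do not match the pipe-separated remove pattern.
list_candidates() {
    awk -v page="$1" -v remove="$2" '
        $0 ~ remove { next }            # test the whole line, so blah.tar.gz.sig is caught
        $0 ~ (page "-[0-9]") { print }  # the version must start with a digit
    '
}

printf '%s\n' Python-3.2.tar.gz Python-3.2.tar.gz.sig Python-latest.tar.gz |
    list_candidates Python 'latest|sig'
# prints Python-3.2.tar.gz
```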