LinuxQuestions.org


SilversleevesX 08-18-2010 10:46 PM

BASH Sort list by end of line to x position in each line?
 
I'm trying to make another file annotation script a little speedier than it has been by the up-until-now proven method of checking the last four characters in a filename before the "dot" (eg .jpg, .psd) against a list of known IPTC categories and Exiv2 command files. It occurred to me that if one script generated a list of files in directory foo, and the same or another script sorted that list by that four-letter tag, then that list could be used (instead of a for/do/done loop on the real files in the folder) by the command-file-matching script to "vomit out" which annotator file would go with file nastynewfile.jpg, f'r'instance.

The script I had been using for this task looks like this:
Code:

while read 'line';
do
        sp=$(echo $line)
        vc=$(echo $sp | cut -d"," -f1)
        cv=$(echo $sp | cut -d"," -f2)
        dv=$(echo $sp | cut -d"," -f3)

        for striM in $(ls *jpg);
        do
       
                k=$(echo $striM | cut -d'.' -f1)
                j=$(echo ${k:(-4)})
                m=$(echo ${k%????})
                if grep -q $vc <<<$striM; then
                        matchX=$striM
                    echo -e "I match $matchX with command file '$cv': \033[1;36m$dv.\033[0m"
                        echo -e "$matchX:$cv">>/cygdrive/c/blu/newest/tagmatch.txt
                fi
        done
done</cygdrive/c/blu/newest/nbin/catstag
cd /cygdrive/c/blu/newest
sort -t":" -k1 tagmatch.txt>tagmatch-sorted.txt
rm tagmatch.txt
mv tagmatch-sorted.txt tagmatch.txt

while the acorn of my new script looks like this:
Code:

touch templist
for file in $(ls *jpg)
do
        echo -e $file>>templist
done

g=1
f=$(cat templist | wc -l)
while [ "$g" -le "$f" ];
do

Where I seem to be stuck is with how to sort the lines in templist, which may be any number of different lengths, from back to front. sort -k looked promising, except it seems only to work the other way round. I thought of invoking a
Code:

q=$(expr length $line); echo $q
n=$[q-8]; echo $n

kind of thing, but that presented the problems of how to sort by those, how to tell sort where to find them (grep?) and how to "stitch them back in" to the original list, which is what I want to sort in the first place.
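One common answer to the "stitch them back in" problem is to decorate, sort, and undecorate: prefix each line with its sort key, sort on that prefix, then cut the prefix off again. The following is only a sketch under assumptions: the function name sort_by_hinter is made up for illustration, the key is taken to be the four characters before the last dot, and file names are assumed to contain no tabs or newlines.

```shell
#!/bin/bash
# sort_by_hinter is a hypothetical name, not something from the thread.
# Reads file names on stdin and prints them sorted by the four characters
# that precede the extension, then by the full name.
sort_by_hinter() {
    local name stem
    while IFS= read -r name; do
        stem=${name%.*}                            # strip ".jpg", ".psd", ...
        printf '%s\t%s\n' "${stem: -4}" "$name"    # decorate: key<TAB>name
    done |
        LC_ALL=C sort -t $'\t' -k1,1 -k2 |         # sort on the key first
        cut -f2-                                   # undecorate: drop the key
}

# Possible use in place of the templist loop:
#   printf '%s\n' *.jpg | sort_by_hinter > templist
```

Because the key travels on the same line as the name, sort never has to be told where in a variable-length line the tag sits.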

Any help moving this forward would be much appreciated.

BZT

konsolebox 08-18-2010 11:22 PM

Can you give us the contents of /cygdrive/c/blu/newest/nbin/catstag?

Also, you don't have to be so conservative when choosing variable names. You can always have $SOMETHING_LONG_AND_DESCRIPTIVE. It doesn't really affect speed. It also helps the people who are helping you to understand your code easily.

Some quick tips:
Code:

f=0
for file in *jpg
do
        let f++
        echo -e "$file"
done > templist

g=1
while [[ g -le f ]];
do

In simple commands, always place double quotes around variables ("$var").
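A small sketch of why the quotes matter, assuming the default IFS. The helper count_args and the file name are made up for illustration.

```shell
#!/bin/bash
# count_args is a hypothetical helper: it prints how many arguments it got.
count_args() { echo "$#"; }

file="beach party-001beac.jpg"    # made-up name containing a space

unquoted=$(count_args $file)      # the shell splits the value at the space
quoted=$(count_args "$file")      # the value is passed as a single word

echo "unquoted: $unquoted, quoted: $quoted"
```

With the default IFS this prints "unquoted: 2, quoted: 1", which is the difference between a command seeing one file name and seeing two pieces of one.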

ghostdog74 08-18-2010 11:33 PM

You can "speed up" your code by reducing calls to external commands. Calls such as cut, or grep to search for a string within a string, are unnecessary. When you want to assign one variable to another, there is no need to use "echo". Also, don't use ls to list your files in a for loop; use shell expansion. For finding patterns, you can use case/esac instead of grep.
Code:


while read -r line;
do
        sp="$line"
        OLDIFS="$IFS"
        IFS=","
        set -- $sp
        vc=$1
        cv=$2
        dv=$3
        IFS="$OLDIFS"
        for striM in *jpg
        do
                OLDIFS="$IFS"
                IFS="."
                set -- $striM
                k=$1
                j=${k:(-4)}
                m=${k%????}
                case "$striM" in
                    *$vc* ) matchX=$striM
                        echo "whatever ... "
                    ;;
                esac
                IFS="$OLDIFS"
        done
done</cygdrive/c/blu/newest/nbin/catstag


jschiwal 08-18-2010 11:37 PM

Quote:

for file in $(ls *jpg)
for file in *.jpg
does the same thing. The results will be sorted as well.

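If there will be 20,000 jpg files in the directory, the $(ls *jpg) form can fail with an "Argument list too long" error, since the shell expands the * before passing the arguments to ls (the argv array). The plain for file in *.jpg loop never leaves the shell, so it is not subject to that limit, though the expanded list still has to fit in memory.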

If you want to remove or replace the extension of a filename, simply use variable substitution
file=picture1.jpg
annotation="${file%.jpg}.ano"

file=picture.jpg
convert "$file" "${file%.jpg}".png
Quote:

touch templist
Touching a file will create it if it doesn't exist, but won't zero it out if it does. You could use this:
: >templist
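
A quick way to see the difference, sketched with a scratch file from mktemp standing in for templist (the variable names are made up for illustration):

```shell
#!/bin/bash
tmp=$(mktemp)                     # scratch file standing in for templist
echo "left over from the last run" > "$tmp"

touch "$tmp"                      # updates the timestamp only
[ -s "$tmp" ] && after_touch="still has content"

: > "$tmp"                        # truncates the file to zero length
[ -s "$tmp" ] || after_truncate="empty"

echo "after touch: $after_touch; after ': >': $after_truncate"
rm -f "$tmp"
```

This matters for a script that appends with >>, since stale lines from an earlier run would otherwise stay in the list.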

If you need to revisit the list of pictures, you might consider creating an array and iterating over the array:
pictures=(*.jpg *.JPG);

konsolebox 08-18-2010 11:39 PM

I think also that everything will be easier if you use arrays. Your script is easy to simplify but more info about it is needed.

more tips:
Code:

jpegs=(*.jpg)
njpegs=${#jpegs[@]}
IFS=$'\n'
echo "${jpegs[*]}" | sort
...


SilversleevesX 08-19-2010 02:33 AM

Sample of catstag
 
About two dozen lines culled from /catstag

Code:

asia,asian,Asian Impressions
beac,beach,On The Beaches
biki,bikini,Bikini Girls
blac,black,Black side of beauty
blon,blonde,A few Blonde moments
boat,boat,Boating Beauties
bpnt,bpaint,Painted Bodies
flas,flash,Showing The Goods
frfm,french,French Females
glas,glasses,Girls in Glasses
gtog,gtog_gwg,Home Girls & Best Girl Buddies
indi,indian,O India!
isms,ism,Self-Shooting Sweeties
lati,latina,Latin Chattin
natu,nature,Great Outdoors
nchx,nudechix,Cute & Sexy
nero,nero,Naked Erotica
nipb,nip,Nudes-In-Public
pool,pool,Pools Rule
preg,preg,Mothers-To-Be
redh,redhead,Nice & Fiery Redheads
russ,russian,Daughters of Russia
sill,silly,Getting Goofy
snow,snow,Snow Bunnies

Also, "long and descriptive" variable names is an excellent suggestion. I've also considered the enormous possibilities in letter-number combinations for variable names. If my math is right, keeping oneself strictly to single letters (24 if you omit "i" and "o") and one- and two-digit numbers (including zero), you can easily come up with over 2400 combinations. You're hardly likely to run out of possibilities, that's for sure.

Still, descriptive variable naming, rather than contrived and thematic naming (my weakness when coming up with variable names is to take the latter route) or the simplified approach I just mentioned, could be the best way to go all around. I'll definitely give it some thought.

BZT

grail 08-19-2010 04:13 AM

Just to follow up on others' points here: using a for loop over the output of ls is subject to word splitting, which I know from past experience is a definite issue with your files.
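
A sketch of the effect, assuming the default IFS and a throwaway directory made with mktemp. Both file names are invented for the demonstration.

```shell
#!/bin/bash
dir=$(mktemp -d)
touch "$dir/beach party-001beac.jpg" "$dir/plain-002asia.jpg"

count_with_ls=0
for f in $(cd "$dir" && ls *jpg); do    # the output of ls is word-split
    count_with_ls=$((count_with_ls + 1))
done

count_with_glob=0
for f in "$dir"/*jpg; do                # each match stays one word
    count_with_glob=$((count_with_glob + 1))
done

echo "ls loop: $count_with_ls items, glob loop: $count_with_glob items"
rm -rf "$dir"
```

The ls loop sees three items for two files, because the name with a space arrives as two separate words; the glob loop sees exactly two.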

SilversleevesX 08-19-2010 01:36 PM

ghostdog:

I just tried your all-internals rewrite of my original script (I just had to add the sort -t":" stuff from my old script to make it nearly perfect). I love the speed, but I can't puzzle out how it missed 2 out of 45 JPEGs in the folder I ran it on. I had to add their filenames and command-file names to tagmatch.txt by hand. No great shakes, and when I think back to how plodding the old script was, I'll take the speed and deal with only about 96% accuracy any day. You got a winner with that one.

BZT

SilversleevesX 08-19-2010 02:33 PM

Spoke too soon.
The gap is getting wider: now it's 46 out of 49 JPEGs, and on some runs of the script, tagmatch.txt is only 24 lines long. All the files in the folder have the requisite 4-letters-before-the-dot "hint" to the command files in the other folder, yet somehow doing this by way of internal commands and functions consistently means omitting or ignoring anywhere from 2 to 25 items that should be matching (the other script got all of them, slow as it was). I think I might need some help with reiterating over arrays, as an internal double-check before dumping to the list looks like it's necessary.

BZT


SilversleevesX 08-19-2010 02:54 PM

Willing to try it .. where do I put it?
 
Quote:

Originally Posted by konsolebox (Post 4070857)
I think also that everything will be easier if you use arrays. Your script is easy to simplify but more info about it is needed.

more tips:
Code:

jpegs=(*.jpg)
njpegs=${#jpegs[@]}
IFS=$'\n'
echo "${jpegs[*]}" | sort
...


Willing to try it .. where do I put it?
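
One possible placement, offered as a sketch rather than as konsolebox's intended use: the array lines would go where the for/ls loop currently fills templist. The function name list_jpegs is hypothetical, printf stands in for the IFS/echo pair so that odd names cannot be read as echo options, and names are assumed to contain no newlines.

```shell
#!/bin/bash
# list_jpegs is a hypothetical helper name. The body runs in a subshell
# (note the parentheses), so cd and shopt do not leak to the caller.
list_jpegs() (
    cd "$1" || exit 1
    shopt -s nullglob                    # no matches gives an empty array
    jpegs=(*.jpg)
    njpegs=${#jpegs[@]}
    (( njpegs > 0 )) || exit 0           # nothing to print
    printf '%s\n' "${jpegs[@]}" | sort   # one name per line, sorted
)

# The list can then replace the loop that filled templist:
#   list_jpegs /cygdrive/c/blu/newest > templist
```

Keeping the names in an array also means the list can be walked more than once without touching the directory again.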

SilversleevesX 08-19-2010 03:07 PM

Just saw what started the "gapping" - I had some unmatch-able "hinters" in the filenames. catstag had the right ones, and as I was downloading I simply didn't think to check it to see if there were more "descriptive" (there's that word again!) four-letter end-offs to the pics I had in front of me.

My bad. Now the script is perfect. I'll just have to keep checking back to catstag until the right hinters become second-nature.

BZT


SilversleevesX 08-19-2010 03:38 PM

Not perfect yet.
 
Got the old problem fixed, now there's a new one.
The part of the file name that this script is supposed to pay attention to is the "hinter," and if it doesn't, and you have a file with a name such as sillygf-002-009-014blon.jpg, what do you suppose this script
Code:

while read -r line;
do
        sp="$line"
        OLDIFS="$IFS"
        IFS=","
        set -- $sp
        vc=$1
        cv=$2
        dv=$3
        IFS="$OLDIFS"
       
        for striM in *jpg
        do
                OLDIFS="$IFS"
                IFS="."
                set -- $striM
                k=$1
                j=${k:(-4)}
                m=${k%????}
                case "$striM" in
                    *$vc* ) matchX=$striM
                        echo -e "I match $matchX with command file '$cv': \033[1;36m$dv.\033[0m"
                        echo -e "$matchX:$cv">>/cygdrive/c/blu/newest/tagmatch.txt
                    ;;
                esac
                IFS="$OLDIFS"
        done
done</cygdrive/c/blu/newest/nbin/catstag

will return for matches on it? An almost-straight export on the command line
Code:

sort -u -t":" -k1 tagmatch.txt
gave me this:
Code:

...
sillygf-002-009-014blon.jpg:blonde
sillygf-002-009-014blon.jpg:silly
...

and 50 lines in a sorted temp file tagmatch-sorted. Recall there are only 49 JPEG files in the folder at this time.

So how to re-focus? I suspect it's worth taking a look at tweaking the case statement, or adding another 'case' that makes it look at that j variable, set to (if I read that part of the script right) the last four letters before the dot in the filename, before proceeding.
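
One way to re-focus, sketched under assumptions: test the four-character hinter itself against the first catstag field, instead of testing whether that field appears anywhere in the file name. The function name match_by_hinter and its two arguments are made up for illustration; the catstag layout (tag,command file,description) is taken from the sample posted earlier.

```shell
#!/bin/bash
# match_by_hinter is a hypothetical helper, not part of the posted script.
# $1 = file name, $2 = path to a catstag-style file (tag,command,description).
# Prints "file:command" only when the last four characters before the
# extension equal the tag exactly.
match_by_hinter() {
    local file=$1 tags=$2 stem hinter vc cv dv
    stem=${file%.*}            # drop the extension
    hinter=${stem: -4}         # last four characters before the dot
    while IFS=, read -r vc cv dv; do
        [ -n "$vc" ] || continue          # skip blank lines
        case "$hinter" in
            "$vc") echo "$file:$cv" ;;    # quoted pattern: literal match
        esac
    done < "$tags"
}
```

With catstag lines for both "blon" and "sill", sillygf-002-009-014blon.jpg yields only the blonde entry, because "sill" is no longer searched for in the rest of the name.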

Back and forth like a kid on a swing. Well, at least there's good weather for it ;)

BZT


konsolebox 08-19-2010 06:14 PM

Honestly, I no longer wanted to do this after I saw the contents of your post. But I had already said a few things that would turn into a lie if I didn't follow through, so here is what I made:
Code:

#!/bin/bash

CATSTAG=/cygdrive/c/blu/newest/nbin/catstag
TAGMATCH=/cygdrive/c/blu/newest/tagmatch.txt

declare -i I=0
declare -a TAGS0=() TAGS1=() TAGS2=()

OLDIFS=$IFS IFS=,
while read -r 'TAGS0[I]' 'TAGS1[I]' 'TAGS2[I]'; do
        (( I++ ))
done < "$CATSTAG"
unset 'TAGS0[I]' 'TAGS1[I]' 'TAGS2[I]'  # read might allocate empty value
IFS=$OLDIFS

for FILE in *.jpg *.JPG; do
        TAG=${FILE: -8:4}

        for I in "${!TAGS0[@]}"; do
                if [[ $TAG = "${TAGS0[I]}" ]]; then    # compare with the tag at index I
                        echo -e "I match $FILE with command file '${TAGS1[I]}': \033[1;36m${TAGS2[I]}.\033[0m"
                        echo "$FILE:${TAGS1[I]}" >&3
                fi
        done
done 3> >(exec sort -u -t: -k1 > "$TAGMATCH")

alt.:
Code:

...
                [[ $TAG = "${TAGS0[I]}" ]] || continue
                echo -e "I match $FILE with command file '${TAGS1[I]}': \033[1;36m${TAGS2[I]}.\033[0m"
                echo "$FILE:${TAGS1[I]}" >&3
...

Please don't use the script as is and just use it as a reference to make a version of your own.

grail 08-19-2010 07:51 PM

Well everyone else has had a crack ... here is a variation on a theme (need bash 4+ probably)
Code:

#!/bin/bash

CATSTAG=/cygdrive/c/blu/newest/nbin/catstag
TAGMATCH=/cygdrive/c/blu/newest/tagmatch.txt

declare -A TAG_NAME TAG_DESC

while IFS=, read -r id name desc
do
    TAG_NAME[$id]="$name"
    TAG_DESC[$id]="$desc"
done<"$CATSTAG"

for file in *.jpg
do
    tag=${file:(-8):4}

    (( ${#TAG_NAME[$tag]} )) &&
        echo -e "I match $file with command file '${TAG_NAME[$tag]}': \033[1;36m${TAG_DESC[$tag]}.\033[0m" &&
        echo -e "$file:${TAG_NAME[$tag]}">>"$TAGMATCH"
done
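
On the "bash 4+" note: declare -A (associative arrays) arrived in bash 4.0, so a script like this could check the version up front and fail with a clear message. A minimal sketch, with has_assoc_arrays as a made-up helper name:

```shell
#!/bin/bash
# has_assoc_arrays is a hypothetical helper name.
# Succeeds when the running bash is version 4 or newer.
has_assoc_arrays() {
    (( BASH_VERSINFO[0] >= 4 ))
}

if has_assoc_arrays; then
    echo "bash ${BASH_VERSION}: declare -A is available"
else
    echo "bash ${BASH_VERSION}: declare -A is not available" >&2
fi
```

On an older bash, the indexed-array version posted earlier in the thread would be the one to adapt.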


konsolebox 08-19-2010 08:30 PM

Good variation.


All times are GMT -5.