LinuxQuestions.org


soupmagnet 07-09-2013 01:05 AM

[BASH] Sort while ignoring "The"
 
I have a list of maybe a thousand or more movies that I want sorted, but some of the titles begin with the words "The" or "A", which makes finding the movie you're looking for more complicated than I'd like.

Is it possible to sort the list while ignoring words like "The" or "A", or (ideally) dropping the words and appending them to the end (e.g. Movie Title, The)?

Oh, there's one more thing I didn't mention... Each movie's title will begin with, let's say, a not-so-random series of symbols used as a code for matching and classification (e.g. [*][+][##][ ] The Movie Title). If needed, I could change the pattern(s) of symbols to be represented by numbers instead if that would make it easier, but I'd like to keep the symbols if possible.

sag47 07-09-2013 01:21 AM

You could use sed to move "The" or "A" to the end of the name, separated from the rest of the name by a semicolon ( ; ). Then you can sort alphabetically with sort and reorganize the names using awk, which will place "The" or "A" back at the beginning of each name.

One-liner solution...
Code:

ls -1 | sed 's/^\(The \|A \)\(.*\)/\2;\1/' | sort | awk 'BEGIN{FS=";"};$0 ~ /;/{print $2 $1};$0 !~ /;/{print $0}'
Sed script broken down with comments....
Code:

#run substitute command
s/

#with regex from the beginning of the line match "The " or "A " with group 1, match the rest with group 2
#A group is designated by parenthesis
^\(The \|A \)\(.*\)

/

#The replacement string swaps group 1 and group 2.  It also places a semi-colon between the groups.
\2;\1

/

Awk script broken down with comments...
Code:

#$0 is the whole line, $1 is the first field, $2 is the second field (field separator is a space by default)

#before processing any lines make the field separator a semi-colon
BEGIN {
  FS=";"
}

#if the line contains a semi-colon then reorganize it (regex match)
$0 ~ /;/ {
  print $2 $1
}

#if the line does not contain a semi-colon then just print it (not regex match)
$0 !~ /;/ {
  print $0
}
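To see the first one-liner at work without touching any files, here is a minimal sketch that feeds it a few sample titles on standard input instead of using ls. The function name sort_ignoring_articles and the sample titles are made up for illustration, the awk part is lightly condensed, and the \| alternation assumes GNU sed.

```shell
# Hypothetical wrapper around the sed | sort | awk pipeline from the post.
# Assumes GNU sed (for \|) and titles that contain no semicolons.
sort_ignoring_articles() {
    sed 's/^\(The \|A \)\(.*\)/\2;\1/' |
        sort |
        awk 'BEGIN { FS = ";" } /;/ { print $2 $1; next } { print }'
}

printf '%s\n' "The Matrix" "Alien" "A Bug's Life" "Zodiac" | sort_ignoring_articles
```

With these four titles the output order is Alien, A Bug's Life, The Matrix, Zodiac.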



**********EDIT

I noticed my one-liner did not account for your strange "begins with weird symbols" request, e.g. "[asdf] The Movie.file". I'll attempt to adapt my commands.

ONE-LINER SOLUTION #2
Code:

ls -1 | sed 's/[._]/ /g; s/^\(\[[^]]*\]\s*\)\(.*\)/\2;\1/; s/^\(The \|A \)\(.*\)/\2;\1/' | sort | awk 'BEGIN{FS=";"} NF == 1 {print $0} NF == 2 {print $2 $1} NF == 3 {print $2 $3 $1}'
Now sed does three replacements on a single line.
Code:

#replace all periods and underscores with spaces
s/[._]/ /g;

#reorganize "[asdf] The Movie file" to "The Movie file;[asdf] " (the period is already a space by this point)
s/^\(\[[^]]*\]\s*\)\(.*\)/\2;\1/;

#reorganize "The Movie file;[asdf] " to "Movie file;[asdf] ;The "
s/^\(The \|A \)\(.*\)/\2;\1/

After doing the sort I changed the awk script a little bit. Now it detects if there's 1, 2, or 3 fields using the number of fields variable (NF).
Code:

#set field separator to semi-colon
BEGIN {
 FS=";"
}

#If number of fields = 1 then just print the whole line
NF == 1 {
  print $0
}

#If number of fields = 2 then just swap them in a manner that renders the original file name.
NF == 2 {
  print $2 $1
}

#If number of fields = 3 then swap them in a manner that renders the original file name.
NF == 3 {
  print $2 $3 $1
}

That *should* do what you want. I made the choice to replace periods and underscores with spaces because accounting for them would have just made the one liner too complicated (as if it wasn't complicated enough!)
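The second one-liner matches a single bracket group, while the titles in the first post carry several in a row (e.g. [*][+][##][ ]). As a rough sketch only, the same idea can be adapted to a run of bracket groups by using extended regular expressions (sed -E). The function name is hypothetical, the period/underscore replacement is left out, and titles are assumed to contain no semicolons.

```shell
# Hypothetical variant of one-liner #2 for titles read from standard input.
# Group 1 captures a whole run of [..] groups plus trailing whitespace.
sort_with_symbol_prefix() {
    sed -E 's/^((\[[^]]*\])+[[:space:]]*)(.*)/\3;\1/; s/^(The |A )(.*)/\2;\1/' |
        sort |
        awk 'BEGIN { FS = ";" }
             NF == 3 { print $2 $3 $1; next }   # prefix and article were both moved
             NF == 2 { print $2 $1; next }      # only one of them was moved
             { print }'
}

printf '%s\n' "[*][+] The Matrix" "[#] Alien" "Zodiac" "[ ][*] A Bug's Life" |
    sort_with_symbol_prefix
```

The lines come back in their original form, ordered Alien, Bug's Life, Matrix, Zodiac.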

konsolebox 07-09-2013 01:59 AM

There's a way to do this in pure Bash, as you asked, but let's go over the concept first. If you prefer, you could use the same idea as the basis for a script in Awk or another language such as Ruby. Other commands, or newer versions of familiar commands, might also solve this, but they may not always be available.

One way to do it is to map the strings to an associative array whose keys have already had common words like "A" and "The", and punctuation marks like *, +, #, etc., trimmed off.

From there you could sort the key strings, either within another indexed array or by echoing them through the sort command.

Once the keys are sorted, you can use them to print the original titles in sorted order.

An example of it would be like this:
Code:

#!/bin/bash

[[ BASH_VERSINFO -ge 4 ]] || {
    echo "This script requires Bash version 4.0 or newer."
    exit 1
}

shopt -s extglob

declare -A MAP

K=0

# Map contents

while read -r TITLE; do
    KEY=${TITLE##+([[:cntrl:][:punct:][:blank:]])}
    KEY=${KEY#@(The|A)*([[:blank:]])}
    [[ -z $KEY ]] && KEY=$TITLE
    MAP[$KEY]=$TITLE
    MAP_KEYS[K++]=$KEY
done

# Sort the keys and print sorted list

while read -r KEY; do
    echo "${MAP[$KEY]}"
done < <(IFS=$'\n'; echo "${MAP_KEYS[*]}" | sort)

And you can run that with
Code:

bash script.sh < input_list.txt > output_list.txt
Finally, we could use a helper function such as array_sort (a user-defined function, not a Bash builtin) to sort MAP_KEYS without calling an external sort command.
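Since the post does not show such a helper, here is one possible sketch of sorting MAP_KEYS with Bash builtins only. The function name sort_map_keys is hypothetical; it is a plain insertion sort, which should be fine for a list of around a thousand titles but is not tuned for speed, and it assumes MAP_KEYS has no gaps in its indices.

```shell
#!/bin/bash
# Hypothetical helper: sorts the global indexed array MAP_KEYS in place
# using an insertion sort and the string comparison of [[ ]].
sort_map_keys() {
    local i j key
    for (( i = 1; i < ${#MAP_KEYS[@]}; i++ )); do
        key=${MAP_KEYS[i]}
        j=$(( i - 1 ))
        # Shift larger elements one slot to the right.
        while (( j >= 0 )) && [[ ${MAP_KEYS[j]} > $key ]]; do
            MAP_KEYS[j+1]=${MAP_KEYS[j]}
            j=$(( j - 1 ))
        done
        MAP_KEYS[j+1]=$key
    done
}

MAP_KEYS=("Matrix" "Alien" "Zodiac" "Bug's Life")
sort_map_keys
printf '%s\n' "${MAP_KEYS[@]}"
```

The final loop of the script could then iterate over "${MAP_KEYS[@]}" directly instead of reading from sort.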

soupmagnet 07-09-2013 10:22 AM

Thank you both for these responses.

@konsolebox, I hadn't really thought of using associative arrays in this situation but it makes a lot more sense than what I had planned.

ta0kira 07-09-2013 11:10 AM

You can also just associate via a delimiter:
Code:

#!/usr/bin/env bash

#only read the data once, from standard input
lines=$(cat)

#remove "the", "a", and "an"
fixed=$(echo "$lines" | sed -r 's/( |^)([Tt]he|[Aa](n|)) /\1/g')

#find the order that the lines need to be put in
order=$(echo "$fixed" | grep -n . | sort -t: -k2,2 | grep -n . | sort -g -t: -k2,2 | cut -d: -f 1)

#put the lines in order
paste -d: <(echo "$order") <(echo "$lines") | sort -g -t: -k1,1 | cut -d: -f2

You can do this without bash if you use temp files instead of <(...). If you have ":" in the titles, you'd have to translate the first ":" to something else after each "grep -n ." step.

Kevin Barry
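A variation on the same delimiter idea, offered only as a sketch and not as the script above: attach the sort key to the front of each line with a tab, sort on the key, then cut the key off again. Using a tab avoids the ":" problem, under the assumption that titles never contain a tab. The function name is hypothetical, and unlike the sed expression above it strips only one leading article.

```shell
# Hypothetical helper: decorate each line with a key, sort by it, strip the key.
sort_by_stripped_key() {
    awk '{
        key = $0
        sub(/^(The|An|A) /, "", key)   # drop one leading article from the key only
        print key "\t" $0
    }' |
        sort -t "$(printf '\t')" -k1,1 |
        cut -f2-
}

printf '%s\n' "The Matrix" "An Education" "Alien" "A Bug's Life" | sort_by_stripped_key
```

This needs no process substitution, so it should also run in a plain POSIX shell.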

grail 07-09-2013 02:24 PM

An awk solution as konsolebox had suggested:
Code:

ls /path/to/files/ | awk '{m[gensub(/^(A|The)\./,"","1")]=$0}END{asorti(m,a);for(i=1;i <= length(m);i++)print m[a[i]]}'
Obviously alter the gensub to account for anything else to be removed from the front.
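The first post also asked, ideally, for the "Movie Title, The" form. As a rough sketch of how awk could produce it, the following uses only POSIX awk features (match, RLENGTH, substr) rather than gensub or asorti. The function name is hypothetical, and it does not handle the symbol prefix from the first post.

```shell
# Hypothetical helper: rewrite "The Foo" as "Foo, The", then sort the result.
move_article_to_end() {
    awk '{
        if (match($0, /^(The|An|A) /)) {
            article = substr($0, 1, RLENGTH - 1)        # the article without its space
            $0 = substr($0, RLENGTH + 1) ", " article   # rest of the title, then the article
        }
        print
    }' | sort
}

printf '%s\n' "The Matrix" "Alien" "A Bug's Life" | move_article_to_end
```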

konsolebox 07-09-2013 07:17 PM

Just some corrections:
Code:

    KEY=${KEY#@(The|A)*([[:blank:]])}
Should be
Code:

    KEY=${KEY##@(The|A)+([[:blank:]])}
We must also ignore empty lines (and probably blank lines too). And we should be careful not to overwrite existing values when two titles produce the same key, so:
Code:

#!/bin/bash

[[ BASH_VERSINFO -ge 4 ]] || {
    echo "This script requires Bash version 4.0 or newer."
    exit 1
}

shopt -s extglob

declare -A MAP

K=0

# Map contents

while read -r TITLE; do
    [[ $TITLE != *([[:blank:]]) ]] || continue
    KEY=${TITLE##+([[:cntrl:][:punct:][:blank:]])}
    KEY=${KEY##@(The|A)+([[:blank:]])}
    [[ -z $KEY ]] && KEY=$TITLE
    KEY_ORIG=$KEY I=0
    until [[ -z ${MAP[$KEY]} ]]; do
        KEY=${KEY_ORIG}$(( I++ ))
    done

    MAP[$KEY]=$TITLE
    MAP_KEYS[K++]=$KEY
done

# Sort the keys and print sorted list

while read -r KEY; do
    echo "${MAP[$KEY]}"
done < <(IFS=$'\n'; echo "${MAP_KEYS[*]}" | sort)
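To see what the two key-trimming expansions in the script do to a single title, here is a small sketch that isolates them in a function. The name make_key is made up for illustration; the patterns are the ones from the script.

```shell
#!/bin/bash
shopt -s extglob   # enables the +( ) and @( ) patterns

# Hypothetical helper that mirrors the key-trimming steps of the script.
make_key() {
    local key=$1
    key=${key##+([[:cntrl:][:punct:][:blank:]])}   # drop leading symbols and blanks
    key=${key##@(The|A)+([[:blank:]])}             # drop one leading "The " or "A "
    printf '%s\n' "${key:-$1}"                     # fall back to the full title if nothing is left
}

make_key '[*][+][##][ ] The Movie Title'   # prints: Movie Title
make_key 'Theatre of Blood'                # unchanged: no blank follows "The"
```

Because the pattern requires at least one blank after the article, titles that merely start with the letters "The" or "A" are left alone.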


ta0kira 07-09-2013 08:42 PM

You also need $BASH_VERSINFO instead of BASH_VERSINFO.

Kevin Barry

konsolebox 07-09-2013 09:14 PM

Quote:

Originally Posted by ta0kira (Post 4987401)
You also need $BASH_VERSINFO instead of BASH_VERSINFO.

No, that's no longer needed. In [[ BASH_VERSINFO -ge 4 ]], BASH_VERSINFO is equivalent to ${BASH_VERSINFO[0]}. Not only is it the simpler form, it also lets the error message be shown on simpler shells, which would treat the latter form as a syntax error.

grail 07-10-2013 04:57 AM

hmmm ... that doesn't seem to work for me, I get a nasty set of error messages:
Code:

$ [[ BASH_VERSINFO >= 4 ]] && echo yes
bash: syntax error in conditional expression
bash: syntax error near `4'

However, an easy solution is to use (()) which are meant for arithmetic expressions:
Code:

$ (( BASH_VERSINFO >= 4 )) && echo yes
yes


sag47 07-10-2013 05:51 PM

I guess a good question to ask is are the file names consistent? Do they always have "[stuff] name.file" or are they sometimes just "name.file" with no stuff?

ta0kira 07-10-2013 05:56 PM

Quote:

Originally Posted by grail (Post 4987578)
hmmm ... that doesn't seem to work for me, I get a nasty set of error messages:
Code:

$ [[ BASH_VERSINFO >= 4 ]] && echo yes
bash: syntax error in conditional expression
bash: syntax error near `4'

However, an easy solution is to use (()) which are meant for arithmetic expressions:
Code:

$ (( BASH_VERSINFO >= 4 )) && echo yes
yes


I believe that's related to >= and not BASH_VERSINFO.

Kevin Barry

konsolebox 07-10-2013 07:50 PM

Quote:

Originally Posted by grail (Post 4987578)
However, an easy solution is to use (()) which are meant for arithmetic expressions:

Well, an arithmetic expression is different from a conditional expression: (( )) is meant for arithmetic, and [[ ]] is meant for conditional expressions. Also, unfortunately, in earlier versions I encountered a problem where (( )) didn't work the way I expected. I thought it would be a convenience, but it wasn't.

grail 07-11-2013 02:15 AM

My bad, thanks ta0kira for pointing out the oversight. Strangely enough it does seem to work without the $ at the front when using -ge for the test.

konsolebox 07-11-2013 08:27 PM

Just as (( )) interprets arithmetic expressions without the need for $, so does [[ ]] with arithmetic comparisons.

Memory tells me that around 2006 or 2008, when I attempted to convert my [[ A -xx B ]] expressions to (( )), (( )) just returned the same exit code no matter what the expression was. I hope it was just a mistake on my part, since even now I can't reproduce the error. Still, I can't help being cautious about it.

I have actually considered using (( )) for some arithmetic comparisons that [[ ]] can't handle. I still regard [[ ]] as the main tool for conditional expressions, but for more complex comparisons that need parenthesized sub-expressions, something like (( (A + 4) % 5 < B )) is more convenient than a slower form that evaluates a sub-expression first, like [[ $(( (A + 4) % 5 )) -lt B ]]. Some simpler expressions, such as changing a variable's value in place, can be done in [[ ]], like [[ A+=2 -lt X ]], but I dislike the style inconsistency of not being able to add spaces.

As for (( BASH_VERSINFO >= 4 )), we can't use it as an alternative to [[ BASH_VERSINFO -ge 4 ]], since some shells could treat (( )) as a syntax error involving ( ). It's not only about the message being shown, but also about preventing the commands after it from being executed or misinterpreted in other shells, especially commands that cause irreversible changes.
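The forms discussed in this post can be compared side by side in a short Bash sketch. The values of A and B are arbitrary sample numbers chosen for illustration.

```shell
#!/bin/bash
A=3 B=4

# Arithmetic command: parentheses and operators work directly.
if (( (A + 4) % 5 < B )); then echo "arithmetic form: true"; fi

# Conditional command: the left operand is computed by an arithmetic expansion first.
if [[ $(( (A + 4) % 5 )) -lt B ]]; then echo "conditional form: true"; fi

# The operands of -ge are evaluated arithmetically, so bare names work,
# and an unsubscripted array name refers to element 0.
if [[ BASH_VERSINFO -ge 4 ]]; then echo "Bash 4 or newer"; fi
```

With A=3 and B=4, (A + 4) % 5 is 2, so both comparisons succeed.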


All times are GMT -5.