LinuxQuestions.org
Old 07-09-2013, 02:05 AM   #1
soupmagnet
LQ Newbie
 
Registered: Sep 2012
Posts: 27

Rep: Reputation: Disabled
[BASH] Sort while ignoring "The"


I have a list of maybe a thousand or more movies that I want sorted, but some of the titles begin with the words "The" or "A", which makes finding the movie you're looking for more complicated than I'd like.

Is it possible to sort the list while ignoring words like "The" or "A", or (ideally) dropping those words and appending them to the end (e.g. "Movie Title, The")?

Oh, there's one more thing I didn't mention... Each movie's title will begin with, let's say, a not-so-random series of symbols used as a code for matching and classification (e.g. [*][+][##][ ] The Movie Title). If needed, I could change the pattern(s) of symbols to numbers if that would make it easier, but I'd like to keep the symbols if possible.

Last edited by soupmagnet; 07-09-2013 at 02:07 AM.
 
Old 07-09-2013, 02:21 AM   #2
sag47
Senior Member
 
Registered: Sep 2009
Location: Philly, PA
Distribution: Kubuntu x64, RHEL, Fedora Core, FreeBSD, Windows x64
Posts: 1,505
Blog Entries: 35

Rep: Reputation: 383
You could use sed to move "The" or "A" to the end of the name, separated from the rest of the name by a semi-colon ( ; ), and then sort alphabetically. Afterwards, awk reorganizes the names, placing "The" or "A" back at the beginning. (The \| alternation used below is a GNU sed extension.)

One-liner solution...
Code:
ls -1 | sed 's/^\(The \|A \)\(.*\)/\2;\1/' | sort | awk 'BEGIN{FS=";"};$0 ~ /;/{print $2 $1};$0 !~ /;/{print $0}'
Sed script broken down with comments....
Code:
#run substitute command
s/

#regex: anchored at the beginning of the line, group 1 matches "The " or "A ", group 2 matches the rest
#A group is designated by parentheses
^\(The \|A \)\(.*\)

/

#The replacement string swaps group 1 and group 2.  It also places a semi-colon between the groups.
\2;\1

/
Awk script broken down with comments...
Code:
#$0 is the whole line, $1 is the first field, $2 is the second field (field separator is a space by default)

#before processing any lines make the field separator a semi-colon
BEGIN {
  FS=";"
}

#if the line contains a semi-colon then reorganize it (regex match)
$0 ~ /;/ {
  print $2 $1
}

#if the line does not contain a semi-colon then just print it (not regex match)
$0 !~ /;/ {
  print $0
}


**********EDIT

I noticed my one liner did not account for your strange "begins with weird symbols" request, e.g. "[asdf] The Movie.file" I'll attempt to adapt my commands.

ONE-LINER SOLUTION #2
Code:
ls -1 | sed 's/[._]/ /g; s/^\(\[[^]]*\]\s*\)\(.*\)/\2;\1/; s/^\(The \|A \)\(.*\)/\2;\1/' | sort | awk 'BEGIN{FS=";"} NF == 1 {print $0} NF == 2 {print $2 $1} NF == 3 {print $2 $3 $1}'
Now sed does three replacements on a single line.
Code:
#replace all periods and underscores with spaces
s/[._]/ /g;

#reorganize "[asdf] The Movie file" to "The Movie file;[asdf] "
s/^\(\[[^]]*\]\s*\)\(.*\)/\2;\1/;

#reorganize "The Movie file;[asdf] " to "Movie file;[asdf] ;The "
s/^\(The \|A \)\(.*\)/\2;\1/
I also changed the awk script that runs after the sort a little bit. Now it detects whether there are 1, 2, or 3 fields using the number-of-fields variable (NF).
Code:
#set field separator to semi-colon
BEGIN {
 FS=";"
}

#If number of fields = 1 then just print the whole line
NF == 1 {
  print $0
}

#If number of fields = 2 then just swap them in a manner that renders the original file name.
NF == 2 {
  print $2 $1
}

#If number of fields = 3 then swap them in a manner that renders the original file name.
NF == 3 {
  print $2 $3 $1
}
That *should* do what you want. I chose to replace periods and underscores with spaces because accounting for them would have made the one-liner too complicated (as if it wasn't complicated enough!)
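
A note on the bracket pattern in one-liner #2: it matches a single leading [..] group, while the example in the first post has several ([*][+][##][ ]). A minimal sketch of the same decorate, sort, undecorate idea that strips any number of leading groups might look like the following. It assumes GNU sed and titles that contain no tab characters; the function name sort_titles is hypothetical, and it reads titles from standard input rather than from ls.

```bash
#!/usr/bin/env bash

# sort_titles (hypothetical name): read titles on stdin and print them sorted
# by a key that ignores leading [..] groups and a leading "The " or "A ".
sort_titles() {
    # Decorate: h saves the original line, the two s commands reduce the
    # pattern space to the sort key, G appends the saved original after a
    # newline, and the last s turns that newline into a tab (GNU sed).
    sed -E 'h
            s/^(\[[^]]*\][[:space:]]*)*//
            s/^(The|A)[[:space:]]+//
            G
            s/\n/\t/' |
    # Sort on the key only, then drop the key to recover the original line.
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 |
    cut -f2-
}

printf '%s\n' '[*][+][##][ ] The Zebra' '[+] A Mango' '[ ] Apple' | sort_titles
# [ ] Apple
# [+] A Mango
# [*][+][##][ ] The Zebra
```

Because the original line travels along with its key, nothing has to be reassembled afterwards and periods or underscores stay untouched. LC_ALL=C makes the comparison byte-wise (and therefore case-sensitive); drop it to sort by the current locale instead.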

Last edited by sag47; 07-09-2013 at 03:15 AM.
 
Old 07-09-2013, 02:59 AM   #3
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,245
Blog Entries: 16

Rep: Reputation: 233
There's a way to do this in Bash alone, as you require, but let's start with the concept. If you prefer, you could use the same concept as the basis for a script in Awk or another language such as Ruby. Other commands, or newer versions of familiar ones, might also solve this, but they may not always be available.

One way to do it is to map the titles into an associative array whose keys are the titles with leading punctuation marks (*, +, #, etc.) and common words like "A" and "The" stripped off.

From there you could sort the keys, either in Bash itself through another indexed array or by echoing them to the sort command.

Once the keys are sorted, you can walk through them in order and print the original title stored under each key.

An example of it would be like this:
Code:
#!/bin/bash

[[ BASH_VERSINFO -ge 4 ]] || {
    echo "This script requires Bash version 4.0 or newer."
    exit 1
}

shopt -s extglob

declare -A MAP

K=0

# Map contents

while read -r TITLE; do
    KEY=${TITLE##+([[:cntrl:][:punct:][:blank:]])}
    KEY=${KEY#@(The|A)*([[:blank:]])}
    [[ -z $KEY ]] && KEY=$TITLE
    MAP[$KEY]=$TITLE
    MAP_KEYS[K++]=$KEY
done

# Sort the keys and print sorted list

while read -r KEY; do
    echo "${MAP[$KEY]}"
done < <(IFS=$'\n'; echo "${MAP_KEYS[*]}" | sort)
And you can run that with
Code:
bash script.sh < input_list.txt > output_list.txt
Finally, we could sort MAP_KEYS with an array_sort function instead of an external sort command. (array_sort is not a Bash builtin; it would have to be defined separately.)
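
As a rough illustration of sorting without an external command, here is a minimal insertion sort that uses only Bash builtins. It is not the array_sort mentioned above; the function name sort_into_SORTED and the global array SORTED are hypothetical names chosen for this sketch.

```bash
#!/usr/bin/env bash

# sort_into_SORTED (hypothetical name): insertion-sort the arguments into the
# global indexed array SORTED using only Bash builtins.
sort_into_SORTED() {
    SORTED=()
    local item i
    for item in "$@"; do
        i=${#SORTED[@]}
        # Shift every element that sorts after item one slot to the right.
        while (( i > 0 )) && [[ ${SORTED[i-1]} > "$item" ]]; do
            SORTED[i]=${SORTED[i-1]}
            i=$(( i - 1 ))
        done
        SORTED[i]=$item
    done
}

sort_into_SORTED "Zebra" "Apple" "Mango"
printf '%s\n' "${SORTED[@]}"
# Apple
# Mango
# Zebra
```

In the script above it would be called as sort_into_SORTED "${MAP_KEYS[@]}", followed by a loop over "${SORTED[@]}". Insertion sort is quadratic, so for a list of a thousand titles the external sort command is likely to be noticeably faster.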

Last edited by konsolebox; 07-09-2013 at 03:09 AM.
 
Old 07-09-2013, 11:22 AM   #4
soupmagnet
LQ Newbie
 
Registered: Sep 2012
Posts: 27

Original Poster
Rep: Reputation: Disabled
Thank you both for these responses.

@konsolebox, I hadn't really thought of using associative arrays in this situation but it makes a lot more sense than what I had planned.
 
Old 07-09-2013, 12:10 PM   #5
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
You can also just associate via a delimiter:
Code:
#!/usr/bin/env bash

#only read the data once, from standard input
lines=$(cat)

#remove "the", "a", and "an"
fixed=$(echo "$lines" | sed -r 's/( |^)([Tt]he|[Aa](n|)) /\1/g')

#find the order that the lines need to be put in
order=$(echo "$fixed" | grep -n . | sort -t: -k2,2 | grep -n . | sort -g -t: -k2,2 | cut -d: -f 1)

#put the lines in order
paste -d: <(echo "$order") <(echo "$lines") | sort -g -t: -k1,1 | cut -d: -f2
You can do this without bash if you use temp files instead of <(...). If you have ":" in the titles, you'd have to translate the first ":" to something else after each "grep -n ." step.
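
To make the double "grep -n ." step easier to follow, here is a small worked example of just that part, wrapped in a hypothetical function named rank_lines. It assumes the input has no empty lines and no ":" in the text, as discussed above.

```bash
#!/usr/bin/env bash

# rank_lines (hypothetical name): for each input line, print
# "rank:original line number:text", ordered by rank.
rank_lines() {
    grep -n . |                  # prefix each line with its line number
    LC_ALL=C sort -t: -k2,2 |    # order by the text after the first colon
    grep -n .                    # prefix again: this new number is the rank
}

printf '%s\n' 'Zebra' 'Apple' 'Mango' | rank_lines
# 1:2:Apple
# 2:3:Mango
# 3:1:Zebra
```

Sorting that output numerically on the second field and keeping only the first field gives 3, 1, 2: the rank of each original line, in original order. Those ranks are what gets pasted next to the untouched titles before the final sort.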

Kevin Barry

Last edited by ta0kira; 07-09-2013 at 12:33 PM.
 
Old 07-09-2013, 03:24 PM   #6
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,689

Rep: Reputation: 1987
An awk solution, as konsolebox suggested (it needs GNU awk, since gensub and asorti are gawk extensions):
Code:
ls /path/to/files/ | awk '{m[gensub(/^(A|The)\./,"","1")]=$0}END{asorti(m,a);for(i=1;i <= length(m);i++)print m[a[i]]}'
Obviously alter the gensub to account for anything else to be removed from the front.
 
Old 07-09-2013, 08:17 PM   #7
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,245
Blog Entries: 16

Rep: Reputation: 233
Just some corrections:
Code:
    KEY=${KEY#@(The|A)*([[:blank:]])}
Should be
Code:
    KEY=${KEY##@(The|A)+([[:blank:]])}
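
A short demonstration of why the correction matters. The two helper functions are hypothetical wrappers around the old and the new pattern so the results can be compared side by side.

```bash
#!/usr/bin/env bash
shopt -s extglob

# Hypothetical helpers: strip_loose uses the original pattern,
# strip_strict uses the corrected one.
strip_loose()  { local k=$1; printf '%s\n' "${k#@(The|A)*([[:blank:]])}"; }
strip_strict() { local k=$1; printf '%s\n' "${k##@(The|A)+([[:blank:]])}"; }

strip_loose  "Avatar"        # vatar   (zero blanks allowed, so the bare "A" is cut)
strip_loose  "Theory"        # ory
strip_strict "Avatar"        # Avatar  (at least one blank is required)
strip_strict "The   Matrix"  # Matrix  (## removes the longest match, all blanks)
```

With *( ) the article may be followed by zero blanks, so any title that merely starts with the letters "A" or "The" loses them. +( ) requires at least one blank, and ## instead of # makes the expansion consume every blank after the article rather than stopping at the shortest match.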
We must also ignore empty lines (and probably blank lines too). And we should be careful not to overwrite existing values when two titles produce the same key, so:
Code:
#!/bin/bash

[[ BASH_VERSINFO -ge 4 ]] || {
    echo "This script requires Bash version 4.0 or newer."
    exit 1
}

shopt -s extglob

declare -A MAP

K=0

# Map contents

while read -r TITLE; do
    [[ $TITLE != *([[:blank:]]) ]] || continue
    KEY=${TITLE##+([[:cntrl:][:punct:][:blank:]])}
    KEY=${KEY##@(The|A)+([[:blank:]])}
    [[ -z $KEY ]] && KEY=$TITLE
    KEY_ORIG=$KEY I=0
    until [[ -z ${MAP[$KEY]} ]]; do
        KEY=${KEY_ORIG}$(( I++ ))
    done
    MAP[$KEY]=$TITLE
    MAP_KEYS[K++]=$KEY
done

# Sort the keys and print sorted list

while read -r KEY; do
    echo "${MAP[$KEY]}"
done < <(IFS=$'\n'; echo "${MAP_KEYS[*]}" | sort)
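
The collision handling in the until loop can be tried in isolation. This is only a sketch: unique_key is a hypothetical function name, and the keys are entered by hand instead of being derived from titles. It needs Bash 4 or newer for the associative array.

```bash
#!/usr/bin/env bash

declare -A MAP

# unique_key (hypothetical name): print $1 if MAP has no entry for it,
# otherwise $1 followed by the first counter value that is still free.
unique_key() {
    local key=$1 orig=$1 i=0
    until [[ -z ${MAP[$key]} ]]; do
        key=${orig}$(( i++ ))
    done
    printf '%s\n' "$key"
}

MAP[Mummy]="The Mummy"
first=$(unique_key "Mummy")     # Mummy0
MAP[$first]="A Mummy"
second=$(unique_key "Mummy")    # Mummy1
echo "$first $second"
```

Without this step, "A Mummy" would silently replace "The Mummy" in MAP, and the first title would then be printed twice because its key is still listed twice in MAP_KEYS.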
 
Old 07-09-2013, 09:42 PM   #8
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
You also need $BASH_VERSINFO instead of BASH_VERSINFO.

Kevin Barry
 
Old 07-09-2013, 10:14 PM   #9
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,245
Blog Entries: 16

Rep: Reputation: 233
Quote:
Originally Posted by ta0kira View Post
You also need $BASH_VERSINFO instead of BASH_VERSINFO.
No, that's no longer needed. In [[ BASH_VERSINFO -ge 4 ]], BASH_VERSINFO is equivalent to ${BASH_VERSINFO[0]}. Not only is it the simpler form, it also helps simpler shells get as far as printing the error message, since they would see the latter format as a syntax error.
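
A quick way to check the equivalence on your own system. The function name bash_major_at_least is hypothetical; the rest uses only standard Bash syntax.

```bash
#!/usr/bin/env bash

# Both spellings read element 0 of the BASH_VERSINFO array.
major_bare=$(( BASH_VERSINFO ))
major_indexed=${BASH_VERSINFO[0]}
echo "bare: $major_bare, indexed: $major_indexed"

# bash_major_at_least (hypothetical name): succeed when the running Bash's
# major version is at least $1. Both operands of -ge are evaluated as
# arithmetic expressions, so neither variable needs a $.
bash_major_at_least() {
    local want=$1
    [[ BASH_VERSINFO -ge want ]]
}

bash_major_at_least 4 || echo "This script requires Bash version 4.0 or newer."
```

Referring to an array by its bare name is the same as referring to its element 0, which is why the bare name yields the major version.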
 
Old 07-10-2013, 05:57 AM   #10
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,689

Rep: Reputation: 1987
hmmm ... that doesn't seem to work for me, I get a nasty set of error messages:
Code:
$ [[ BASH_VERSINFO >= 4 ]] && echo yes
bash: syntax error in conditional expression
bash: syntax error near `4'
However, an easy solution is to use (()) which are meant for arithmetic expressions:
Code:
$ (( BASH_VERSINFO >= 4 )) && echo yes
yes
 
Old 07-10-2013, 06:51 PM   #11
sag47
Senior Member
 
Registered: Sep 2009
Location: Philly, PA
Distribution: Kubuntu x64, RHEL, Fedora Core, FreeBSD, Windows x64
Posts: 1,505
Blog Entries: 35

Rep: Reputation: 383
I guess a good question to ask is are the file names consistent? Do they always have "[stuff] name.file" or are they sometimes just "name.file" with no stuff?
 
Old 07-10-2013, 06:56 PM   #12
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by grail View Post
hmmm ... that doesn't seem to work for me, I get a nasty set of error messages:
Code:
$ [[ BASH_VERSINFO >= 4 ]] && echo yes
bash: syntax error in conditional expression
bash: syntax error near `4'
However, an easy solution is to use (()) which are meant for arithmetic expressions:
Code:
$ (( BASH_VERSINFO >= 4 )) && echo yes
yes
I believe that's related to >= and not BASH_VERSINFO.

Kevin Barry

Last edited by ta0kira; 07-10-2013 at 06:58 PM.
 
Old 07-10-2013, 08:50 PM   #13
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,245
Blog Entries: 16

Rep: Reputation: 233
Quote:
Originally Posted by grail View Post
However, an easy solution is to use (()) which are meant for arithmetic expressions:
Well, an arithmetic expression is different from a conditional expression: (( )) is meant for arithmetic, and [[ ]] is meant for conditional expressions. Also, unfortunately, in earlier versions of Bash I encountered a problem in which (( )) didn't work the way I expected. I thought it would be a convenience, but it wasn't.
 
Old 07-11-2013, 03:15 AM   #14
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,689

Rep: Reputation: 1987
My bad, thanks ta0kira for pointing out the oversight. Strangely enough it does seem to work without the $ at the front when using -ge for the test.
 
Old 07-11-2013, 09:27 PM   #15
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,245
Blog Entries: 16

Rep: Reputation: 233
Just as (( )) interprets arithmetic expressions without the need for $, so does [[ ]] with arithmetic comparison operators such as -ge.

Memory tells me that around 2006 or 2008, when I attempted to convert my [[ A -xx B ]] expressions to (( )), (( )) just returned the same exit code no matter what the expression was. I hope it was just a mistake on my part, since even now I can't reproduce the error. Still, I can't help being careful and having doubts about it.

I have actually considered using (( )) for a while now for arithmetic comparisons that [[ ]] can't handle directly. I still regard [[ ]] as the main tool for conditional expressions, but for more complex comparisons that need sub-expressions enclosed in ( ), such as (( (A + 4) % 5 < B )), (( )) is more convenient than a slower, re-evaluated sub-expression like [[ $(( (A + 4) % 5 )) -lt B ]]. Some simpler expressions, like changing a variable's value on the fly, can be done in [[ ]], as in [[ A+=2 -lt X ]], but I dislike the style inconsistency of not being able to add spaces.
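
For anyone who wants to compare the two spellings from the previous paragraph, here is a minimal sketch. The function names check_arith and check_cond are hypothetical; both test the same condition, (A + 4) % 5 < B.

```bash
#!/usr/bin/env bash

# check_arith (hypothetical name): the whole comparison is done inside (( )).
check_arith() {
    local A=$1 B=$2
    (( (A + 4) % 5 < B ))
}

# check_cond (hypothetical name): the parenthesized part is expanded by $(( ))
# first, and [[ ]] then compares the result with -lt.
check_cond() {
    local A=$1 B=$2
    [[ $(( (A + 4) % 5 )) -lt B ]]
}

check_arith 3 4 && echo "(( )) form: true"    # (3 + 4) % 5 = 2, and 2 < 4
check_cond  3 4 && echo "[[ ]] form: true"
```

Both functions return exit status 0 when the comparison holds and 1 when it does not, so either can be used directly in if or with && and ||.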

As for (( BASH_VERSINFO >= 4 )), we can't use it as an alternative to [[ BASH_VERSINFO -ge 4 ]], since some shells could see (( )) as a syntax error involving ( ). It's not only about the message being shown, but also about preventing the commands after it from being executed or misinterpreted by other shells, especially commands that cause irreversible changes.

Last edited by konsolebox; 07-11-2013 at 09:29 PM.
 
  



All times are GMT -5.
