Sort File by Field

moo-cow · 06-08-2006, 12:04 PM

I have a file like this:

saegh iubiae iabezu PATTERN cbizge atvet faw
efenmi PATTERN beub htp rubwi riwbr
iauebiubg ubneiu PATTERN aoihgr zvezg
...

I want to sort the lines of the file with the field to the right of PATTERN as the sort key. The correctly sorted example file would look like this:

iauebiubg ubneiu PATTERN aoihgr zvezg
efenmi PATTERN beub htp rubwi riwbr
saegh iubiae iabezu PATTERN cbizge atvet faw

Any idea how to accomplish this?
Thanks!
moo-cow

spirit receiver · 06-08-2006, 01:17 PM

You could use the following to copy the expression following PATTERN to the beginning of each line:

Code:

sed -e "s/\(.*PATTERN \([^ ]\+\).*\)/\2 \1/"

Then sort the result and remove the copied part again using "cut".
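For anyone who wants the three steps spelled out, here is a minimal sketch of the whole sed/sort/cut pipeline. The function name sort_by_key and its arguments are made up for illustration and are not from the post above. It assumes fields are separated by single spaces, writes the GNU-only `\+` as the portable `[^ ][^ ]*`, and (a choice of this sketch) uses `sed -n ... p` so that lines without the pattern are dropped rather than mangled by cut.

```shell
# sort_by_key is a hypothetical helper name, not part of the original post.
# $1 = literal marker word (e.g. PATTERN), $2 = file to sort.
sort_by_key() {
    # 1. Copy the word after the marker to the front of each line
    #    (-n ... p prints only lines where the substitution matched).
    # 2. Sort on that first field only; LC_ALL=C gives plain byte order.
    # 3. Cut the copied key off again, keeping field 2 onward.
    sed -n -e "s/\(.*$1 \([^ ][^ ]*\).*\)/\2 \1/p" "$2" |
        LC_ALL=C sort -k1,1 |
        cut -d' ' -f2-
}
```

Called as `sort_by_key PATTERN filename`. Lines that share a key are ordered by the rest of the line, since sort falls back to comparing whole lines on ties.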

MensaWater · 06-08-2006, 01:19 PM

Nice little challenge there.

This works but may not be the most elegant solution:

Code:

# Collect the word after PATTERN from every line, sorted, then grep for each in turn
for NEXTWORD in `awk -FPATTERN '{print $2}' filename |awk '{print $1}' |sort`
do grep " PATTERN $NEXTWORD " filename
done

In the above "filename" would be replaced by whatever your file's name is. "PATTERN" would be whatever your pattern is.

NEXTWORD is an arbitrary name for the variable - you can call it BILLYBOB or anything else you prefer.

awk -FPATTERN '{print $2}' says to print anything that occurs after your PATTERN in the file. This of course starts with the next word following PATTERN. (-F tells it to use PATTERN as the delimiter instead of white space).

This is then piped into the next awk, which prints only the first word of the previous awk's output: the word you want to sort on. (Note this uses white space as the delimiter because, as noted above, that is the default for awk. If your next word contains any white space you'd have to figure out a different delimiter to use.)

It then sorts the list of next words alphabetically using the sort command.

Finally it greps for any line in which a next word found by the awk/awk/sort combo directly follows your PATTERN (and for good measure puts a space around the pattern and the next word so it doesn't accidentally hit on an embedded word).

This will work fine so long as each next word follows PATTERN only once in your file. If one appears twice it will still work relative to other next words, but the two lines themselves may not be in the order you want.
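To see what the loop actually iterates over, the key-extraction stage can be run on its own. This is only a sketch: extract_keys is a made-up name, and the marker word PATTERN is hard-coded as in the example file.

```shell
# extract_keys is a hypothetical helper name used only for this illustration.
# $1 = file; prints the word after PATTERN from each line, sorted.
extract_keys() {
    awk -FPATTERN '{print $2}' "$1" |   # everything after PATTERN
        awk '{print $1}' |              # first whitespace-separated word of that
        sort                            # alphabetical order
}
```

On the three-line example file this prints aoihgr, beub and cbizge, one per line. The list is not de-duplicated, so a key that occurs on two lines shows up twice.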

spirit receiver · 06-08-2006, 01:37 PM

A small remark on jlightner's solution: It seems to me that a NEXTWORD appearing twice will also make each corresponding line appear twice in the result, as the file will be grepped twice for NEXTWORD. You can avoid this by piping the output of "sort" through "uniq".
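A sketch of the loop with "uniq" added, wrapped in a function so it can be tried on any file. The name sort_by_next_word is invented for this example, and PATTERN is hard-coded as in the posts above. Note that grep still treats the next word as a regular expression, so this assumes the keys contain no regex special characters.

```shell
# sort_by_next_word is a hypothetical name; $1 = file to sort.
sort_by_next_word() {
    # Build the sorted list of words that follow PATTERN, with duplicates
    # removed so that each key is grepped for exactly once.
    for NEXTWORD in $(awk -FPATTERN '{print $2}' "$1" | awk '{print $1}' | sort | uniq)
    do
        # Print every line in which this key directly follows PATTERN.
        grep " PATTERN $NEXTWORD " "$1"
    done
}
```

Lines that share a key come out in the order they appear in the file, because grep prints its matches in file order.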

moo-cow · 06-11-2006, 06:12 PM

Works great, thanks for your help!

MensaWater · 06-12-2006, 09:30 AM

Quote:

Originally Posted by spirit receiver

A small remark on jlightner's solution: It seems to me that a NEXTWORD appearing twice will also make each corresponding line appear twice in the result, as the file will be grepped twice for NEXTWORD. You can avoid this by piping the output of "sort" through "uniq".

It won't appear twice unless it is in the file twice. I think you're confusing this with the standard "ps -ef |grep WORD" solution where you have to remember to grep out the word grep itself. As an FYI I had tested it against his example before posting it.

Restated: my solution can beat up your solution

spirit receiver · 06-12-2006, 09:48 AM

I was talking about the following effect.

This is the content of the file to be sorted:

Code:

saegh iubiae iabezu PATTERN cbizge atvet faw
efenmi PATTERN beub htp rubwi riwbr
iauebiubg ubneiu PATTERN aoihgr zvezg
efenmi PATTERN beub faw zvezg

Note that there are two lines with the key "beub". But if your script is applied, it will return four lines with that key:

Code:

iauebiubg ubneiu PATTERN aoihgr zvezg
efenmi PATTERN beub htp rubwi riwbr
efenmi PATTERN beub faw zvezg
efenmi PATTERN beub htp rubwi riwbr
efenmi PATTERN beub faw zvezg
saegh iubiae iabezu PATTERN cbizge atvet faw

Only when I add "uniq" as stated above does the output look as follows, which is probably what was intended:

Code:

iauebiubg ubneiu PATTERN aoihgr zvezg
efenmi PATTERN beub htp rubwi riwbr
efenmi PATTERN beub faw zvezg
saegh iubiae iabezu PATTERN cbizge atvet faw

To sum up: I win.

MensaWater · 06-12-2006, 09:57 AM

Quote:

To sum up: I win.

Only in a Judo sort of way - you used my awk against me instead of your sed

Actually you made a good point. I was confused by you saying "NEXTWORD twice" because I was thinking you meant I used the variable twice - you meant the word the variable represented could have appeared twice.

spirit receiver · 06-12-2006, 11:26 AM

Quote:

Originally Posted by jlightner

Only in a Judo sort of way

I even considered using Voodoo at first, so you should be happy with that.