LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   Sorting on an interior field (https://www.linuxquestions.org/questions/programming-9/sorting-on-an-interior-field-4175434547/)

danielbmartin 10-28-2012 07:52 PM

Sorting on an interior field
 
Have: a file of records having this format...
Code:

Name SexAge~StreetAddress~ZipCode
Example...
Code:

Bush, George Herbert Walker M70~1600 Pennsylvania Ave ~20500
I want to sort the file on the SexAge field. Name may be any number of blank-delimited words. We can be sure that SexAge is there, it is preceded by a blank, and is followed by a tilde (~). The Sex letter may be M, F, or U. Age may be two or three numerics.

One approach: devise a clever RegEx would replace the blank which precedes SexAge with a tilde.

Please advise. sed or awk solutions are preferred.

pixellany 10-28-2012 09:11 PM

First, because you have spaces within fields, you need field separators which are not spaces. Then, to sort the file, you can use---you guessed it---the sort command!

I created a test file to demonstrate:
Code:

[mherring@herring_desk play]$ more list
smith, fred_M70_address_otherstuff
jones, bill_U78_address_etc
green, sally_F56_address_more_stuff
johnson, betty_F34_address_etc.
[mherring@herring_desk play]$ sort -t _ -k 2 list
johnson, betty_F34_address_etc.
green, sally_F56_address_more_stuff
smith, fred_M70_address_otherstuff
jones, bill_U78_address_etc

sed or awk are not really for sorting, but sed will easily help you get proper field separators. e.g., to stick with the tilde (~), you just need to add one in front of the gender/age field.
Example:
Code:

echo "Bush, George Herbert Walker M70~1600 Pennsylvania Ave ~20500" | sed 's/ \([MFU][0-9]*~\)/~\1/'
Bush, George Herbert Walker~M70~1600 Pennsylvania Ave ~20500
[mherring@herring_desk play]$ echo "Clinton, Hillary F65~address" |sed 's/ \([MFU][0-9]*~\)/~\1/'
Clinton, Hillary~F65~address
[mherring@herring_desk play]$ echo "frankenstein, igor U967~address" |sed 's/ \([MFU][0-9]*~\)/~\1/'
frankenstein, igor~U967~address


danielbmartin 10-28-2012 09:44 PM

[QUOTE=pixellany;4817119]
Code:

sed 's/ \([MFU][0-9]*~\)/~\1/'
Your RegEx is exactly what was needed. Thank you. SOLVED!

Daniel B. Martin

rigor 10-28-2012 10:00 PM

A possible awk solution
 
I put these lines in a file named data.txt:

Code:

Wells, Herbert George M99~1 Main St ~60126
Bush, George Herbert Walker M70~1600 Pennsylvania Ave ~20500
Greenburg, Anna Olivia F28~15 Baker Court ~90156
Fienstock, Peter Gabriel M1~19643 GreenGables Lane~15108
Zontar, The Thing From Venus U999~99 Strange Way~00000


and this program in a file named sort_ifld.gawk:

Code:

BEGIN {
        FS = "~" ;
      }

{
    field_count = split(  $1 ,  fields ,  /[ ]/  ) ;

    raw_key = fields[ field_count ] ;

    formatted_key = sprintf( "%s%03d" , substr(  raw_key ,  1 ,  1  ) ,  substr(  raw_key ,  2  )  ) ;

    lines[  formatted_key  ]  =  $0 ;
}


END {
        line_count = asorti(  lines ,  sorted_lines  ) ;

        for (  line_num = 1 ;  line_num  <=  line_count ;  line_num++  )
        {
            print lines[  sorted_lines[ line_num ]  ]  ;
        }
    }

and run the program with this command:

Code:

gawk -f  sort_ifld.gawk  <  data.txt
it produces this output:

Code:

Greenburg, Anna Olivia F28~15 Baker Court ~90156
Fienstock, Peter Gabriel M1~19643 GreenGables Lane~15108
Bush, George Herbert Walker M70~1600 Pennsylvania Ave ~20500
Wells, Herbert George M99~1 Main St ~60126
Zontar, The Thing From Venus U999~99 Strange Way~00000


The problem with just a simple sort, is that the age won't be treated as a number. Instead it will be treated as text.

If you just run sort on this list of numbers:

Code:

1
11
12
19
155

without telling sort to treat them as numbers, then they will be sorted like this:

Code:

1
11
12
155
19

which obviously isn't proper numeric order. But if you run a sort, and also normalize the field size of the numbers so they all have the same number of digits, they can be sorted properly:

Code:

001
011
012
019
155

The awk program I illustrated does that. Although the question was posted in a Non-*NIX Forum the specific form of awk I used, gawk, is also available from Cygwin for use with MS-Windows.

pixellany 10-29-2012 07:08 AM

Quote:

The problem with just a simple sort, is that the age won't be treated as a number. Instead it will be treated as text.
Good point---I did not think of that!!

I wonder why the sort command does not have an option to deal with numbers?

danielbmartin 10-29-2012 09:31 AM

Quote:

Originally Posted by kakaka (Post 4817132)
... with just a simple sort, is that the age won't be treated as a number. Instead it will be treated as text.

The data is abstracted from the voter registration list of my county. (This file is a public record, available for download by anyone.) The lowest age value is 18. There are very few voters older than 100 so three-digit ages aren't much of a problem.

Daniel B. Martin

grail 10-29-2012 12:08 PM

So just for giggles, here is ruby alternative (hopefully with enough comments for others to follow):
Code:

ruby -ne 'BEGIN{h=Hash.new{|k, p| k[p] = []}}; # Set up new hash of arrays
> $_.scan(/(.*)([FMU])(\d{2,3})(.*)/).each{    # break line into parts
> |x| h[x[1]][x[2].to_i] = x };                # assign parts to hash using F,M,U as keys and ages as array indexes
> END{"FMU".each_char{                        # loop over each letter F,M,U in order
> |c| h[c].compact.each{|a| puts a.join}}}    # for each key in hash, compact array (ie remove indices without values) and loop over indices and rejoin items
> ' file

As the ages are converted to numbers the order will be numeric ;)

Of course a more complex solution would be required if we were to assume (correctly based on a voter format) that there are multiple people of the same age and sex.

rigor 10-29-2012 12:47 PM

Quote:

Originally Posted by pixellany (Post 4817344)
I wonder why the sort command does not have an option to deal with numbers?

I didn't mean that the sort command could not be used. The sort command in a "GNU" derived environment has various options to handle numeric sorts. I meant that sorting without using the numeric sorting options didn't seem entirely correct. Also, I felt that since a solution using sed or awk was requested, a solution using sed or awk was most appropriate.

danielbmartin 10-29-2012 02:16 PM

Quote:

Originally Posted by kakaka (Post 4817565)
... I felt that since a solution using sed or awk was requested, a solution using sed or awk was most appropriate.

The problem statement says ...
Quote:

One approach: devise a clever RegEx would replace the blank which precedes SexAge with a tilde.
I knew that sort would do the job if that tilde could be inserted. The sed contributed by pixellany did exactly that. The short-and-sweet solution is ...
Code:

echo; echo "Sort on Sexage"
sed 's/ \([MFU][0-9]*~\)/~\1/' $Work1  \
|sort -t "~" -k 2                      \
|sed 's/~/  /g'                        \
> $Work2

The first sed inserts the tildes.
The sort reorders the file based in the second tilde-delimited field.
The second sed replaces all tildes with pairs of blanks

Thanks to everyone who contributed ideas on this subject.

Daniel B. Martin

David the H. 10-30-2012 12:24 PM

To modify Pixellany's version to sort by the number:

Code:

sort -t _ -k 2.2,2n
This restricts the sorting from the second character of the second field to the end of the second field. If you don't specify the end point of the -k option it will sort the entire line from that point. And the 'n' makes it a numeric sort.

Read sort's info page for full details on how to limit your sorting fields.

danielbmartin 10-30-2012 12:42 PM

[QUOTE=David the H.;4818326]
Code:

sort -t ~ -k 2.2,2n
Tried it; like it! Thank you for this useful refinement.

Daniel B. Martin


All times are GMT -5. The time now is 07:09 PM.