LinuxQuestions.org
Support LQ: Use code LQ3 and save $3 on Domain Registration
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Reply
 
Search this Thread
Old 10-28-2012, 07:52 PM   #1
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,057

Rep: Reputation: 284Reputation: 284Reputation: 284
Sorting on an interior field


Have: a file of records having this format...
Code:
Name SexAge~StreetAddress~ZipCode
Example...
Code:
Bush, George Herbert Walker M70~1600 Pennsylvania Ave ~20500
I want to sort the file on the SexAge field. Name may be any number of blank-delimited words. We can be sure that SexAge is there, it is preceded by a blank, and is followed by a tilde (~). The Sex letter may be M, F, or U. Age may be two or three numerics.

One approach: devise a clever RegEx would replace the blank which precedes SexAge with a tilde.

Please advise. sed or awk solutions are preferred.
 
Old 10-28-2012, 09:11 PM   #2
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728
First, because you have spaces within fields, you need field separators which are not spaces. Then, to sort the file, you can use---you guessed it---the sort command!

I created a test file to demonstrate:
Code:
[mherring@herring_desk play]$ more list
smith, fred_M70_address_otherstuff
jones, bill_U78_address_etc
green, sally_F56_address_more_stuff
johnson, betty_F34_address_etc.
[mherring@herring_desk play]$ sort -t _ -k 2 list
johnson, betty_F34_address_etc.
green, sally_F56_address_more_stuff
smith, fred_M70_address_otherstuff
jones, bill_U78_address_etc
sed or awk are not really for sorting, but sed will easily help you get proper field separators. e.g., to stick with the tilde (~), you just need to add one in front of the gender/age field.
Example:
Code:
echo "Bush, George Herbert Walker M70~1600 Pennsylvania Ave ~20500" | sed 's/ \([MFU][0-9]*~\)/~\1/'
Bush, George Herbert Walker~M70~1600 Pennsylvania Ave ~20500
[mherring@herring_desk play]$ echo "Clinton, Hillary F65~address" |sed 's/ \([MFU][0-9]*~\)/~\1/'
Clinton, Hillary~F65~address
[mherring@herring_desk play]$ echo "frankenstein, igor U967~address" |sed 's/ \([MFU][0-9]*~\)/~\1/'
frankenstein, igor~U967~address
 
1 members found this post helpful.
Old 10-28-2012, 09:44 PM   #3
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,057

Original Poster
Rep: Reputation: 284Reputation: 284Reputation: 284
[QUOTE=pixellany;4817119]
Code:
sed 's/ \([MFU][0-9]*~\)/~\1/'
Your RegEx is exactly what was needed. Thank you. SOLVED!

Daniel B. Martin
 
Old 10-28-2012, 10:00 PM   #4
kakaka
Member
 
Registered: Sep 2003
Posts: 382

Rep: Reputation: 86
A possible awk solution

I put these lines in a file named data.txt:

Code:
Wells, Herbert George M99~1 Main St ~60126
Bush, George Herbert Walker M70~1600 Pennsylvania Ave ~20500
Greenburg, Anna Olivia F28~15 Baker Court ~90156
Fienstock, Peter Gabriel M1~19643 GreenGables Lane~15108
Zontar, The Thing From Venus U999~99 Strange Way~00000

and this program in a file named sort_ifld.gawk:

Code:
BEGIN {
        FS = "~" ;
      }

{
    field_count = split(  $1 ,  fields ,  /[ ]/  ) ;

    raw_key = fields[ field_count ] ;

    formatted_key = sprintf( "%s%03d" , substr(  raw_key ,  1 ,  1  ) ,  substr(  raw_key ,  2  )  ) ;

    lines[  formatted_key  ]  =  $0 ;
}


END {
        line_count = asorti(  lines ,   sorted_lines  ) ;

        for (  line_num = 1 ;  line_num  <=  line_count ;  line_num++  )
        {
            print lines[  sorted_lines[ line_num ]  ]  ;
        }
    }
and run the program with this command:

Code:
gawk -f  sort_ifld.gawk  <  data.txt
it produces this output:

Code:
Greenburg, Anna Olivia F28~15 Baker Court ~90156
Fienstock, Peter Gabriel M1~19643 GreenGables Lane~15108
Bush, George Herbert Walker M70~1600 Pennsylvania Ave ~20500
Wells, Herbert George M99~1 Main St ~60126
Zontar, The Thing From Venus U999~99 Strange Way~00000

The problem with just a simple sort, is that the age won't be treated as a number. Instead it will be treated as text.

If you just run sort on this list of numbers:

Code:
1
11
12
19
155
without telling sort to treat them as numbers, then they will be sorted like this:

Code:
1
11
12
155
19
which obviously isn't proper numeric order. But if you run a sort, and also normalize the field size of the numbers so they all have the same number of digits, they can be sorted properly:

Code:
001
011
012
019
155
The awk program I illustrated does that. Although the question was posted in a Non-*NIX Forum the specific form of awk I used, gawk, is also available from Cygwin for use with MS-Windows.
 
1 members found this post helpful.
Old 10-29-2012, 07:08 AM   #5
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728
Quote:
The problem with just a simple sort, is that the age won't be treated as a number. Instead it will be treated as text.
Good point---I did not think of that!!

I wonder why the sort command does not have an option to deal with numbers?
 
Old 10-29-2012, 09:31 AM   #6
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,057

Original Poster
Rep: Reputation: 284Reputation: 284Reputation: 284
Quote:
Originally Posted by kakaka View Post
... with just a simple sort, is that the age won't be treated as a number. Instead it will be treated as text.
The data is abstracted from the voter registration list of my county. (This file is a public record, available for download by anyone.) The lowest age value is 18. There are very few voters older than 100 so three-digit ages aren't much of a problem.

Daniel B. Martin
 
Old 10-29-2012, 12:08 PM   #7
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,411

Rep: Reputation: 1874Reputation: 1874Reputation: 1874Reputation: 1874Reputation: 1874Reputation: 1874Reputation: 1874Reputation: 1874Reputation: 1874Reputation: 1874Reputation: 1874
So just for giggles, here is ruby alternative (hopefully with enough comments for others to follow):
Code:
ruby -ne 'BEGIN{h=Hash.new{|k, p| k[p] = []}}; # Set up new hash of arrays
> $_.scan(/(.*)([FMU])(\d{2,3})(.*)/).each{    # break line into parts
> |x| h[x[1]][x[2].to_i] = x };                # assign parts to hash using F,M,U as keys and ages as array indexes
> END{"FMU".each_char{                         # loop over each letter F,M,U in order
> |c| h[c].compact.each{|a| puts a.join}}}     # for each key in hash, compact array (ie remove indices without values) and loop over indices and rejoin items
> ' file
As the ages are converted to numbers the order will be numeric

Of course a more complex solution would be required if we were to assume (correctly based on a voter format) that there are multiple people of the same age and sex.
 
Old 10-29-2012, 12:47 PM   #8
kakaka
Member
 
Registered: Sep 2003
Posts: 382

Rep: Reputation: 86
Quote:
Originally Posted by pixellany View Post
I wonder why the sort command does not have an option to deal with numbers?
I didn't mean that the sort command could not be used. The sort command in a "GNU" derived environment has various options to handle numeric sorts. I meant that sorting without using the numeric sorting options didn't seem entirely correct. Also, I felt that since a solution using sed or awk was requested, a solution using sed or awk was most appropriate.
 
Old 10-29-2012, 02:16 PM   #9
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,057

Original Poster
Rep: Reputation: 284Reputation: 284Reputation: 284
Quote:
Originally Posted by kakaka View Post
... I felt that since a solution using sed or awk was requested, a solution using sed or awk was most appropriate.
The problem statement says ...
Quote:
One approach: devise a clever RegEx would replace the blank which precedes SexAge with a tilde.
I knew that sort would do the job if that tilde could be inserted. The sed contributed by pixellany did exactly that. The short-and-sweet solution is ...
Code:
echo; echo "Sort on Sexage"
sed 's/ \([MFU][0-9]*~\)/~\1/' $Work1  \
|sort -t "~" -k 2                      \
|sed 's/~/  /g'                        \
> $Work2
The first sed inserts the tildes.
The sort reorders the file based in the second tilde-delimited field.
The second sed replaces all tildes with pairs of blanks

Thanks to everyone who contributed ideas on this subject.

Daniel B. Martin

Last edited by danielbmartin; 10-29-2012 at 02:37 PM. Reason: Cosmetic improvement
 
Old 10-30-2012, 12:24 PM   #10
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946Reputation: 1946
To modify Pixellany's version to sort by the number:

Code:
sort -t _ -k 2.2,2n
This restricts the sorting from the second character of the second field to the end of the second field. If you don't specify the end point of the -k option it will sort the entire line from that point. And the 'n' makes it a numeric sort.

Read sort's info page for full details on how to limit your sorting fields.
 
1 members found this post helpful.
Old 10-30-2012, 12:42 PM   #11
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,057

Original Poster
Rep: Reputation: 284Reputation: 284Reputation: 284
[QUOTE=David the H.;4818326]
Code:
sort -t ~ -k 2.2,2n
Tried it; like it! Thank you for this useful refinement.

Daniel B. Martin
 
  


Reply

Tags
awk, sed, sort


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
LXer: Adventures in Self-Publishing, Chap. 8: Designing the Interior of your Book LXer Syndicated Linux News 0 01-28-2012 01:30 AM
Interior Design Software For Linux treedstang Linux - Software 1 08-12-2010 06:25 AM
LXer: Sweet Home 3D: simple interior design LXer Syndicated Linux News 0 07-17-2008 06:21 PM
sorting a file using bash on a certain field matt007 Linux - Newbie 1 06-24-2008 12:05 PM
rsh is not support interior command ?? Linuxjia Linux - Newbie 1 03-24-2004 10:07 AM


All times are GMT -5. The time now is 04:20 AM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration