LinuxQuestions.org
Old 09-14-2011, 08:21 AM   #1
brownflamigo1
Member
 
Registered: Jun 2007
Distribution: Ubuntu
Posts: 90

Rep: Reputation: 15
Multiple column sort


Hi,

I have some data in the following format:
Code:
1298501934.311 42.048
1298501934.311 60.096
1298501934.311 64.128
1298501934.311 64.839
1298501944.203 28.352
1298501966.283 6.144
1298501972.900 0
1298501972.939 0
1298501972.943 0
1298501972.960 0
1298501972.961 0
1298501972.964 0
1298501973.964 28.636
1298501974.215 27.52
1298501974.407 25.984
1298501974.527 27.072
1298501974.527 31.168
1298501974.591 30.144
1298501974.591 31.296
1298501974.83 27.605
1298501975.804 28.096
1298501976.271 23.879
1298501978.488 25.472
1298501978.744 25.088
1298501978.808 25.088
1298501978.936 26.24
1298501979.123 26.048
1298501980.470 23.75
1298501980.86 17.53
1298501982.392 22.336
1298501990.199 8.064
1298501997.943 0.256
1298501997.943 0.448
1298501997.943 0.512
1298501997.943 5.952
1298501997.946 0.448
1298501997.946 0.576
1298501997.946 5.44
My goal is to get the maximum value from the right column for each unique value in the left column. For instance, after processing the following 4 lines:
Code:
1298501997.943 0.256
1298501997.943 0.448
1298501997.943 0.512
1298501997.943 5.952
I would like to get just the last line,
Code:
1298501997.943 5.952
since "5.952" is the largest value for "1298501997.943".

Similarly, for the following lines:
Code:
1298501997.946 0.448
1298501997.946 0.576
1298501997.946 5.44
I would like to get:
Code:
1298501997.946 5.44
And for:
Code:
1298501990.199 8.064
simply:
Code:
1298501990.199 8.064
and so on...

I tried searching for hints on awk/uniq/etc., but I'm not even sure how to formulate the query.
I could write a Python script, but it feels like awk or some other standard tool would be more efficient (especially since I have a lot of data - millions, or tens of millions, of lines).

PS: Is there any Python module for text processing scenarios like that?

Thank you

Last edited by brownflamigo1; 09-14-2011 at 08:31 AM.
 
Old 09-14-2011, 08:44 AM   #2
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2405
Hi,

Is this what you are looking for:
Code:
sort -k1n -k2nr infile | awk 'BEGIN { seen = "" } { if ( $1 != seen ) { print ; seen = $1 } next }'
Hope this helps.
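In case it helps to see it in action, here is the same pipeline run on four of the sample lines, fed through a pipe instead of a file (just a sanity check; output shown in the comments):
Code:

```shell
printf '%s\n' \
  '1298501997.943 0.256' \
  '1298501997.943 5.952' \
  '1298501997.946 0.448' \
  '1298501997.946 5.44' |
sort -k1n -k2nr |
awk 'BEGIN { seen = "" } { if ( $1 != seen ) { print ; seen = $1 } next }'
# Prints:
# 1298501997.943 5.952
# 1298501997.946 5.44
```

sort puts the largest value first within each key, and awk keeps only the first line it sees for each key.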
 
2 members found this post helpful.
Old 09-14-2011, 08:47 AM   #3
brownflamigo1
Member
 
Registered: Jun 2007
Distribution: Ubuntu
Posts: 90

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by druuna View Post
Hi,

Is this what you are looking for:
Code:
sort -k1n -k2nr infile | awk 'BEGIN { seen = "" } { if ( $1 != seen ) { print ; seen = $1 } next }'
Hope this helps.
Yes, exactly what I was looking for! Thank you very much!
 
Old 09-14-2011, 08:54 AM   #4
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2405
You're welcome
 
Old 09-14-2011, 11:39 AM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,008

Rep: Reputation: 3193
For the exercise, here is a Ruby solution:
Code:
ruby -ane 'BEGIN{b = {}}; b.merge!({ $F[0] => $F[1] }) if b[$F[0]].nil? || b[$F[0]].to_f < $F[1].to_f; END{b.each do |k, v| puts "#{k} #{v}" end}' file
 
Old 09-14-2011, 11:43 AM   #6
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Druuna's answer is excellent. It is the portable, simple solution.

Since you expressed an interest in awk programming, here is an alternate solution, with explanations:
Code:
awk '(NF>1) { if ($1 in data) {
                  if ($2 > data[$1]) data[$1] = $2
              } else
                  data[$1] = $2
            }
        END { for (i in data)
                  printf("%s %s\n", i, data[i])
            }' infile
The (NF>1) rule means that the following action only runs for records (lines) that contain more than one field (word). If the first field is already a key in the associative array data and the second field is larger than the stored value, the stored value is updated. If the first field is not yet a key, the value is simply added to the array.

The ($1 in data) test is needed because awk treats a nonexistent array element as empty, which compares as zero. Without the test, the first value seen for a key would be compared against that implicit zero, so keys whose values are all negative would be ignored! With the test, anything is accepted as the first value (second field) for a key, even a string, and it is updated whenever awk sees something it considers "larger". This works for numbers as well as strings.

While this streams the data (in linear time with respect to input size), it does keep the per-key maxima in memory until the end. Since only the largest value seen so far is kept for each key (first field), it should typically use less memory than presorting with e.g. the sort command. For normal data sets this does not matter, but if you have a humongous data set with a lot of duplicate keys, this method may be faster.

The END rule runs only once, after everything else. It loops over all keys in the data array (the first fields from the input) and outputs the value for each one. Note that I use %s for strings: the fields are only interpreted when comparing, never converted from or to numbers, so you are assured of getting the original values back. (Druuna's solution does the same; it also always yields the original data, without conversions.)

The output order is unspecified by default. If you use gawk (PROCINFO["sorted_in"] and asorti are gawk extensions, not available in mawk), you can add PROCINFO["sorted_in"] = "@ind_num_asc" at the beginning of the END rule to get the output sorted. Alternatively, you can modify the END rule to use asorti():
Code:
awk '(NF>1) { if ($1 in data) {
                  if ($2 > data[$1]) data[$1] = $2
              } else
                  data[$1] = $2
            }
        END { n = asorti(data, key)
              for (i = 1; i <= n; i++)
                  printf("%s %s\n", key[i], data[key[i]])
            }' infile
where the END rule first sorts the keys (first fields) into the array key, then outputs the values (second fields) in that order.

Unfortunately, POSIX awk does not specify any portable sort functions. While you could obviously code one yourself, that would be counter-productive; the sort command does it much more efficiently. If you have few duplicate keys, I'd recommend Druuna's solution. If you have lots and lots of duplicate keys, using this version and sorting the results afterwards may be a bit faster, or use less RAM.
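Sorting the deduplicated output afterwards only costs time proportional to the number of unique keys. A sketch of that combination (sample data inlined; the awk condensed into a single condition, same behaviour as above):
Code:

```shell
# Let awk keep the per-key maximum, then sort the (much smaller) result.
printf '%s\n' \
  '1298501990.199 8.064' \
  '1298501997.943 0.256' \
  '1298501997.943 5.952' |
awk '(NF>1) { if (!($1 in data) || $2 > data[$1]) data[$1] = $2 }
     END    { for (i in data) print i, data[i] }' |
sort -k1,1n
# Prints:
# 1298501990.199 8.064
# 1298501997.943 5.952
```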
 
Old 09-14-2011, 12:36 PM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,008

Rep: Reputation: 3193
And a slightly condensed version of druuna's:
Code:
sort -k1n -k2nr file | awk '(old != $1)?old=$1:0'
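Here the ternary acts as a bare pattern: when the key changes, the assignment old = $1 evaluates to the new key, which is a non-empty string, so it counts as true and the default action prints the line; otherwise the 0 branch suppresses it. (It would misfire if a key could be the literal string 0.) A quick check, with parentheses added around the assignment for portability and toy keys instead of the real timestamps:
Code:

```shell
printf '%s\n' '1 9' '1 2' '2 5' |
awk '(old != $1) ? (old = $1) : 0'
# Prints:
# 1 9
# 2 5
```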
And an alternate ruby:
Code:
ruby -e 'b = nil; $<.readlines.sort_by { |e| e.split.map(&:to_f) }.each { |e| puts b if b && b.split[0] != e.split[0]; b = e }; puts b' file

Last edited by grail; 09-14-2011 at 12:59 PM.
 
Old 09-15-2011, 09:05 PM   #8
crulat
Member
 
Registered: Sep 2011
Location: BeiJing China
Posts: 34

Rep: Reputation: Disabled
Cool, good solution!
 
  

