Druuna's answer is excellent. It is the portable, simple solution.
Since you expressed an interest in awk programming, here is an alternate solution, with explanations:
Code:
awk '(NF>1) { if ($1 in data) {
        if ($2 > data[$1]) data[$1] = $2
    } else
        data[$1] = $2
}
END { for (i in data)
        printf("%s %s\n", i, data[i])
}' infile
The (NF>1) rule means that the following action is only run for records (lines) that contain more than one field (word). If the first field is already a key in the associative array data and the second field is larger than the stored value, the value is updated. If the first field is not yet a key, the value is added to the array.
This check is needed because awk treats a nonexistent array element as empty (zero in a numeric comparison). Without it, negative numbers could never exceed that implicit zero, so they would be ignored! With the check, anything is accepted as the first value (second field), even a string, and it is updated whenever awk sees anything it considers "larger". This works for numbers as well as strings.
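To see why the check matters, here is a small demonstration with made-up input where one key has only negative values; with the "$1 in data" check, the first value is stored as-is and the maximum comes out correctly:

```shell
# Key "a" has only negative values. Without the "$1 in data" check,
# the nonexistent data["a"] would compare as zero, -5 and -3 would never
# replace it, and "a" would print an empty value. With the check, -5 is
# stored on first sight and later replaced by the larger -3.
printf 'a -5\na -3\nb 10\n' | awk '(NF>1) { if ($1 in data) {
        if ($2 > data[$1]) data[$1] = $2
    } else
        data[$1] = $2
}
END { for (i in data)
        printf("%s %s\n", i, data[i])
}'
```

(The for-in output order is unspecified, so the two lines "a -3" and "b 10" may appear in either order.)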
While this streams the data (in linear time with respect to the input size), it does keep the data in memory until the end. For each key (first field), only the largest value seen so far is kept, so it typically uses less memory than presorting with e.g. the sort command. For normal data sets this does not matter, but if you have a humongous data set with a lot of duplicate keys, this method may be faster.
The END rule is run only once, after everything else. It loops over all keys in the data array (the first fields from the input) and outputs the value for each one. Note that I use %s for strings; the fields are only interpreted when comparing, never converted from/to numbers, so you are assured to get the original values. (Note that Druuna's solution does the same; it also always yields the original data, without conversions.)
The output order is random by default. If you use GNU awk (gawk), you can add PROCINFO["sorted_in"] = "@ind_num_asc" at the beginning of the END rule to get the output sorted by key; mawk and other awks do not support PROCINFO. Alternatively, in gawk you can modify the END rule to
Code:
awk '(NF>1) { if ($1 in data) {
        if ($2 > data[$1]) data[$1] = $2
    } else
        data[$1] = $2
}
END { n = asorti(data, key)
      for (i = 1; i <= n; i++)
          printf("%s %s\n", key[i], data[key[i]])
}' infile
where the END rule first sorts the keys (first fields) into the array key, then outputs the keys and their values (second fields) in that order.
Unfortunately POSIX does not specify any portable sort functions. While you can obviously code one yourself, it would be counter-productive; the sort command does it much more efficiently. If you have few duplications, I'd recommend Druuna's solution. If you have lots and lots of duplicate keys, using this version and sorting the results may be a bit faster or use less RAM.
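As a concrete sketch of that last option (sample input invented), the portable version simply pipes its de-duplicated output to the external sort command:

```shell
# Portable: POSIX awk keeps the per-key maximum, then sort -n orders the
# already de-duplicated output numerically by key. sort only ever sees
# one line per key, which is the point of doing the reduction in awk first.
printf '2 b\n1 z\n1 a\n' | awk '(NF>1) { if ($1 in data) {
        if ($2 > data[$1]) data[$1] = $2
    } else
        data[$1] = $2
}
END { for (i in data)
        printf("%s %s\n", i, data[i])
}' | sort -n
```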