LinuxQuestions.org


brownflamigo1 09-14-2011 08:21 AM

Multiple column sort
 
Hi,

I have some data in the following format:
Code:

1298501934.311 42.048
1298501934.311 60.096
1298501934.311 64.128
1298501934.311 64.839
1298501944.203 28.352
1298501966.283 6.144
1298501972.900 0
1298501972.939 0
1298501972.943 0
1298501972.960 0
1298501972.961 0
1298501972.964 0
1298501973.964 28.636
1298501974.215 27.52
1298501974.407 25.984
1298501974.527 27.072
1298501974.527 31.168
1298501974.591 30.144
1298501974.591 31.296
1298501974.83 27.605
1298501975.804 28.096
1298501976.271 23.879
1298501978.488 25.472
1298501978.744 25.088
1298501978.808 25.088
1298501978.936 26.24
1298501979.123 26.048
1298501980.470 23.75
1298501980.86 17.53
1298501982.392 22.336
1298501990.199 8.064
1298501997.943 0.256
1298501997.943 0.448
1298501997.943 0.512
1298501997.943 5.952
1298501997.946 0.448
1298501997.946 0.576
1298501997.946 5.44

My goal is to get the maximum value from the right column for each unique value in the left column. For instance, after processing the following 4 lines:
Code:

1298501997.943 0.256
1298501997.943 0.448
1298501997.943 0.512
1298501997.943 5.952

I would like to get just the last line,
Code:

1298501997.943 5.952
since "5.952" is the largest value for "1298501997.943".

Similarly, for the following lines:
Code:

1298501997.946 0.448
1298501997.946 0.576
1298501997.946 5.44

I would like to get:
Code:

1298501997.946 5.44
And for:
Code:

1298501990.199 8.064
simply:
Code:

1298501990.199 8.064
and so on...

I tried searching for hints in awk/uniq/etc., but I'm not even sure how to formulate the query.
I could write a Python script, but it feels like awk or some other standard tool would be more efficient (especially since I have a lot of data - millions, or tens of millions, of lines).

PS: Is there any Python module for text processing scenarios like that?

Thank you

druuna 09-14-2011 08:44 AM

Hi,

Is this what you are looking for:
Code:

sort -k1n -k2nr infile | awk 'BEGIN { seen = "" } { if ( $1 != seen ) { print ; seen = $1 } next }'
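The sort arranges each first-column group so that its largest second-column value comes first, and the awk part prints only the first line it sees for each new first-column value. A quick check against a subset of your sample lines should give:
Code:

printf '%s\n' '1298501997.943 0.256' '1298501997.943 5.952' '1298501990.199 8.064' | sort -k1n -k2nr | awk 'BEGIN { seen = "" } { if ( $1 != seen ) { print ; seen = $1 } next }'
1298501990.199 8.064
1298501997.943 5.952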
Hope this helps.

brownflamigo1 09-14-2011 08:47 AM

Quote:

Originally Posted by druuna (Post 4471455)
Hi,

Is this what you are looking for:
Code:

sort -k1n -k2nr infile | awk 'BEGIN { seen = "" } { if ( $1 != seen ) { print ; seen = $1 } next }'
Hope this helps.

Yes, exactly what I was looking for! Thank you very much!

druuna 09-14-2011 08:54 AM

You're welcome :)

grail 09-14-2011 11:39 AM

For the exercise, here is a Ruby solution:
Code:

ruby -ane 'BEGIN { b = {} }; b[$F[0]] = $F[1] if b[$F[0]].nil? || b[$F[0]].to_f < $F[1].to_f; END { b.each { |k, v| puts "#{k} #{v}" } }' file

Nominal Animal 09-14-2011 11:43 AM

Druuna's answer is excellent. It is the portable, simple solution.

Since you expressed an interest in awk programming, here is an alternate solution, with explanations:
Code:

awk '(NF>1) { if ($1 in data) {
                  if ($2 > data[$1]) data[$1] = $2
              } else
                  data[$1] = $2
            }
        END { for (i in data)
                  printf("%s %s\n", i, data[i])
            }' infile

The (NF>1) rule means that the following action is only done for records (lines) that contain more than one field (word). If the first field is a key in the associative array data and the second field is larger than the existing value, the value is updated. If the first field is not a key, then the value is added to the array.

The ($1 in data) test is needed because awk treats a nonexistent array element as zero (or the empty string). Without the test, the comparison against that phantom zero would make the script ignore negative numbers! With it, the first value seen for a key (the second field) is always accepted, even a string, and is then updated whenever awk sees anything it considers "larger". This works for numbers as well as strings.
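For instance, a quick check of the negative-number case (hypothetical sample values; the output should be key -2.25):
Code:

printf '%s\n' 'key -4.5' 'key -2.25' |
awk '(NF>1) { if ($1 in data) {
                  if ($2 > data[$1]) data[$1] = $2
              } else
                  data[$1] = $2
            }
        END { for (i in data)
                  printf("%s %s\n", i, data[i])
            }'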

While this streams the data (in linear time with respect to input size), it does keep the results in memory until the end. For each key (first field), only the largest value seen so far is kept, so it should typically use less memory than presorting the input with e.g. the sort command. For normal data sets this does not matter, but if you have a humongous data set with a lot of duplicate keys, this method may be faster.

The END rule is only run once after everything else. It will loop over all keys in the data array (first fields from the input), and output the value for each one. Note that I use %s for strings; the fields are only interpreted when comparing, never converted from/to numbers, so you are assured to get the original values. (Note that Druuna's solution does the same, it also always yields the original data, without conversions.)

The output order is arbitrary by default. Both workarounds for that are gawk extensions (mawk has neither): with gawk 4.0 or later, you can add PROCINFO["sorted_in"] = "@ind_num_asc" at the beginning of the END rule to get the output sorted by key (a sketch of that variant appears below). Alternatively, you can use gawk's asorti() and modify the END rule, to
Code:

awk '(NF>1) { if ($1 in data) {
                  if ($2 > data[$1]) data[$1] = $2
              } else
                  data[$1] = $2
            }
        END { n = asorti(data, key)
              for (i = 1; i <= n; i++)
                  printf("%s %s\n", key[i], data[key[i]])
            }' infile

where the END rule first sorts the keys (first fields) into the array key, then outputs the values (second fields) in that order. Note that asorti() sorts the keys as strings, not numbers.
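For completeness, the PROCINFO variant mentioned above would look like this (a sketch; PROCINFO["sorted_in"] needs gawk 4.0 or later):
Code:

awk '(NF>1) { if ($1 in data) {
                  if ($2 > data[$1]) data[$1] = $2
              } else
                  data[$1] = $2
            }
        END { PROCINFO["sorted_in"] = "@ind_num_asc"
              for (i in data)
                  printf("%s %s\n", i, data[i])
            }' infile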

Unfortunately, POSIX awk does not specify any portable sort function. While you could obviously code one yourself, that would be counter-productive: the sort command does it much more efficiently. If you have few duplicate keys, I'd recommend Druuna's solution. If you have lots and lots of duplicates, using this version and sorting the result may be a bit faster, or use less RAM.

grail 09-14-2011 12:36 PM

And a slightly condensed version of druuna's:
Code:

sort -k1n -k2nr file | awk '(old != $1)?old=$1:0'
And an alternate ruby:
Code:

ruby -e 'b = nil; $<.readlines.sort_by { |l| l.split.map(&:to_f) }.each { |e| puts b if b && b.split[0] != e.split[0]; b = e }; puts b' file

crulat 09-15-2011 09:05 PM

Cool, good solution!

