Multiple column sort
Hi,
I have some data in the following format: Code:
1298501934.311 42.048
1298501997.943 0.256
1298501997.943 5.952
Whenever several lines share the same first field, I want to keep only the one with the largest second field, i.e. 1298501997.943 5.952 here. Similarly, for the following lines: Code:
1298501997.946 0.448
1298501997.946 5.44
1298501990.199 8.064
1298501990.199 8.064
I would keep 1298501997.946 5.44 and 1298501990.199 8.064. I tried searching for some hints in awk/uniq/etc., but I'm not sure even how to formulate the query. I could write a Python script, but it feels that proceeding with awk or some other standard tools would be more efficient, especially since I have a lot of data (millions or tens of millions of lines).
PS: Is there any Python module for text processing scenarios like that? Thank you |
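For the PS: no special module is needed for the streaming approach; a plain dict already does the bookkeeping. A minimal sketch (standard library only), assuming the whitespace-separated two-column format shown above: Code:
import sys

# For each distinct first field, remember the line with the numerically
# largest second field; print the original strings back out at the end.
best = {}  # key string -> (numeric value, original value string)
for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2:
        continue  # skip blank/malformed lines
    key, value = fields[0], float(fields[1])
    if key not in best or value > best[key][0]:
        best[key] = (value, fields[1])

# Keys are assumed numeric (timestamps), so sort them as floats.
for key in sorted(best, key=float):
    print(key, best[key][1])
This runs in a single pass and keeps only one entry per distinct key in memory, which matters at tens of millions of lines. |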
Hi,
Is this what you are looking for: Code:
sort -k1n -k2nr infile | awk 'BEGIN { seen = "" } { if ( $1 != seen ) { print ; seen = $1 } }' |
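On the sample data above, this should print: Code:
1298501934.311 42.048
1298501990.199 8.064
1298501997.943 5.952
1298501997.946 5.44
The sort puts the largest second field first within each key, and the awk prints only the first line it sees for each distinct first field. |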
Quote:
That does exactly what I wanted, thank you!
|
You're welcome :)
|
For the exercise, here is a Ruby solution:
Code:
ruby -ane 'BEGIN { b = {} }; b.merge!($F[0] => $F[1]) if b[$F[0]].nil? || b[$F[0]].to_f < $F[1].to_f; END { b.each do |k, v| puts "#{k} #{v}" end }' file |
Druuna's answer is excellent. It is the portable, simple solution.
Since you expressed an interest in awk programming, here is an alternate solution, with explanations: Code:
awk '(NF>1) {
    if ($1 in data) {
        if (data[$1] < $2)
            data[$1] = $2
    } else
        data[$1] = $2
}
END {
    for (key in data)
        printf("%s %s\n", key, data[key])
}' infile
The (NF>1) guard skips lines without at least two fields. The ($1 in data) condition is needed because awk considers a nonexistent array element equal to zero. Without the condition, it would ignore negative numbers! This way, anything is accepted as the first value for a key (the second field), even a string, and it is updated whenever awk sees anything it considers "larger". This works for numbers as well as strings.
While this streams the data in linear time with respect to input size, it does keep the results in memory until the end. For each key (first field), only the largest value seen so far is kept, so it typically uses less memory than presorting with e.g. the sort command. For normal data sets this does not matter, but if you have a humongous data set with a lot of duplicate keys, this method may be faster.
The END rule is run only once, after all input has been read. It loops over all keys in the data array (the first fields from the input) and outputs the value for each one. Note that I use %s for strings: the fields are only interpreted when comparing, never converted from/to numbers, so you are assured to get the original values. (Druuna's solution does the same; it also always yields the original data, without conversions.)
The output order is random by default. If you use gawk, you can add PROCINFO["sorted_in"]="@ind_num_asc" at the beginning of the END rule to get the output sorted. With other awks, such as mawk, you can modify the END rule to feed the output through sort: Code:
END {
    for (key in data)
        printf("%s %s\n", key, data[key]) | "sort -n"
}
Unfortunately POSIX does not specify any portable in-awk sort functions. While you could obviously code one yourself, it would be counter-productive; the sort command does it much more efficiently.
If you have few duplicates, I'd recommend Druuna's solution. If you have lots and lots of duplicate keys, this version, with the output sorted afterwards, may be a bit faster or use less RAM. |
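As for the PS in the original post: itertools.groupby from the Python standard library covers exactly this kind of grouped text processing. A sketch of the presort-then-scan approach, under the same two-column assumption as before: Code:
import sys
from itertools import groupby

# Sort rows by key (ascending) and value (descending), then keep the
# first row of each key group, i.e. the row with the largest value.
rows = [f for f in (line.split() for line in sys.stdin) if len(f) > 1]
rows.sort(key=lambda f: (float(f[0]), -float(f[1])))
for key, group in groupby(rows, key=lambda f: f[0]):
    print(" ".join(next(group)))
Unlike the sort command, which spills to temporary files, this holds every row in memory, so for tens of millions of lines the streaming dict version above scales better. |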
And a slightly condensed version of druuna's:
Code:
sort -k1n -k2nr file | awk '(old != $1)?old=$1:0'
Here the ternary acts as the pattern: on a new first field, the assignment old=$1 yields a true (non-empty, non-zero) value, so the line is printed. And a condensed version of the Ruby one: Code:
ruby -e 'b = $<.readlines.sort.each { |e| puts b if b && b.split[0] != e.split[0]; b = e }; puts b[-1]' file
Note that the plain .sort compares whole lines as strings, which happens to work for the fixed-format timestamps here but is not a numeric sort. |
Cool, good solution!
|