Druuna's answer is excellent. It is the portable, simple solution.
Since you expressed an interest in awk programming, here is an alternate solution, with explanations:
Code:
awk '(NF>1) { if ($1 in data) {
        if ($2 > data[$1]) data[$1] = $2
    } else
        data[$1] = $2
}
END { for (i in data)
        printf("%s %s\n", i, data[i])
}' infile
The (NF>1) rule means that the following action is only run for records (lines) that contain more than one field (word). If the first field is already a key in the associative array data and the second field is larger than the stored value, the value is updated. If the first field is not yet a key, the value is added to the array.
This check is needed because awk treats a nonexistent array element as empty (zero in a numeric comparison). Without it, negative numbers could never exceed that implicit zero, so they would be ignored! With the check, anything is accepted as the first value (second field), even a string, and it is updated whenever awk sees anything it considers "larger". This works for numbers as well as strings.
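To see why the check matters, here is a small demonstration with made-up input where one key has only negative values; with the "$1 in data" check, the first value is stored as-is and the maximum comes out correctly:

```shell
# Key "a" has only negative values. Without the "$1 in data" check,
# the nonexistent data["a"] would compare as zero, -5 and -3 would never
# replace it, and "a" would print an empty value. With the check, -5 is
# stored on first sight and later replaced by the larger -3.
printf 'a -5\na -3\nb 10\n' | awk '(NF>1) { if ($1 in data) {
        if ($2 > data[$1]) data[$1] = $2
    } else
        data[$1] = $2
}
END { for (i in data)
        printf("%s %s\n", i, data[i])
}'
```

(The for-in output order is unspecified, so the two lines "a -3" and "b 10" may appear in either order.)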
While this streams the data (in linear time with respect to the input size), it does keep the data in memory until the end. For each key (first field), only the largest value seen so far is kept, so it typically uses less memory than presorting with e.g. the sort command. For normal data sets this does not matter, but if you have a humongous data set with a lot of duplicate keys, this method may be faster.
The END rule is run only once, after everything else. It loops over all keys in the data array (the first fields from the input) and outputs the value for each one. Note that I use %s for strings; the fields are only interpreted when comparing, never converted from/to numbers, so you are assured to get the original values. (Note that Druuna's solution does the same; it also always yields the original data, without conversions.)
The output order is random by default. If you use GNU awk (gawk), you can add PROCINFO["sorted_in"] = "@ind_num_asc" at the beginning of the END rule to get the output sorted by key; mawk and other awks do not support PROCINFO. Alternatively, in gawk you can modify the END rule to
Code:
awk '(NF>1) { if ($1 in data) {
        if ($2 > data[$1]) data[$1] = $2
    } else
        data[$1] = $2
}
END { n = asorti(data, key)
      for (i = 1; i <= n; i++)
          printf("%s %s\n", key[i], data[key[i]])
}' infile
where the END rule first sorts the keys (first fields) into the array key, then outputs the keys and their values (second fields) in that order.
Unfortunately POSIX does not specify any portable sort functions. While you can obviously code one yourself, it would be counter-productive; the sort command does it much more efficiently. If you have few duplications, I'd recommend Druuna's solution. If you have lots and lots of duplicate keys, using this version and sorting the results may be a bit faster or use less RAM.
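As a concrete sketch of that last option (sample input invented), the portable version simply pipes its de-duplicated output to the external sort command:

```shell
# Portable: POSIX awk keeps the per-key maximum, then sort -n orders the
# already de-duplicated output numerically by key. sort only ever sees
# one line per key, which is the point of doing the reduction in awk first.
printf '2 b\n1 z\n1 a\n' | awk '(NF>1) { if ($1 in data) {
        if ($2 > data[$1]) data[$1] = $2
    } else
        data[$1] = $2
}
END { for (i in data)
        printf("%s %s\n", i, data[i])
}' | sort -n
```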