Awk is based on the concept of records and fields. By default, each line is a record, and each whitespace-separated word on that line is a separate field. You write rules: snippets of code (actions) that are applied to each record.
There are three types of rules:
- { body }
These rules are run for every record.
- condition { body }
These rules are run only for records for which condition evaluates to true.
- condition
If condition evaluates to true, the record is printed.
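To make the three rule types concrete, here is a tiny example (the sample data is mine) that uses all three at once:

```shell
# Three rule types in one program: unconditional, conditional, bare condition.
printf '1\n2\n3\n' | awk '
{ sum += $1 }        # run for every record
$1 > 1 { big++ }     # run only when the condition holds
$1 % 2 == 1          # bare condition: matching records are printed
END { print sum, big }'
```

This prints the odd records (1 and 3), then 6 2 from the END rule.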
There are two special conditions, BEGIN and END. The former is run before the first input record, and the latter after the last input record. The statement next makes awk skip the rest of the current rule and all remaining rules, moving straight on to the next record. (I don't use next in this one, though.)
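Although next is not used here, a quick illustration of its effect (sample data mine):

```shell
# Records matching the first rule are skipped entirely by next;
# the second rule never sees them.
printf 'skip 1\nkeep 2\nkeep 3\n' | awk '$1 == "skip" { next } { print $2 }'
```

Only 2 and 3 are printed; the first record never reaches the second rule.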
As you can see, awk is quite straightforward. I use the GNU Awk User's Manual exclusively as my awk reference. Although gawk does have extensions and quirks that other awk variants do not support, the differences are quite well marked in the manual. The main advantages of gawk over other awk variants are its asort() and asorti() sorting functions, and its ability to use the ASCII NUL character (\0) as a record or field separator.
To the awk command at hand:
I start with a BEGIN rule.
RS is the regular expression for record separators; I set it to match on any newline convention.
FS is the regular expression for field separators; I set it to match on any linear whitespace. Some awk variants don't like it when the first line is empty, so I start with an empty comment line (
#):
Code:
awk '#
BEGIN {
RS = "(\r\n|\n\r|\r|\n)"
FS = "[\t\v\f ]+"
Next, I explicitly set the
output record separator and field separator too.
Code:
ORS = "\n" # Newline in output
OFS = " " # Field separator in output
}
At this point, we have set the input and output newline conventions and field separators: each input line will be a separate record, and each word a separate field. In output, each newline will end with just LF (Unix/Linux newline convention), and each field will be separated by a space only. (There will be no CR or tabs in the output, no matter what the input.) There are different, more concise ways to achieve the above, but I like this way.
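One way to see the normalization in action (this demonstration is my addition): plain print would output $0 unchanged, so the assignment $1 = $1 is used to force awk to rebuild the record using OFS.

```shell
# CRLF newlines and tab/multi-space separators in,
# LF newlines and single spaces out.
printf 'a\tb\r\nc  d\r\n' | awk 'BEGIN {
    RS = "(\r\n|\n\r|\r|\n)"; FS = "[\t\v\f ]+"
    ORS = "\n"; OFS = " "
} { $1 = $1; print }'
```

Note that treating RS as a regular expression requires gawk or mawk; a strictly POSIX awk treats RS as a single character.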
Next, I define the rule to apply to each record. Since this will only work right with records that have more than two fields, let's limit the rule to such records. Note that NF holds the number of fields in the current record, $0 contains the entire record, and $1 to $NF contain the individual fields.
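The rule would then open with a condition such as NF > 2 (the exact form is my assumption, since only the description is quoted here). Run standalone:

```shell
# Only records with more than two fields trigger the rule body.
printf 'a b\na b c\na b c d\n' | awk 'NF > 2 { print NF ": " $0 }'
```

The two-field record is silently skipped.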
I will use associative arrays keyed on the first two fields. To save typing, I save the key in the variable k. Note that in awk, strings are concatenated by simply writing them one after another. (Awk does NOT add any implicit separator or whitespace in between.)
In awk, (somekey in somearray) is true if the associative array somearray contains the key somekey. All arrays in awk are associative, and all array keys are strings.
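Putting the two together, the key would be built by plain concatenation, e.g. k = $1 OFS $2 (the choice of separator is my assumption), and membership tested with in:

```shell
# Concatenate the first two fields into a key and test membership.
printf 'x y 1\nx y 2\np q 3\n' | awk '{
    k = $1 OFS $2           # concatenation: awk adds nothing in between
    print ((k in seen) ? "old" : "new"), k
    seen[k]                 # merely referencing an element creates it
}'
```

The second record reports "old", because the key x y was already created by the first.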
To keep the input records in order, I save each new key k into the array key, with keys counting the number of unique keys seen so far, starting from one. Also note that in awk you don't need to initialize variables; they default to the empty string or to zero, depending on how they are used.
Code:
if (!(k in key))
key[++keys] = k
To keep a list (actually, a comma-delimited string) of the third fields, I simply append a comma and the third field:
Code:
list[k] = list[k] "," $3
I also keep a count of the number of occurrences of this key:
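The counting statement is not quoted above; presumably it is a plain increment, count[k]++ (my reconstruction). Since unset awk variables default to zero, no initialization is needed:

```shell
# Count occurrences per key; unset elements start at zero.
printf 'x y 1\nx y 2\np q 3\n' | awk '
{ count[$1 " " $2]++ }
END { print count["x y"], count["p q"] }'
```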
After all the input records have been processed, there are keys unique keys in the list and count arrays. The former contains the third fields as a comma-separated string (with a leading comma), and the latter the number of occurrences of each key. The key array contains the keys in the order they were first seen, indexed 1 to keys.
I could have used a simple array traversal loop, for (k in list), but then the keys would be visited in an undefined order. There are ways to control the traversal order, but none that work in all awk variants; that is why I kept track of the keys separately.
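The difference in a nutshell: for (k in arr) visits keys in whatever order the implementation chooses, while the auxiliary key array preserves first-seen order (sample data mine):

```shell
# First-seen order preserved via an indexed key array.
printf 'b\na\nc\na\n' | awk '{
    if (!($1 in seen)) { seen[$1]; key[++keys] = $1 }
} END {
    s = ""
    for (i = 1; i <= keys; i++)
        s = s (i > 1 ? " " : "") key[i]
    print s
}'
```

The keys come out as b a c, exactly as first encountered, regardless of the awk variant used.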
We obviously need to loop over all the unique keys, because each unique key will produce one line of output. The ith key will be k = key[i], with n occurrences of that key:
Code:
END {
for (i = 1; i <= keys; i++) {
k = key[i]
n = count[k]
Since the comma-separated list of third fields for the current key (list[k]) has an extra leading comma, we need to remove it. To be extra careful, I first replace any run of successive commas with a single comma:
Code:
s = list[k]
gsub(/,,+/, ",", s)
sub(/^,/, "", s)
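These two substitutions can be tried on their own; gsub() replaces all matches, while sub() replaces only the first:

```shell
# Collapse runs of commas, then strip the leading one.
echo ',alpha,,beta,gamma' | awk '{ gsub(/,,+/, ","); sub(/^,/, ""); print }'
```

The result is a clean alpha,beta,gamma.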
All that is left is to print the key k, the list of third fields s, and the number of occurrences n:
Code:
printf("%s%s%s%s%d%s", k, OFS, s, OFS, n, ORS)
Note that print k, s, n would produce the exact same output. I wrote the separators out explicitly with printf so that the roles of OFS and ORS stay visible; otherwise, setting them might have seemed confusing.
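You can verify the equivalence directly; with OFS and ORS set, the two statements below print identical lines (the values are made up):

```shell
# printf with explicit OFS/ORS versus plain print: identical output.
awk 'BEGIN {
    OFS = " "; ORS = "\n"
    k = "a b"; s = "1,2"; n = 2
    printf("%s%s%s%s%d%s", k, OFS, s, OFS, n, ORS)
    print k, s, n
}'
```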
That's it. Closing the loop and the END rule completes the script.
Code:
}
}' input-file > output-file
I have the habit of listing one input file and redirecting the output to a file, but that is just for illustration. You can read standard input (no file name arguments), or read multiple files (in which case they are processed in the order they are listed).
If you want to use this as a script, remove everything after the final }, and write the first line as a shebang,
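A typical first line would be the following, though the path to awk varies between systems (so verify it on yours, or use /usr/bin/env awk -f instead):

```shell
#!/usr/bin/awk -f
```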
and you are done.
Note that, according to my tests, the mawk awk variant is significantly faster than GNU awk (gawk). If you have it installed, I recommend invoking mawk explicitly.
Any questions? Any details you'd like me to clarify?
Hope you find this useful,