Explain the awk syntax
Suppose you're using the following awk code to filter unique content (i.e. lines) from a file:
awk '!_[$0]++' <filename.txt>
Could anybody explain this awk code? What do $0 and ++ do here? What does !_ do?
"_" is actually an array name, although a little cryptic. It could easily be "myarray" or any other valid name.
Arrays in AWK are associative, i.e. they consist of key-value pairs. You access the value of an element through its key. If you refer to an array that doesn't exist, it is created. If you refer to a key that doesn't exist, it is created too, and its value is "" (the empty string), which also counts as 0 in a numeric context.
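The array behaviour described above can be checked with a small sketch. The array name "seen" and the key "john" are only illustrative choices, and the "in" operator is used here because it tests for a key without creating it.

```shell
# "seen" is a hypothetical array name; "john" is just a sample key.
result=$(awk 'BEGIN {
    # "in" checks for a key without creating it
    if (!("john" in seen))   print "not a key yet"
    # merely reading seen["john"] creates the key with an empty value
    if (seen["john"] == "")  print "reads as empty string"
    if (seen["john"] == 0)   print "reads as zero"
    if ("john" in seen)      print "now a key"
}')
printf '%s\n' "$result"
```

All four messages should be printed, in that order.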
"_[$0]" checks if the key $0 (the entire current line) is associated with a zero (or empty string), or nonzero value (or nonempty string). The first case returns false, while the second returns true. Awk will perform an action (if not specified, the default is "print $0") when the test returns true. The leading "!" in "!_[$0]" negates the behavior; it will perform the action (print $0) only when the test returns false.
The "++" adds 1 to the value associated with the key, AFTER the old value has been handed to the test. Every later occurrence of the same line therefore finds a nonzero count and is not printed.
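To see the pieces separately, the one-liner can be spelled out in long form. This is only a sketch of equivalent logic, not how awk evaluates it internally; "seen", "count" and sample.txt are made-up names for the illustration.

```shell
# sample.txt is a hypothetical input file created just for this sketch
printf 'john\npaul\njohn\n' > sample.txt

unique=$(awk '{
    count = seen[$0]       # value BEFORE the increment (empty on first sight)
    seen[$0] = count + 1   # what the trailing ++ does afterwards
    if (!count)            # the leading ! : true only on first sight
        print $0           # the default action
}' sample.txt)
printf '%s\n' "$unique"

# The original one-liner, for comparison; it should print the same lines
awk '!_[$0]++' sample.txt
```

Both commands should print john and paul once each.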
Confusing? To make this clearer, let's say we have a file with the contents below:
_["john"] = ""
"" evaluates to false, so the pattern returns false
add 1 to _["john"], so now _["john"] = 1
The first line will be printed because it returned false (remember the "!").
Here's a visualization of AWK parsing each line:
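One rough way to produce such a trace yourself is to let awk report the stored count before and after each test. The input file, the array name "seen" and the variable names are assumptions made for this sketch.

```shell
# sample.txt is a hypothetical input file with one repeated line
printf 'john\npaul\njohn\n' > sample.txt

trace=$(awk '{
    before = seen[$0] + 0                         # adding 0 shows an empty value as 0
    verdict = (!seen[$0]++) ? "print" : "skip"    # same test as the one-liner
    printf "%s: before=%d after=%d -> %s\n", $0, before, seen[$0], verdict
}' sample.txt)
printf '%s\n' "$trace"
# john: before=0 after=1 -> print
# paul: before=0 after=1 -> print
# john: before=1 after=2 -> skip
```

The second "john" is skipped because its count is already 1 when the test runs.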
You can find another explanation for it here, as entry #43:
The whole series is very educational