LinuxQuestions.org


shivaa 10-18-2012 02:20 AM

Explain the awk syntax
 
Suppose you're using the following awk code to filter unique content (i.e. lines) from a file:
awk '!_[$0]++' filename.txt
Could anybody explain this awk code? What do $0 and ++ do here? What does !_ do?

kabamaru 10-18-2012 07:22 AM

"_" is actually an array name, although a little cryptic. It could easily be "myarray" or any other valid name.
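To illustrate that the name is arbitrary, here is a small sketch that runs the same filter twice, once with "_" and once with the array renamed to "seen". The name "seen" and the three-line input are made up for this example.

```shell
# Same filter, with the array "_" renamed to "seen" (illustrative name).
input='a
b
a'
cryptic=$(printf '%s\n' "$input" | awk '!_[$0]++')
readable=$(printf '%s\n' "$input" | awk '!seen[$0]++')
printf '%s\n' "$readable"   # first occurrences only, same as the cryptic form
```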

Arrays in AWK are associative, i.e. they consist of key-value pairs. You access the value of an element through its key. If an array doesn't exist, it is created on first use. If you reference a key that doesn't exist, it is created. If a key is not associated with a value, that value is "" (the empty string), which also counts as 0 in a numeric context.
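A quick way to see this for yourself is a BEGIN-only sketch like the one below. The array name "arr" and the key "x" are just placeholders for the demonstration.

```shell
# Referencing a key that does not exist creates it with an empty value.
out=$(awk 'BEGIN {
    if (arr["x"] == "") print "empty string"   # unset value compares equal to ""
    if (arr["x"] == 0)  print "zero"           # and it also compares equal to 0
    n = 0
    for (k in arr) n++                         # count the keys now present
    print n " key(s)"
}')
printf '%s\n' "$out"
```

The key "x" was never assigned, yet the loop finds it, because the lookup in the first "if" already created it.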

"_[$0]" looks up the key $0 (the entire current line) and tests whether its value is zero (or the empty string) or nonzero (or a nonempty string). The first case is false, the second is true. Awk performs an action (if none is specified, the default is "print $0") when the test is true. The leading "!" in "!_[$0]" negates the test, so the action (print $0) runs only when the lookup itself is false.
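The pattern-without-action rule can be tried in isolation with constant patterns. This is only a sketch on a made-up two-line input:

```shell
# A rule with only a pattern prints the line whenever the pattern is true.
all=$(printf 'x\ny\n' | awk '1')     # always true: every line is printed
none=$(printf 'x\ny\n' | awk '0')    # always false: nothing is printed
neg=$(printf 'x\ny\n' | awk '!0')    # negated false: every line is printed
printf 'all:\n%s\nnone:\n%s\nneg:\n%s\n' "$all" "$none" "$neg"
```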

The "++" adds 1 to the value associated with the key, AFTER the value has been returned to AWK.
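Here is a minimal sketch of that ordering, using a plain variable instead of an array element. The names "n", "old", "first" and "later" are made up for the example.

```shell
# Post-increment hands back the old value first, then adds 1.
out=$(awk 'BEGIN {
    n = 0
    old = n++          # old receives 0, then n becomes 1
    print old, n
    first = !old       # negating the old value (0) gives 1, i.e. true
    later = !n         # negating the new value (1) gives 0, i.e. false
    print first, later
}')
printf '%s\n' "$out"
```

This is why the first occurrence of a line passes the "!" test even though its count ends up at 1.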

Confusing? To make this clearer, let's say we have a file with the contents below:

Code:

john
mary
paul
mary
john
john
phil
paul

AWK reads the first line ("john"). It creates an array named "_" and a key "john" ($0). Because we don't assign a value to _["john"]:

_["john"] = ""
"" evaluates to false, so the pattern returns false
add 1 to _["john"], so now _["john"] = 1

The first line will be printed because it returned false (remember the "!").

Here's a visualization of AWK parsing each line:

Code:

_["john"] is 0 (or "")    return false    _["john"] = 0 + 1 = 1
_["mary"] is 0 (or "")    return false    _["mary"] = 0 + 1 = 1
_["paul"] is 0 (or "")    return false    _["paul"] = 0 + 1 = 1
_["mary"] is 1            return true     _["mary"] = 1 + 1 = 2
_["john"] is 1            return true     _["john"] = 1 + 1 = 2
_["john"] is 2            return true     _["john"] = 2 + 1 = 3
_["phil"] is 0 (or "")    return false    _["phil"] = 0 + 1 = 1
_["paul"] is 1            return true     _["paul"] = 1 + 1 = 2
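If you want awk to produce a trace like this itself, something along these lines should work. It is only a sketch: the variables "before" and "printed" and the "print"/"skip" labels are my own additions, and the sample names are fed in through printf instead of a file.

```shell
# For each line show: the key, its count before the test, the outcome, the count after.
trace=$(printf '%s\n' john mary paul mary john john phil paul | awk '{
    before = _[$0] + 0        # force an unset "" to display as 0
    printed = !_[$0]++        # same expression as in the one-liner
    print $0, before, (printed ? "print" : "skip"), _[$0]
}')
printf '%s\n' "$trace"
```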

A line returns false only the first time it occurs, so only then is it printed:
Code:

john
mary
paul
phil
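To check the result without creating a file, the sample lines can be piped straight into the one-liner. This is a sketch using printf in place of the input file:

```shell
# Feed the eight sample lines to the filter and keep the result in a variable.
unique=$(printf '%s\n' john mary paul mary john john phil paul | awk '!_[$0]++')
printf '%s\n' "$unique"
```

Note that the surviving lines keep the order of their first appearance in the input.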


David the H. 10-20-2012 10:45 AM

You can find another explanation for it here, as entry #43:

http://www.catonmat.net/blog/awk-one...ined-part-two/

The whole series is very educational.


All times are GMT -5.