LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   AWK - How to parse a Web log file to count column and the last occurrence of that column (https://www.linuxquestions.org/questions/linux-newbie-8/awk-how-to-parse-a-web-log-file-to-count-column-and-the-last-occurrence-of-that-column-4175608420/)

Alvin88 06-22-2017 06:40 PM

AWK - How to parse a Web log file to count column and the last occurrence of that column
 
I got a file (web_test.log) with lines like the ones below (of course there are like a million of those lines):

Code:

subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985
nextsubdomain.domain.com - - [01/Jun/2017:00:39:22 -0900] "GET /mystory/past/primaryschool/images/12312314.gif HTTP/1.0" 200 10211267
subdomain.domain.com - - [01/Jun/2017:00:00:14 -0900] "GET /www/var/cool.gif HTTP/1.0" 200 4230310
nextsubdomain.domain.com - - [01/Jun/2017:00:28:12 -0900] "GET /mystory/past/primaryschool/info.html HTTP/1.0" 200 15121283
diffrentsubdomain.domain.com - - [01/Jun/2017:00:29:20 -0400] "GET /www/var/sound/super.gif HTTP/1.0" 200 72666

and I want to extract the domain name (first column), how many times that domain (first column) appears in the file, and the last access time for that domain (or IP address), so the result should look like:

Code:

subdomain.domain.com - 2 - 01/Jun/2017:00:00:14
nextsubdomain.domain.com - 2 - 01/Jun/2017:00:39:22
diffrentsubdomain.domain.com - 1 - 01/Jun/2017:00:29:20

I tried to use awk for that, and I wrote something like:

Code:

awk '{print $1 " " $4;}' web_test.log | sed 's/\[//' |  awk '{IP_ADDRESS[$1]++; } END { for (i in IP_ADDRESS) print i,IP_ADDRESS[i]}' OFS=" - "
which produces output like:

Code:

diffrentsubdomain.domain.com - 1
nextsubdomain.domain.com - 2
subdomain.domain.com - 2

which gives me half of the output. I also tried something like:

Code:

awk '/'"nextsubdomain.domain.com"'/ { lines[last] = $0;} END { print "Last Occurrence: " lines[last]  }' web_test.log
which produces:

Code:

Last Occurrence: nextsubdomain.domain.com - - [01/Jun/2017:00:39:22 -0900] "GET /mystory/past/primaryschool/images/12312314.gif HTTP/1.0" 200 10211267
so I got the second part - but the question is: how do I combine those two into one line?

Is this possible?

I hope I explained this clearly enough...

AwesomeMachine 06-22-2017 06:54 PM

You don't necessarily have to use a one-liner. You can write an awk script and run the script.
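For instance, the counting logic could live in its own file and be run with awk -f. This is only a sketch - the file name count_hosts.awk is made up here, and the sample data is trimmed down from the original post:

```shell
# Two sample lines from the original post:
cat > web_test.log <<'EOF'
subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985
subdomain.domain.com - - [01/Jun/2017:00:00:14 -0900] "GET /www/var/cool.gif HTTP/1.0" 200 4230310
EOF

# count_hosts.awk: count hits per host and remember the last timestamp seen
cat > count_hosts.awk <<'EOF'
{ count[$1]++; last[$1] = substr($4, 2) }   # substr(..., 2) drops the leading "["
END { for (h in count) print h, count[h], last[h] }
EOF

awk -f count_hosts.awk OFS=" - " web_test.log
# → subdomain.domain.com - 2 - 01/Jun/2017:00:00:14
```

Keeping the script in a file also makes it easier to grow it later than editing a long one-liner.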

syg00 06-22-2017 07:28 PM

First things first - you should never pipe from awk to sed (or anything else) and back to awk; awk has all the regex and string processing you will ever need, so it is simple to combine your code into one script. Alternatively, modify FS - it can be a regex matching multiple characters, not just a single one.

And please use plain [code] tags.

Alvin88 06-23-2017 01:51 AM

Many thanks, I am a little bit wiser now, but I still do not know how to make it work for me - sorry.

pan64 06-23-2017 02:45 AM

Yes, first you need to combine that awk|sed|awk chain into one single script.
For example, the sed can be replaced by a gsub() call inside awk, or you can use a better field delimiter:
Code:

awk 'BEGIN { FS="[][ ]*"} ...'
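As a sketch, both options look roughly like this on one sample line from the original post (here with FS="[][ ]+", using + rather than *, so the separator regex cannot match an empty string):

```shell
# One sample line from the original post:
printf '%s\n' 'subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985' > web_test.log

# Option 1: do the sed step inside awk with gsub(), stripping the "[" from $4
awk '{ gsub(/\[/, "", $4); print $1, $4 }' OFS=" - " web_test.log
# → subdomain.domain.com - 01/Jun/2017:00:00:06

# Option 2: let FS treat brackets and spaces as separators, so $4 is already clean
awk 'BEGIN { FS = "[][ ]+" } { print $1, $4 }' OFS=" - " web_test.log
# → subdomain.domain.com - 01/Jun/2017:00:00:06
```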

Alvin88 06-23-2017 03:27 AM

Sorry to say, I still cannot find the solution - can you point me in the right direction?

I can do something like:
Code:

awk '{print $1,substr($4,2)}' OFS=" - " web_test.log
which produces:
Code:

subdomain.domain.com - 01/Jun/2017:00:00:06
subdomain.domain.com - 01/Jun/2017:00:00:14
nextsubdomain.domain.com - 01/Jun/2017:00:28:12
nextsubdomain.domain.com - 01/Jun/2017:00:39:22
diffrentsubdomain.domain.com - 01/Jun/2017:00:29:20

to remove the whole awk|sed|awk chain (many thanks for that!), but combining part one (selecting the necessary fields) with part two (finding the last occurrence) in one line (or in some other way) - I simply do not know how to do this...

Yes, I want to learn, but most likely I have to go through awk and sed properly from the beginning; this feels like running, and I have to learn how to walk first.

pan64 06-23-2017 03:50 AM

Code:

awk 'BEGIN { FS="[][ ]*"}
    /'"nextsubdomain.domain.com"'/ { lines[last] = $0;}
    { IP_ADDRESS[$1]++;
      IP_DATE[$1] = $4;
      IP_LAST[$1] = $6 }
    END { for (i in IP_ADDRESS) print i,IP_ADDRESS[i], IP_DATE[i], IP_LAST[i];
          print "Last Occurrence: " lines[last]}'      # or you can use this

You can combine things like this. It is not tested and probably not exactly what you need, but it shows a way you can follow (take it as an example).
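A tested variant of the same idea, assuming the goal is host, count, and last access time (the per-host arrays make a separate lines[last] match unnecessary; the output is piped through sort because for (h in ...) iterates in unspecified order):

```shell
# Sample lines from the original post:
cat > web_test.log <<'EOF'
subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985
subdomain.domain.com - - [01/Jun/2017:00:00:14 -0900] "GET /www/var/cool.gif HTTP/1.0" 200 4230310
diffrentsubdomain.domain.com - - [01/Jun/2017:00:29:20 -0400] "GET /www/var/sound/super.gif HTTP/1.0" 200 72666
EOF

# One pass: bracket-aware FS, a counter per host, and the last date seen per host
awk 'BEGIN { FS = "[][ ]+"; OFS = " - " }
     { count[$1]++; lastdate[$1] = $4 }
     END { for (h in count) print h, count[h], lastdate[h] }' web_test.log | sort
# → diffrentsubdomain.domain.com - 1 - 01/Jun/2017:00:29:20
# → subdomain.domain.com - 2 - 01/Jun/2017:00:00:14
```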

syg00 06-23-2017 04:37 AM

I think I presumed you had more (awk) knowledge than you do - not your fault, we all have to learn. Commands can be combined as pan64 shows, separated by semicolons; the {} braces enclose a (related) block of commands.
The awk documentation is very good, but it is a reference, not a teaching book. Doing is the best teacher, I find.

Turbocapitalist 06-23-2017 05:08 AM

Though not a direct answer to your immediate question: you could make future parsing easier by adjusting your log format; I have sometimes done that in the past. On Apache2, there is no reason you cannot set the CustomLog directive to use a friendlier format, say using tabs and an ISO-8601 date format:

Code:

LogFormat "%h\t%l\t%u\t%{%Y-%m-%d %H:%M:%S}t\t%r\t%>s\t%b" custom
CustomLog "logs/access_log" custom

The same principle applies to nginx and lighttpd, though the details are different.
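With a tab-separated, ISO-dated format like that, the awk side becomes trivial. A sketch, using a made-up line that mirrors the format above:

```shell
# One hypothetical line in the tab-separated custom format:
printf 'subdomain.domain.com\t-\t-\t2017-06-01 00:00:06\tGET /www/var/index.html HTTP/1.0\t200\t323985\n' > access_log

# Tabs make the field split unambiguous, and ISO-8601 dates sort lexically
awk -F'\t' '{ print $1, $4 }' OFS=" - " access_log
# → subdomain.domain.com - 2017-06-01 00:00:06
```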

Alvin88 06-23-2017 05:53 AM

Many thanks to all you guys for your help and input on this question.

Seems to me like I have an answer - please have a look below:

Test data (as cat web_test.log):
Code:

subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985
subdomain.domain.com - - [01/Jun/2017:00:00:14 -0900] "GET /www/var/cool.gif HTTP/1.0" 200 4230310
nextsubdomain.domain.com - - [01/Jun/2017:00:28:12 -0900] "GET /mystory/past/primaryschool/info.html HTTP/1.0" 200 15121283
nextsubdomain.domain.com - - [01/Jun/2017:00:39:22 -0900] "GET /mystory/past/primaryschool/images/12312314.gif HTTP/1.0" 200 10211267
diffrentsubdomain.domain.com - - [01/Jun/2017:00:29:20 -0400] "GET /www/var/sound/super.gif HTTP/1.0" 200 72666

Solution command:
Code:

awk '{ lines[last] = $0;} { IP_ADDRESS[$1]++; IP_LAST[$1] = substr($4,2) } END { for (i in IP_ADDRESS) print i,IP_ADDRESS[i], IP_LAST[i]};' OFS=" - " web_test.log | sort -n -k3 | column -t | tail -29
Solution - output:
Code:

diffrentsubdomain.domain.com  -  1  -  01/Jun/2017:00:29:20
nextsubdomain.domain.com      -  2  -  01/Jun/2017:00:39:22
subdomain.domain.com          -  2  -  01/Jun/2017:00:00:14

Picture - screen shot - attached. Hope it helps a bit.


Once more - many thanks to all of you for your input, help, and suggestions, and for pointing me in the right direction.

Greatly appreciated.

pan64 06-23-2017 05:59 AM

That is great. But it looks like { lines[last] = $0; } is never used - you can remove it.
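With that dead assignment dropped, a sketch of the cleaned-up command, rerun against the sample data from this thread (the tail -29 is left out, assuming it was only trimming output; sort -n -k3 orders by the request count):

```shell
# Sample data from earlier in the thread:
cat > web_test.log <<'EOF'
subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985
subdomain.domain.com - - [01/Jun/2017:00:00:14 -0900] "GET /www/var/cool.gif HTTP/1.0" 200 4230310
nextsubdomain.domain.com - - [01/Jun/2017:00:28:12 -0900] "GET /mystory/past/primaryschool/info.html HTTP/1.0" 200 15121283
nextsubdomain.domain.com - - [01/Jun/2017:00:39:22 -0900] "GET /mystory/past/primaryschool/images/12312314.gif HTTP/1.0" 200 10211267
diffrentsubdomain.domain.com - - [01/Jun/2017:00:29:20 -0400] "GET /www/var/sound/super.gif HTTP/1.0" 200 72666
EOF

# Count per host, keep the last timestamp, sort by count, align columns
awk '{ count[$1]++; last[$1] = substr($4, 2) }
     END { for (h in count) print h, count[h], last[h] }' OFS=" - " web_test.log |
  sort -n -k3 | column -t
```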

