LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   AWK - How to parse a Web log file to count column and the last occurrence of that column (https://www.linuxquestions.org/questions/linux-newbie-8/awk-how-to-parse-a-web-log-file-to-count-column-and-the-last-occurrence-of-that-column-4175608420/)

Alvin88 06-22-2017 06:40 PM

AWK - How to parse a Web log file to count column and the last occurrence of that column
 
I got a file (web_test.log) with lines like the ones below (of course there are like a million of those lines):

Code:

subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985
nextsubdomain.domain.com - - [01/Jun/2017:00:39:22 -0900] "GET /mystory/past/primaryschool/images/12312314.gif HTTP/1.0" 200 10211267
subdomain.domain.com - - [01/Jun/2017:00:00:14 -0900] "GET /www/var/cool.gif HTTP/1.0" 200 4230310
nextsubdomain.domain.com - - [01/Jun/2017:00:28:12 -0900] "GET /mystory/past/primaryschool/info.html HTTP/1.0" 200 15121283
diffrentsubdomain.domain.com - - [01/Jun/2017:00:29:20 -0400] "GET /www/var/sound/super.gif HTTP/1.0" 200 72666

and I want to extract the domain name (first column), how many times that domain (first column) appears in the file, and the last access time for that domain (or IP address), so the result should look like:

Code:

subdomain.domain.com - 2 - 01/Jun/2017:00:00:14
nextsubdomain.domain.com - 2 - 01/Jun/2017:00:39:22
diffrentsubdomain.domain.com - 1 - 01/Jun/2017:00:29:20

I tried to use awk for that, and I wrote something like:

Code:

awk '{print $1 " " $4;}' web_test.log | sed 's/\[//' |  awk '{IP_ADDRESS[$1]++; } END { for (i in IP_ADDRESS) print i,IP_ADDRESS[i]}' OFS=" - "
which produces output like:

Code:

diffrentsubdomain.domain.com - 1
nextsubdomain.domain.com - 2
subdomain.domain.com - 2

which gives me half of the output. I also tried something like:

Code:

awk '/'"nextsubdomain.domain.com"'/ { lines[last] = $0;} END { print "Last Occurrence: " lines[last]  }' web_test.log
which produces:

Code:

Last Occurrence: nextsubdomain.domain.com - - [01/Jun/2017:00:39:22 -0900] "GET /mystory/past/primaryschool/images/12312314.gif HTTP/1.0" 200 10211267
so I got the second part - but the question is: how do I combine those two into one line?

Is this possible?

I hope I explained this clearly enough...

AwesomeMachine 06-22-2017 06:54 PM

You don't necessarily have to use a one-liner. You can write an awk script and run the script.
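For instance, the counting logic could live in its own file and be run with awk -f. This is only a sketch - the file name count_hosts.awk is made up here, and the sample data is trimmed down from the original post:

```shell
# Two sample lines from the original post:
cat > web_test.log <<'EOF'
subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985
subdomain.domain.com - - [01/Jun/2017:00:00:14 -0900] "GET /www/var/cool.gif HTTP/1.0" 200 4230310
EOF

# count_hosts.awk: count hits per host and remember the last timestamp seen
cat > count_hosts.awk <<'EOF'
{ count[$1]++; last[$1] = substr($4, 2) }   # substr(..., 2) drops the leading "["
END { for (h in count) print h, count[h], last[h] }
EOF

awk -f count_hosts.awk OFS=" - " web_test.log
# → subdomain.domain.com - 2 - 01/Jun/2017:00:00:14
```

Keeping the script in a file also makes it easier to grow it later than editing a long one-liner.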

syg00 06-22-2017 07:28 PM

First things first - you should never pipe from awk to sed (or anything else) and back to awk; awk has all the regex and string processing you will ever need, so it is simple to combine your code into one script. Alternatively, modify FS - it can be a regex matching multiple characters, not just a single one.

And please use plain [code] tags.

Alvin88 06-23-2017 01:51 AM

Many thanks, I am a little bit wiser now, but I still do not know how to make it work for me - sorry.

pan64 06-23-2017 02:45 AM

Yes, first you need to combine that awk|sed|awk chain into one single script.
For example, the sed can be replaced by a gsub() call inside awk, or you can use a better field delimiter:
Code:

awk 'BEGIN { FS="[][ ]*"} ...'
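As a sketch, both options look roughly like this on one sample line from the original post (here with FS="[][ ]+", using + rather than *, so the separator regex cannot match an empty string):

```shell
# One sample line from the original post:
printf '%s\n' 'subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985' > web_test.log

# Option 1: do the sed step inside awk with gsub(), stripping the "[" from $4
awk '{ gsub(/\[/, "", $4); print $1, $4 }' OFS=" - " web_test.log
# → subdomain.domain.com - 01/Jun/2017:00:00:06

# Option 2: let FS treat brackets and spaces as separators, so $4 is already clean
awk 'BEGIN { FS = "[][ ]+" } { print $1, $4 }' OFS=" - " web_test.log
# → subdomain.domain.com - 01/Jun/2017:00:00:06
```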

Alvin88 06-23-2017 03:27 AM

Sorry to say, I still cannot find the solution - can you point me in the right direction?

I can do something like:
Code:

awk '{print $1,substr($4,2)}' OFS=" - " web_test.log
which produces:
Code:

subdomain.domain.com - 01/Jun/2017:00:00:06
subdomain.domain.com - 01/Jun/2017:00:00:14
nextsubdomain.domain.com - 01/Jun/2017:00:28:12
nextsubdomain.domain.com - 01/Jun/2017:00:39:22
diffrentsubdomain.domain.com - 01/Jun/2017:00:29:20

to remove the whole awk|sed|awk chain (many thanks for that!), but combining part one (selecting the necessary fields) with part two (finding the last occurrence) in one line (or in some other way) - I simply do not know how to do this...

Yes, I want to learn, but most likely I have to go through awk and sed properly from the beginning; this feels like running, and I have to learn how to walk first.

pan64 06-23-2017 03:50 AM

Code:

awk 'BEGIN { FS="[][ ]*"}
    /'"nextsubdomain.domain.com"'/ { lines[last] = $0;}
    { IP_ADDRESS[$1]++;
      IP_DATE[$1] = $4;
      IP_LAST[$1] = $6 }
    END { for (i in IP_ADDRESS) print i,IP_ADDRESS[i], IP_DATE[i], IP_LAST[i];
          print "Last Occurrence: " lines[last]}'      # or you can use this

You can combine things like this. It is not tested and probably not exactly what you need, but it shows a way you can follow (take it as an example).
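A tested variant of the same idea, assuming the goal is host, count, and last access time (the per-host arrays make a separate lines[last] match unnecessary; the output is piped through sort because for (h in ...) iterates in unspecified order):

```shell
# Sample lines from the original post:
cat > web_test.log <<'EOF'
subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985
subdomain.domain.com - - [01/Jun/2017:00:00:14 -0900] "GET /www/var/cool.gif HTTP/1.0" 200 4230310
diffrentsubdomain.domain.com - - [01/Jun/2017:00:29:20 -0400] "GET /www/var/sound/super.gif HTTP/1.0" 200 72666
EOF

# One pass: bracket-aware FS, a counter per host, and the last date seen per host
awk 'BEGIN { FS = "[][ ]+"; OFS = " - " }
     { count[$1]++; lastdate[$1] = $4 }
     END { for (h in count) print h, count[h], lastdate[h] }' web_test.log | sort
# → diffrentsubdomain.domain.com - 1 - 01/Jun/2017:00:29:20
# → subdomain.domain.com - 2 - 01/Jun/2017:00:00:14
```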

syg00 06-23-2017 04:37 AM

I think I presumed you had more (awk) knowledge than you do - not your fault, we all have to learn. Commands can be combined as pan64 shows, separated by semicolons; the {} braces enclose a (related) block of commands.
The awk documentation is very good, but it is a reference, not a teaching book. Doing is the best teacher, I find.

Turbocapitalist 06-23-2017 05:08 AM

Though not a direct answer to your immediate question: you could make future parsing easier by adjusting your log format; I have sometimes done that in the past. On Apache2, there is no reason you cannot set the CustomLog directive to use a friendlier format, say using tabs and an ISO-8601 date format:

Code:

LogFormat "%h\t%l\t%u\t%{%Y-%m-%d %H:%M:%S}t\t%r\t%>s\t%b" custom
CustomLog "logs/access_log" custom

The same principle applies to nginx and lighttpd, though the details are different.
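With a tab-separated, ISO-dated format like that, the awk side becomes trivial. A sketch, using a made-up line that mirrors the format above:

```shell
# One hypothetical line in the tab-separated custom format:
printf 'subdomain.domain.com\t-\t-\t2017-06-01 00:00:06\tGET /www/var/index.html HTTP/1.0\t200\t323985\n' > access_log

# Tabs make the field split unambiguous, and ISO-8601 dates sort lexically
awk -F'\t' '{ print $1, $4 }' OFS=" - " access_log
# → subdomain.domain.com - 2017-06-01 00:00:06
```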

Alvin88 06-23-2017 05:53 AM

Many thanks to all you guys for your help and input on this question.

Seems to me like I have an answer - please have a look below:

Test data (as cat web_test.log):
Code:

subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985
subdomain.domain.com - - [01/Jun/2017:00:00:14 -0900] "GET /www/var/cool.gif HTTP/1.0" 200 4230310
nextsubdomain.domain.com - - [01/Jun/2017:00:28:12 -0900] "GET /mystory/past/primaryschool/info.html HTTP/1.0" 200 15121283
nextsubdomain.domain.com - - [01/Jun/2017:00:39:22 -0900] "GET /mystory/past/primaryschool/images/12312314.gif HTTP/1.0" 200 10211267
diffrentsubdomain.domain.com - - [01/Jun/2017:00:29:20 -0400] "GET /www/var/sound/super.gif HTTP/1.0" 200 72666

Solution command:
Code:

awk '{ lines[last] = $0;} { IP_ADDRESS[$1]++; IP_LAST[$1] = substr($4,2) } END { for (i in IP_ADDRESS) print i,IP_ADDRESS[i], IP_LAST[i]};' OFS=" - " web_test.log | sort -n -k3 | column -t | tail -29
Solution - output:
Code:

diffrentsubdomain.domain.com  -  1  -  01/Jun/2017:00:29:20
nextsubdomain.domain.com      -  2  -  01/Jun/2017:00:39:22
subdomain.domain.com          -  2  -  01/Jun/2017:00:00:14

Picture - screen shot - attached. Hope it helps a bit.


Once more - many thanks to all of you for your input, help, and suggestions, and for pointing me in the right direction.

Greatly appreciated.

pan64 06-23-2017 05:59 AM

That is great. But it looks like { lines[last] = $0; } is never used - you can remove it.
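With that dead assignment dropped, a sketch of the cleaned-up command, rerun against the sample data from this thread (the tail -29 is left out, assuming it was only trimming output; sort -n -k3 orders by the request count):

```shell
# Sample data from earlier in the thread:
cat > web_test.log <<'EOF'
subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985
subdomain.domain.com - - [01/Jun/2017:00:00:14 -0900] "GET /www/var/cool.gif HTTP/1.0" 200 4230310
nextsubdomain.domain.com - - [01/Jun/2017:00:28:12 -0900] "GET /mystory/past/primaryschool/info.html HTTP/1.0" 200 15121283
nextsubdomain.domain.com - - [01/Jun/2017:00:39:22 -0900] "GET /mystory/past/primaryschool/images/12312314.gif HTTP/1.0" 200 10211267
diffrentsubdomain.domain.com - - [01/Jun/2017:00:29:20 -0400] "GET /www/var/sound/super.gif HTTP/1.0" 200 72666
EOF

# Count per host, keep the last timestamp, sort by count, align columns
awk '{ count[$1]++; last[$1] = substr($4, 2) }
     END { for (h in count) print h, count[h], last[h] }' OFS=" - " web_test.log |
  sort -n -k3 | column -t
```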

