LinuxQuestions.org
Old 06-22-2017, 07:40 PM   #1
Alvin88
LQ Newbie
 
Registered: Mar 2012
Posts: 18

Rep: Reputation: Disabled
AWK - How to parse a Web log file to count column and the last occurrence of that column


I have a file (web_test.log) with lines like the ones below (the real file has something like a million of them):

Code:
subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985
nextsubdomain.domain.com - - [01/Jun/2017:00:39:22 -0900] "GET /mystory/past/primaryschool/images/12312314.gif HTTP/1.0" 200 10211267
subdomain.domain.com - - [01/Jun/2017:00:00:14 -0900] "GET /www/var/cool.gif HTTP/1.0" 200 4230310
nextsubdomain.domain.com - - [01/Jun/2017:00:28:12 -0900] "GET /mystory/past/primaryschool/info.html HTTP/1.0" 200 15121283
diffrentsubdomain.domain.com - - [01/Jun/2017:00:29:20 -0400] "GET /www/var/sound/super.gif HTTP/1.0" 200 72666
I want to extract the domain name (first column), how many times that domain appears in the file, and the last time that domain (or IP address) was accessed, so the result should look like:

Code:
subdomain.domain.com - 2 - 01/Jun/2017:00:00:14
nextsubdomain.domain.com - 2 - 01/Jun/2017:00:39:22
diffrentsubdomain.domain.com - 1 - 01/Jun/2017:00:29:20
I tried to use awk for that, and I wrote something like:

Code:
awk '{print $1 " " $4;}' web_test.log | sed 's/\[//' |  awk '{IP_ADDRESS[$1]++; } END { for (i in IP_ADDRESS) print i,IP_ADDRESS[i]}' OFS=" - "
produces output like:

Code:
diffrentsubdomain.domain.com - 1
nextsubdomain.domain.com - 2
subdomain.domain.com - 2
which gives me half of the output. I also tried something like:

Code:
awk '/'"nextsubdomain.domain.com"'/ { lines[last] = $0;} END { print "Last Occurrence: " lines[last]  }' web_test.log
produces output:

Code:
Last Occurrence: nextsubdomain.domain.com - - [01/Jun/2017:00:39:22 -0900] "GET /mystory/past/primaryschool/images/12312314.gif HTTP/1.0" 200 10211267
so I have the second part too, but the question is: how do I combine those two into one line?

Is this possible?

I hope I have explained this clearly...

Last edited by Alvin88; 06-23-2017 at 02:53 AM. Reason: spelling; tags
 
Old 06-22-2017, 07:54 PM   #2
AwesomeMachine
LQ Guru
 
Registered: Jan 2005
Location: USA and Italy
Distribution: Debian testing/sid; OpenSuSE; Fedora; Mint
Posts: 5,524

Rep: Reputation: 1015
You don't necessarily have to use a one-liner. You can write an awk script and run the script.
 
Old 06-22-2017, 08:28 PM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,008

Rep: Reputation: 4099
First things first - you should never pipe from awk to sed (or anything else) and back to awk. awk has all the regex and string processing you will ever need, so it is simple to combine your code into one program. Or modify FS (the field separator) - it can be a regular expression, not just a single character.

And please use plain [code] tags.
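
As a rough sketch of that advice (not the thread's final answer): the sed step can be folded into awk with sub(), which edits a field in place. The two sample lines are taken from the first post; the temporary file and the variable names `log` and `out` are made up for this illustration.

```shell
# Two sample lines from the first post, written to a temporary file.
log=$(mktemp)
cat > "$log" <<'EOF'
subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985
subdomain.domain.com - - [01/Jun/2017:00:00:14 -0900] "GET /www/var/cool.gif HTTP/1.0" 200 4230310
EOF

# One awk process: sub() strips the leading "[" from field 4,
# which is the job sed did in the awk | sed | awk pipeline.
out=$(awk '{ sub(/^\[/, "", $4); print $1, $4 }' "$log")
printf '%s\n' "$out"
rm -f "$log"
```

This should print the host and the bare timestamp for each line, with no second or third process involved.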
 
Old 06-23-2017, 02:51 AM   #4
Alvin88
LQ Newbie
 
Registered: Mar 2012
Posts: 18

Original Poster
Rep: Reputation: Disabled
Many thanks, I am a little bit wiser now, but I still do not know how to make it work for me - sorry.
 
Old 06-23-2017, 03:45 AM   #5
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,068

Rep: Reputation: 7128
Yes, first you need to combine that awk|sed|awk chain into one single script.
For example, sed can be replaced by a gsub call (inside awk), or you can use a better delimiter:
Code:
awk 'BEGIN { FS="[][ ]+"} ...'
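The pattern "[][ ]" is a bracket expression that matches "]", "[" or a space. A quick way to see what a separator like that does is to print a few fields from one of the sample lines. This is only a sketch: it uses "+" (one or more) so that the separator can never match an empty string, and the variable names `line` and `fields` are made up for the illustration.

```shell
# One sample line from the thread, fed through a bracket-aware field separator.
line='subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985'

# FS is a regular expression here: one or more of "]", "[" or space.
# Print the field count, the host ($1) and the timestamp ($4).
fields=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[][ ]+" } { print NF; print $1; print $4 }')
printf '%s\n' "$fields"
```

The timestamp arrives in $4 without its "[", so no sed or substr step is needed afterwards.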
 
Old 06-23-2017, 04:27 AM   #6
Alvin88
LQ Newbie
 
Registered: Mar 2012
Posts: 18

Original Poster
Rep: Reputation: Disabled
Sorry to say, but I cannot find the solution - can you point me in the right direction?

I can do something like:
Code:
awk '{print $1,substr($4,2)}' OFS=" - " web_test.log
which produces:
Code:
subdomain.domain.com - 01/Jun/2017:00:00:06
subdomain.domain.com - 01/Jun/2017:00:00:14
nextsubdomain.domain.com - 01/Jun/2017:00:28:12
nextsubdomain.domain.com - 01/Jun/2017:00:39:22
diffrentsubdomain.domain.com - 01/Jun/2017:00:29:20
That removes the whole awk|sed|awk chain (many thanks for that!), but combining part one (selecting the necessary fields) with part two (finding the last occurrence), in one line or in some other way, is something I simply do not know how to do...

Yes, I want to learn, but most likely I have to go through awk and sed properly from the beginning. This feels like running when I have yet to learn how to walk properly.
 
Old 06-23-2017, 04:50 AM   #7
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,068

Rep: Reputation: 7128
Code:
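awk 'BEGIN { FS="[][ ]+" }    # split on runs of spaces and square brackets
     /nextsubdomain\.domain\.com/ { lines[last] = $0 }    # remember the last matching line
     { IP_ADDRESS[$1]++       # count the requests per host
       IP_DATE[$1] = $4       # timestamp of the most recent line read for this host
       IP_LAST[$1] = $6 }     # sixth field of that line (the request method, with its quote)
     END { for (i in IP_ADDRESS) print i, IP_ADDRESS[i], IP_DATE[i], IP_LAST[i]
           print "Last Occurrence: " lines[last] }' web_test.log      # or you can use this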
You can combine things like this. It is not tested and probably not exactly what you need, but it is an approach you can follow (take it as an example).
 
Old 06-23-2017, 05:37 AM   #8
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,008

Rep: Reputation: 4099
I think I presumed you had more (awk) knowledge than you do - not your fault, we all have to learn. Commands can be combined as pan64 shows, separated by a semicolon (or a newline). The {} braces enclose a (related) block of commands.
The awk doco is very good, but is a reference, not a teaching book. Doing is the best teacher I find.
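
To make the "pattern { block }" shape concrete, here is a tiny made-up example (the input lines and array names are invented, not from the log). Every rule is tried against every input line, and END runs once after the last line.

```shell
# Three made-up input lines; the first word plays the role of the host column.
summary=$(printf '%s\n' 'a x' 'b y' 'a z' | awk '
    { count[$1]++; last[$1] = $2 }    # one block, two statements split by ";"
    $1 == "a" { hits++ }              # a second rule with its own pattern
    END { print "a", count["a"], last["a"], hits }
')
printf '%s\n' "$summary"
```

The first rule has no pattern, so it runs for all three lines; the second only runs for lines whose first field is "a".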
 
Old 06-23-2017, 06:08 AM   #9
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,085
Blog Entries: 3

Rep: Reputation: 3665
Though not a direct answer to your question at the moment, you could make future parsing easy by adjusting your log format. I have sometimes done that in the past. On Apache2, there's no reason you cannot set the CustomLog directive to use a better format, say using tabs and an ISO-8601 date format:

Code:
LogFormat "%h\t%l\t%u\t%{%Y-%m-%d %H:%M:%S}t\t%r\t%>s\t%b" custom
CustomLog "logs/access_log" custom
The same principle applies to nginx and lighttpd, though the details are different.
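
A sketch of why such a format is easier to parse, assuming the tab-separated layout of the LogFormat line above (host, ident, user, timestamp, request, status, bytes). The two log lines here are hypothetical, rewritten from the thread's sample data, and the variable names are made up.

```shell
# Hypothetical log lines in the tab-separated layout from the LogFormat above:
# host, ident, user, timestamp, request, status, bytes.
tsv=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    subdomain.domain.com - - '2017-06-01 00:00:14' 'GET /www/var/cool.gif HTTP/1.0' 200 4230310 \
    subdomain.domain.com - - '2017-06-01 00:00:06' 'GET /www/var/index.html HTTP/1.0' 200 323985 \
    > "$tsv"

# With tabs as the separator, the timestamp is simply field 4, and ISO-style
# timestamps compare correctly as plain strings, so ">" finds the latest one.
report=$(awk -F '\t' '
    { n[$1]++; if ($4 > latest[$1]) latest[$1] = $4 }
    END { for (h in n) print h, n[h], latest[h] }
' OFS=' - ' "$tsv")
printf '%s\n' "$report"
rm -f "$tsv"
```

Because the comparison is on the timestamp itself, the result does not depend on the order of the lines in the file.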
 
Old 06-23-2017, 06:53 AM   #10
Alvin88
LQ Newbie
 
Registered: Mar 2012
Posts: 18

Original Poster
Rep: Reputation: Disabled
Many thanks to all of you for your help and input on this question.

Seems to me like I have an answer - please have a look below:

Test data (as cat web_test.log):
Code:
subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985
subdomain.domain.com - - [01/Jun/2017:00:00:14 -0900] "GET /www/var/cool.gif HTTP/1.0" 200 4230310
nextsubdomain.domain.com - - [01/Jun/2017:00:28:12 -0900] "GET /mystory/past/primaryschool/info.html HTTP/1.0" 200 15121283
nextsubdomain.domain.com - - [01/Jun/2017:00:39:22 -0900] "GET /mystory/past/primaryschool/images/12312314.gif HTTP/1.0" 200 10211267
diffrentsubdomain.domain.com - - [01/Jun/2017:00:29:20 -0400] "GET /www/var/sound/super.gif HTTP/1.0" 200 72666
Solution command:
Code:
awk '{ lines[last] = $0;} { IP_ADDRESS[$1]++; IP_LAST[$1] = substr($4,2) } END { for (i in IP_ADDRESS) print i,IP_ADDRESS[i], IP_LAST[i]};' OFS=" - " web_test.log | sort -n -k3 | column -t | tail -29
Solution - output:
Code:
diffrentsubdomain.domain.com  -  1  -  01/Jun/2017:00:29:20
nextsubdomain.domain.com      -  2  -  01/Jun/2017:00:39:22
subdomain.domain.com          -  2  -  01/Jun/2017:00:00:14
A screenshot is attached. I hope it helps a bit.


Once more, many thanks to all of you for your input, help and suggestions, and for pointing me in the right direction.

Greatly appreciated.
Attachment: script_web_log_parse.jpg (84.1 KB)
 
Old 06-23-2017, 06:59 AM   #11
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,068

Rep: Reputation: 7128
That is great. But it looks like { lines[last] = $0;} is not used, so you can remove it.
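
One caveat about the command in post #10: IP_LAST keeps whichever timestamp appears last in the file for each host, which is only the latest access if the log is in time order. In the line order of the first post, nextsubdomain.domain.com has 00:39:22 before 00:28:12. A possible sketch that compares the timestamps instead is below. It assumes the day/Mon/year:HH:MM:SS layout shown in the thread, ignores the time zone offset in the next field, drops the unused lines[last] rule, and uses array names that are made up for the illustration.

```shell
# The five sample lines in the order of the first post (not sorted by time).
weblog=$(mktemp)
cat > "$weblog" <<'EOF'
subdomain.domain.com - - [01/Jun/2017:00:00:06 -0900] "GET /www/var/index.html HTTP/1.0" 200 323985
nextsubdomain.domain.com - - [01/Jun/2017:00:39:22 -0900] "GET /mystory/past/primaryschool/images/12312314.gif HTTP/1.0" 200 10211267
subdomain.domain.com - - [01/Jun/2017:00:00:14 -0900] "GET /www/var/cool.gif HTTP/1.0" 200 4230310
nextsubdomain.domain.com - - [01/Jun/2017:00:28:12 -0900] "GET /mystory/past/primaryschool/info.html HTTP/1.0" 200 15121283
diffrentsubdomain.domain.com - - [01/Jun/2017:00:29:20 -0400] "GET /www/var/sound/super.gif HTTP/1.0" 200 72666
EOF

latest=$(awk '
    BEGIN {
        # Month name -> two-digit number, used to build a sortable key.
        split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
        for (m = 1; m <= 12; m++) month[names[m]] = sprintf("%02d", m)
    }
    {
        stamp = substr($4, 2)            # drop the leading "["
        split(stamp, t, "[/:]")          # t[1]=day t[2]=month t[3]=year t[4..6]=h,m,s
        key = t[3] month[t[2]] t[1] t[4] t[5] t[6]   # yyyymmddHHMMSS, compared as text
        count[$1]++
        if (key > best[$1]) { best[$1] = key; last[$1] = stamp }
    }
    END { for (host in count) print host, count[host], last[host] }
' OFS=" - " "$weblog" | sort)
printf '%s\n' "$latest"
rm -f "$weblog"
```

The trailing sort is only there to make the output order predictable, since "for (host in count)" visits the hosts in no particular order.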
 
  



Tags
apache, awk, awk regex

