Pattern count in a file --> scripting
Hi Group,
Please see the following Apache (httpd) access log. I want a script that can find the busiest date. The file is very long and contains lines like these:
60.231.97.192 - - [03/Jan/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [03/Jan/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [03/Jan/2005:11:43:27 +1100] "GET"
60.231.97.196 - - [04/Jan/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [05/Feb/2005:11:43:27 +1100] "GET"
60.231.97.195 - - [05/Feb/2005:11:43:27 +1100] "GET"
60.231.97.194 - - [04/Mar/2005:11:43:27 +1100] "GET"
60.231.97.193 - - [04/Mar/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [06/Feb/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [06/Feb/2005:11:43:27 +1100] "GET"
60.231.97.191 - - [06/Mar/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [06/Mar/2005:11:43:27 +1100] "GET"
I can do it with grep as below, which works perfectly:
# less access.log |grep -c 03/Jan/2005
03
But with this I have to type the pattern manually, i.e. 03/Jan/2005. Is there a good way to write a script that will automatically give me the busiest day of the year? The file holds more than 4 years of access records, so it is only practical with a good script. Any thoughts are welcome.
Does the following work?
Code:
sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET\"$|\1|p" access.log | uniq -c | sort | tail -1
Hi Spript,
Thanks for replying. See below; nothing seems to happen when I run this script.
mail:~/tt # ls -l
total 11461
drwxr-xr-x 2 root root 200 Jun 11 12:11 .
drwx------ 26 root root 2040 Jun 11 12:10 ..
-rwxr-xr-x 1 root root 5850517 Jun 11 12:11 access.log
mail:~/tt # sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET\"$|\1|p" access.log | uniq -c | sort | tail -10
mail:~/tt # sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET\"$|\1|p" access.log | uniq -c | sort | tail -1
mail:~/tt #
I think the format of my access.log file is different from what I posted. See below:
mail:~/tt # head -n 5 access.log
64.68.82.55 - - [18/Apr/2004:22:55:14 +1000] "GET /~jjessiman/lego/4488.html HTTP/1.0" 200 1337
134.115.68.21 - - [18/Apr/2004:22:59:56 +1000] "GET /ASGAP/gif/asgaplo1.gif HTTP/1.0" 200 1465
203.40.195.112 - - [18/Apr/2004:23:01:50 +1000] "GET /ASGAP/jpg/980530s.jpg HTTP/1.1" 200 7639
203.40.195.112 - - [18/Apr/2004:23:01:55 +1000] "GET /ASGAP/jpg/920803s.jpg HTTP/1.1" 200 5578
194.75.245.98 - - [18/Apr/2004:23:03:56 +1000] "GET /ASGAP/gif/exclam.gif HTTP/1.0" 200 1229
That could be why nothing comes up after running the script.
1) Based on the logs above, please suggest a new one.
2) The access.log file has more than 4 years of records, e.g. 2001, 2002, 2003, 2005, etc. Will this script match those as well?
3) What should be changed in the script if I want the top 10 busiest dates?
Thanks in advance.
The sed search pattern doesn't match the other format. Try the following:
Code:
sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET.*$|\1|p" access.log | uniq -c | sort | tail -1
In view of your third question, remove "| tail -1" from the command and have a look at its output; I guess you'll see how to extract the information you want.
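For the top-10 question, the pipeline above can be wrapped up roughly as follows. This is only a sketch: the function name `top_dates` and its arguments are made up for illustration, and it assumes the date is the first bracketed field on each line, as in the sample log. Two details differ from the one-liner: a `sort` runs before `uniq -c` (which only counts adjacent duplicates, so it relies on the log being in date order otherwise), and the final sort is numeric.

```shell
# top_dates FILE [N]: print the N busiest dates in FILE, highest count first.
# (top_dates is a hypothetical helper name used only in this sketch.)
top_dates() {
    file=$1
    n=${2:-10}
    # Keep only the dd/Mon/yyyy part that follows the first "[" on each line.
    sed -n 's|^[^[]*\[\([0-9]\{2\}/[A-Za-z]\{3\}/[0-9]\{4\}\):.*|\1|p' "$file" |
        sort |            # group identical dates so uniq -c sees them together
        uniq -c |         # prefix each distinct date with its line count
        sort -rn |        # numeric sort, largest count first
        head -n "$n"      # keep the top N
}
```

Something like `top_dates access.log 10` would then list the ten busiest dates, and `top_dates access.log 1` only the busiest one. Because the pattern does not look at the year or the request method, it should cover every year in the file and count all requests, not just GET.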
try this:
Code:
#!/usr/bin/perl -w |
Quote:
The above works, but I am not getting the right results. The problem is that where grep counts a date, say, 20 times, this script shows only 18.
# less access.log |grep 20/April/2004 <enter>
230
So the actual result is 230 (the number of lines for 20 April). When I run the script mentioned above, see below:
sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET.*$|\1|p" access.log | uniq -c | sort | tail -1
228
The result should be 230. That is the problem: the script is not giving the actual counts. Where could the problem be?
Cheers! Frog
The reason is of course that you were only grepping for the date (20/April/2004; I guess it should rather be 20/Apr/2004), whereas the sed command looks for lines containing "- - [date:" and the word "GET". You'll have to say precisely what you want. If you're only interested in lines containing the date, either of the following will probably do; the second is modelled after bigearsbilly's suggestion:
Code:
sed -ne "s|^.*\([0-9]\{2\}/[a-zA-Z]\{3\}/[0-9]\{4\}\).*$|\1|p" |
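To see which lines account for a gap like 230 versus 228, one option is to list the lines that contain the date but do not match the stricter "- - [date: ... GET" pattern. The sketch below is only an illustration; `skipped_lines` is a made-up name, and the second grep simply restates the earlier sed pattern as a filter.

```shell
# skipped_lines FILE DATE: show lines for DATE that the GET-anchored
# sed pattern would not count. (Hypothetical helper for this sketch.)
skipped_lines() {
    file=$1
    date=$2
    # First keep every line mentioning the date, as the plain grep did,
    # then drop the ones the stricter pattern would have matched.
    grep -F "$date" "$file" |
        grep -v ' - - \[[0-9a-zA-Z/]\{11\}:.*"GET'
}
```

Running something like `skipped_lines access.log 20/Apr/2004` should print the lines in question. Likely candidates are requests with another method such as POST or HEAD, or entries where the user field is not "-", but that is a guess until the output is inspected.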
Maybe this will help you:
Code:
cat access.log| grep -E -o '[0-9]{2}/[A-Za-z]{3}/2006' | sort | uniq -c | sort -n
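Since the log spans several years and the command above is fixed to 2006, the same idea can be extended to report the busiest day of each year. This is a rough sketch under a few assumptions: `busiest_per_year` is a made-up name, `grep -o` is available (it is in GNU and BSD grep but is not required by POSIX), and no URL in the log happens to contain text shaped like dd/Mon/yyyy.

```shell
# busiest_per_year FILE: for each year, print "year date count" for the
# date with the most lines. (Hypothetical helper for this sketch.)
busiest_per_year() {
    grep -E -o '[0-9]{2}/[A-Za-z]{3}/[0-9]{4}' "$1" |
        sort | uniq -c |
        awk '{
            # $1 is the count, $2 is the date; the year is its third part.
            split($2, d, "/"); year = d[3]
            if ($1 + 0 > max[year]) { max[year] = $1 + 0; day[year] = $2 }
        }
        END { for (y in max) print y, day[y], max[y] }' |
        sort              # order the summary by year
}
```

If two dates in a year tie for the highest count, this keeps whichever one appears first after sorting, so it is worth checking the full `uniq -c` output when the numbers are close.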