LinuxQuestions.org
Forums > Non-*NIX Forums > Programming
Old 06-10-2006, 11:42 AM   #1
froglinux
LQ Newbie
 
Registered: Jun 2006
Posts: 12

Rep: Reputation: 0
Pattern count in a file --> scripting


Hi group,

Below is a sample from an Apache (httpd) access log. I want a script that can work out the busiest date. The file itself is very long, with lines like these:



60.231.97.192 - - [03/Jan/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [03/Jan/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [03/Jan/2005:11:43:27 +1100] "GET"
60.231.97.196 - - [04/Jan/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [05/Feb/2005:11:43:27 +1100] "GET"
60.231.97.195 - - [05/Feb/2005:11:43:27 +1100] "GET"
60.231.97.194 - - [04/Mar/2005:11:43:27 +1100] "GET"
60.231.97.193 - - [04/Mar/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [06/Feb/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [06/Feb/2005:11:43:27 +1100] "GET"
60.231.97.191 - - [06/Mar/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [06/Mar/2005:11:43:27 +1100] "GET"


I can count a given date with grep, which works perfectly:

# grep -c 03/Jan/2005 access.log
3

but that way I have to supply the pattern manually, i.e. 03/Jan/2005.

Is there a good way to write a script that will automatically give me the busiest day of the year? The file holds more than 4 years of access records, so this is only practical with a good script.

Any thoughts are welcome.
 
Old 06-10-2006, 12:11 PM   #2
spirit receiver
Member
 
Registered: May 2006
Location: Frankfurt, Germany
Distribution: SUSE 10.2
Posts: 424

Rep: Reputation: 33
Does the following work?
Code:
sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET\"$|\1|p" access.log | uniq -c | sort | tail -1
Note that
  • You said you were looking for the busiest day of the year, but this will of course give one of the busiest days across all four years. To restrict it to a single year, split the file into separate years using grep.
  • Instead of picking the last entry using "tail -1", you could look for all days that were as busy as this one.
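To sketch the per-year split from the first bullet (the sample data below is hypothetical, made up to match the log format in this thread; only the file name access.log comes from the post). Note the sort placed before uniq -c: uniq only merges adjacent duplicate lines, so the dates must be grouped first.

```shell
#!/bin/sh
# Hypothetical sample lines in the thread's log format.
cat > access.log <<'EOF'
60.231.97.192 - - [03/Jan/2005:11:43:27 +1100] "GET /a HTTP/1.0" 200 10
60.231.97.192 - - [03/Jan/2005:11:44:27 +1100] "GET /b HTTP/1.0" 200 10
60.231.97.196 - - [04/Jan/2005:11:43:27 +1100] "GET /c HTTP/1.0" 200 10
64.68.82.55 - - [18/Apr/2004:22:55:14 +1000] "GET /d HTTP/1.0" 200 10
EOF

# Restrict to one year with grep, extract the date with sed, then count.
year=2005
grep "/$year:" access.log \
  | sed -ne 's|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*|\1|p' \
  | sort | uniq -c | sort -n | tail -1
```

This prints the busiest day of the chosen year together with its hit count (here, 2 hits on 03/Jan/2005).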
 
Old 06-10-2006, 09:23 PM   #3
froglinux
LQ Newbie
 
Registered: Jun 2006
Posts: 12

Original Poster
Rep: Reputation: 0
Hi spirit receiver,

Thanks for replying. See below; nothing seems to happen when I run this command.

mail:~/tt #
mail:~/tt # ls -l
total 11461
drwxr-xr-x 2 root root 200 Jun 11 12:11 .
drwx------ 26 root root 2040 Jun 11 12:10 ..
-rwxr-xr-x 1 root root 5850517 Jun 11 12:11 access.log
mail:~/tt #
mail:~/tt #
mail:~/tt # sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET\"$|\1|p" access.log | uniq -c | sort | tail -10
mail:~/tt # sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET\"$|\1|p" access.log | uniq -c | sort | tail -1
mail:~/tt #

I think the format of my access.log file is different from the one I posted. See below:

mail:~/tt #
mail:~/tt #
mail:~/tt # head -n 5 access.log
64.68.82.55 - - [18/Apr/2004:22:55:14 +1000] "GET /~jjessiman/lego/4488.html HTTP/1.0" 200 1337
134.115.68.21 - - [18/Apr/2004:22:59:56 +1000] "GET /ASGAP/gif/asgaplo1.gif HTTP/1.0" 200 1465
203.40.195.112 - - [18/Apr/2004:23:01:50 +1000] "GET /ASGAP/jpg/980530s.jpg HTTP/1.1" 200 7639
203.40.195.112 - - [18/Apr/2004:23:01:55 +1000] "GET /ASGAP/jpg/920803s.jpg HTTP/1.1" 200 5578
194.75.245.98 - - [18/Apr/2004:23:03:56 +1000] "GET /ASGAP/gif/exclam.gif HTTP/1.0" 200 1229
mail:~/tt #
mail:~/tt #
mail:~/tt #


That could be why nothing comes up when the script runs.

1) Based on the logs above, please suggest a new one.

2) Also, the access.log file has more than 4 years of records, e.g. 2001, 2002, 2003, 2005, etc. Will this script match those as well?

3) What should be changed in the script if I want the top 10 busiest dates?

Thanks in advance.

Last edited by froglinux; 06-10-2006 at 10:03 PM.
 
Old 06-11-2006, 06:02 AM   #4
spirit receiver
Member
 
Registered: May 2006
Location: Frankfurt, Germany
Distribution: SUSE 10.2
Posts: 424

Rep: Reputation: 33
The sed search pattern doesn't match the other format. Try the following:
Code:
sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET.*$|\1|p" access.log | uniq -c | sort | tail -1
The command doesn't care how many years are in access.log; it just picks one of the busiest dates in the entire file.
As for your third question: remove "| tail -1" from the command and have a look at its output. I guess you'll see how to extract the information you want.
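Building on the command above, a sketch of the top-10 variant: sort the counts numerically in descending order and keep the first ten lines. The sample data here is made up for illustration, and a sort is added before uniq -c so that non-adjacent repeats of a date are still counted together.

```shell
#!/bin/sh
# Hypothetical sample log: two days, one busier than the other.
cat > access.log <<'EOF'
203.40.195.112 - - [18/Apr/2004:23:01:50 +1000] "GET /x HTTP/1.1" 200 7639
203.40.195.112 - - [18/Apr/2004:23:01:55 +1000] "GET /y HTTP/1.1" 200 5578
194.75.245.98 - - [19/Apr/2004:23:03:56 +1000] "GET /z HTTP/1.0" 200 1229
EOF

# Top 10 busiest dates: count hits per date, busiest first.
sed -ne 's|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*|\1|p' access.log \
  | sort | uniq -c | sort -rn | head -10
```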
 
Old 06-13-2006, 05:39 AM   #5
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239Reputation: 239Reputation: 239
Try this:

Code:
#!/usr/bin/perl -w

while (<>) {
    s/:.*//;          # drop everything from the first colon (the time part)
    s/.*\[//;         # drop everything up to and including the opening bracket
    s|/||g;           # strip the slashes: 18/Apr/2004 -> 18Apr2004
    chomp;
    $total{$_} += 1;  # tally hits per date
}
foreach $date ( sort {$total{$b} <=> $total{$a}} keys(%total) ) {
    print STDOUT "$date had $total{$date} hits\n";
}
 
Old 06-17-2006, 05:23 AM   #6
froglinux
LQ Newbie
 
Registered: Jun 2006
Posts: 12

Original Poster
Rep: Reputation: 0
Exclamation

Quote:
Originally Posted by spirit receiver
The sed search pattern doesn't match the other format. Try the following:
Code:
sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET.*$|\1|p" access.log | uniq -c | sort | tail -1
The command doesn't care how many years are in access.log, it just takes one of the busiest dates in the entire file.
In view of your third question, remove "| tail -1" from the command and have a look at its output, I guess you'll see how to extract the information you want.
Hi,

The above works, but I am not getting good results. The problem: counting a date with grep gives, say, 230 matches, while the script only shows 228.

# grep -c 20/April/2004 access.log

230

So the actual count is 230 (the number of lines for 20 April).

When I run the script mentioned above, see below:

sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET.*$|\1|p" access.log | uniq -c | sort | tail -1

228

But the result should be 230.

That's the problem: the script is not producing accurate counts.

Where could the problem be?

Cheers!

Frog
 
Old 06-17-2006, 07:41 AM   #7
spirit receiver
Member
 
Registered: May 2006
Location: Frankfurt, Germany
Distribution: SUSE 10.2
Posts: 424

Rep: Reputation: 33
The reason is of course that you were only grepping for the date (20/April/2004; I guess it should rather be 20/Apr/2004), whereas the sed command looks for lines containing "- - [date:" and the word "GET". You'll have to state precisely what you want. If you're only interested in lines containing the date, either of the following will probably do, where the second is modelled after bigearsbilly's suggestion:
Code:
sed -ne "s|^.*\([0-9]\{2\}/[a-zA-Z]\{3\}/[0-9]\{4\}\).*$|\1|p"
sed -ne "s|^[^\[]*\[\([^:]*\):.*$|\1|p"
 
Old 06-17-2006, 01:55 PM   #8
sorin25
LQ Newbie
 
Registered: Sep 2005
Location: Romania/Bucharest
Distribution: Ferdora Core 3
Posts: 7

Rep: Reputation: 0
Maybe this will help you:
Code:
grep -E -o '[0-9]{2}/[A-Za-z]{3}/2006' access.log | sort | uniq -c | sort -n
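That pattern is pinned to the year 2006; generalizing the year field covers every year in the file, and a final tail picks out the single busiest day. The sample data below is made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample data spanning two years.
cat > access.log <<'EOF'
a - - [18/Apr/2004:22:55:14 +1000] "GET / HTTP/1.0" 200 1
b - - [18/Apr/2004:22:59:56 +1000] "GET / HTTP/1.0" 200 1
c - - [03/Jan/2005:11:43:27 +1100] "GET / HTTP/1.0" 200 1
EOF

# Match any DD/Mon/YYYY date, count per date, busiest day last.
grep -E -o '[0-9]{2}/[A-Za-z]{3}/[0-9]{4}' access.log \
  | sort | uniq -c | sort -n | tail -1
```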
 
  



Tags
counting, file, matching, pattern


