LinuxQuestions.org
Forums > Non-*NIX Forums > Programming
Old 06-10-2006, 11:42 AM   #1
froglinux
LQ Newbie
 
Registered: Jun 2006
Posts: 12

Rep: Reputation: 0
Pattern count in a file --> scripting


Hi group,

Below is a sample from an Apache (httpd) access log. I want a script that can work out the busiest date. The file itself is very long, with lines like these:



60.231.97.192 - - [03/Jan/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [03/Jan/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [03/Jan/2005:11:43:27 +1100] "GET"
60.231.97.196 - - [04/Jan/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [05/Feb/2005:11:43:27 +1100] "GET"
60.231.97.195 - - [05/Feb/2005:11:43:27 +1100] "GET"
60.231.97.194 - - [04/Mar/2005:11:43:27 +1100] "GET"
60.231.97.193 - - [04/Mar/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [06/Feb/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [06/Feb/2005:11:43:27 +1100] "GET"
60.231.97.191 - - [06/Mar/2005:11:43:27 +1100] "GET"
60.231.97.192 - - [06/Mar/2005:11:43:27 +1100] "GET"


I can count a given date with grep, which works perfectly:

# grep -c 03/Jan/2005 access.log
3

but that way I have to supply the pattern manually, i.e. 03/Jan/2005.

Is there a good way to write a script that will automatically give me the busiest day of the year? The file holds more than 4 years of access records, so this is only practical with a good script.

Any thoughts are welcome.
 
Old 06-10-2006, 12:11 PM   #2
spirit receiver
Member
 
Registered: May 2006
Location: Frankfurt, Germany
Distribution: SUSE 10.2
Posts: 424

Rep: Reputation: 33
Does the following work?
Code:
sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET\"$|\1|p" access.log | uniq -c | sort | tail -1
Note that
  • You said you were looking for the busiest day of the year, but this will of course give one of the busiest days across all four years. To restrict it to a single year, split the file into separate years using grep.
  • Instead of picking the last entry using "tail -1", you could look for all days that were as busy as this one.
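To sketch the per-year split from the first bullet (the sample data below is hypothetical, made up to match the log format in this thread; only the file name access.log comes from the post). Note the sort placed before uniq -c: uniq only merges adjacent duplicate lines, so the dates must be grouped first.

```shell
#!/bin/sh
# Hypothetical sample lines in the thread's log format.
cat > access.log <<'EOF'
60.231.97.192 - - [03/Jan/2005:11:43:27 +1100] "GET /a HTTP/1.0" 200 10
60.231.97.192 - - [03/Jan/2005:11:44:27 +1100] "GET /b HTTP/1.0" 200 10
60.231.97.196 - - [04/Jan/2005:11:43:27 +1100] "GET /c HTTP/1.0" 200 10
64.68.82.55 - - [18/Apr/2004:22:55:14 +1000] "GET /d HTTP/1.0" 200 10
EOF

# Restrict to one year with grep, extract the date with sed, then count.
year=2005
grep "/$year:" access.log \
  | sed -ne 's|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*|\1|p' \
  | sort | uniq -c | sort -n | tail -1
```

This prints the busiest day of the chosen year together with its hit count (here, 2 hits on 03/Jan/2005).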
 
Old 06-10-2006, 09:23 PM   #3
froglinux
LQ Newbie
 
Registered: Jun 2006
Posts: 12

Original Poster
Rep: Reputation: 0
Hi spirit receiver,

Thanks for replying. See below; nothing seems to happen when I run this command.

mail:~/tt #
mail:~/tt # ls -l
total 11461
drwxr-xr-x 2 root root 200 Jun 11 12:11 .
drwx------ 26 root root 2040 Jun 11 12:10 ..
-rwxr-xr-x 1 root root 5850517 Jun 11 12:11 access.log
mail:~/tt #
mail:~/tt #
mail:~/tt # sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET\"$|\1|p" access.log | uniq -c | sort | tail -10
mail:~/tt # sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET\"$|\1|p" access.log | uniq -c | sort | tail -1
mail:~/tt #

I think the format of my access.log file is different from the one I posted. See below:

mail:~/tt #
mail:~/tt #
mail:~/tt # head -n 5 access.log
64.68.82.55 - - [18/Apr/2004:22:55:14 +1000] "GET /~jjessiman/lego/4488.html HTTP/1.0" 200 1337
134.115.68.21 - - [18/Apr/2004:22:59:56 +1000] "GET /ASGAP/gif/asgaplo1.gif HTTP/1.0" 200 1465
203.40.195.112 - - [18/Apr/2004:23:01:50 +1000] "GET /ASGAP/jpg/980530s.jpg HTTP/1.1" 200 7639
203.40.195.112 - - [18/Apr/2004:23:01:55 +1000] "GET /ASGAP/jpg/920803s.jpg HTTP/1.1" 200 5578
194.75.245.98 - - [18/Apr/2004:23:03:56 +1000] "GET /ASGAP/gif/exclam.gif HTTP/1.0" 200 1229
mail:~/tt #
mail:~/tt #
mail:~/tt #


That could be why nothing comes up when the script runs.

1) Based on the logs above, please suggest a new one.

2) Also, the access.log file has more than 4 years of records, e.g. 2001, 2002, 2003, 2005, etc. Will this script match those as well?

3) What should be changed in the script if I want the top 10 busiest dates?

Thanks in advance.

Last edited by froglinux; 06-10-2006 at 10:03 PM.
 
Old 06-11-2006, 06:02 AM   #4
spirit receiver
Member
 
Registered: May 2006
Location: Frankfurt, Germany
Distribution: SUSE 10.2
Posts: 424

Rep: Reputation: 33
The sed search pattern doesn't match the other format. Try the following:
Code:
sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET.*$|\1|p" access.log | uniq -c | sort | tail -1
The command doesn't care how many years are in access.log; it just picks one of the busiest dates in the entire file.
As for your third question: remove "| tail -1" from the command and have a look at its output. I guess you'll see how to extract the information you want.
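Building on the command above, a sketch of the top-10 variant: sort the counts numerically in descending order and keep the first ten lines. The sample data here is made up for illustration, and a sort is added before uniq -c so that non-adjacent repeats of a date are still counted together.

```shell
#!/bin/sh
# Hypothetical sample log: two days, one busier than the other.
cat > access.log <<'EOF'
203.40.195.112 - - [18/Apr/2004:23:01:50 +1000] "GET /x HTTP/1.1" 200 7639
203.40.195.112 - - [18/Apr/2004:23:01:55 +1000] "GET /y HTTP/1.1" 200 5578
194.75.245.98 - - [19/Apr/2004:23:03:56 +1000] "GET /z HTTP/1.0" 200 1229
EOF

# Top 10 busiest dates: count hits per date, busiest first.
sed -ne 's|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*|\1|p' access.log \
  | sort | uniq -c | sort -rn | head -10
```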
 
Old 06-13-2006, 05:39 AM   #5
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239Reputation: 239Reputation: 239
Try this:

Code:
#!/usr/bin/perl -w

while (<>) {
    s/:.*//;          # drop everything from the first colon (the time part)
    s/.*\[//;         # drop everything up to and including the opening bracket
    s|/||g;           # strip the slashes: 18/Apr/2004 -> 18Apr2004
    chomp;
    $total{$_} += 1;  # tally hits per date
}
foreach $date ( sort {$total{$b} <=> $total{$a}} keys(%total) ) {
    print STDOUT "$date had $total{$date} hits\n";
}
 
Old 06-17-2006, 05:23 AM   #6
froglinux
LQ Newbie
 
Registered: Jun 2006
Posts: 12

Original Poster
Rep: Reputation: 0
Exclamation

Quote:
Originally Posted by spirit receiver
The sed search pattern doesn't match the other format. Try the following:
Code:
sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET.*$|\1|p" access.log | uniq -c | sort | tail -1
The command doesn't care how many years are in access.log, it just takes one of the busiest dates in the entire file.
In view of your third question, remove "| tail -1" from the command and have a look at its output, I guess you'll see how to extract the information you want.
Hi,

The above works, but I am not getting good results. The problem: counting a date with grep gives, say, 230 matches, while the script only shows 228.

# grep -c 20/April/2004 access.log

230

So the actual count is 230 (the number of lines for 20 April).

When I run the script mentioned above, see below:

sed -ne "s|^.* - - \[\([0-9a-zA-Z/]\{11\}\):.*\"GET.*$|\1|p" access.log | uniq -c | sort | tail -1

228

But the result should be 230.

That's the problem: the script is not producing accurate counts.

Where could the problem be?

Cheers!

Frog
 
Old 06-17-2006, 07:41 AM   #7
spirit receiver
Member
 
Registered: May 2006
Location: Frankfurt, Germany
Distribution: SUSE 10.2
Posts: 424

Rep: Reputation: 33
The reason is of course that you were only grepping for the date (20/April/2004; I guess it should rather be 20/Apr/2004), whereas the sed command looks for lines containing "- - [date:" and the word "GET". You'll have to state precisely what you want. If you're only interested in lines containing the date, either of the following will probably do, where the second is modelled after bigearsbilly's suggestion:
Code:
sed -ne "s|^.*\([0-9]\{2\}/[a-zA-Z]\{3\}/[0-9]\{4\}\).*$|\1|p"
sed -ne "s|^[^\[]*\[\([^:]*\):.*$|\1|p"
 
Old 06-17-2006, 01:55 PM   #8
sorin25
LQ Newbie
 
Registered: Sep 2005
Location: Romania/Bucharest
Distribution: Ferdora Core 3
Posts: 7

Rep: Reputation: 0
Maybe this will help you:
Code:
grep -E -o '[0-9]{2}/[A-Za-z]{3}/2006' access.log | sort | uniq -c | sort -n
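That pattern is pinned to the year 2006; generalizing the year field covers every year in the file, and a final tail picks out the single busiest day. The sample data below is made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample data spanning two years.
cat > access.log <<'EOF'
a - - [18/Apr/2004:22:55:14 +1000] "GET / HTTP/1.0" 200 1
b - - [18/Apr/2004:22:59:56 +1000] "GET / HTTP/1.0" 200 1
c - - [03/Jan/2005:11:43:27 +1100] "GET / HTTP/1.0" 200 1
EOF

# Match any DD/Mon/YYYY date, count per date, busiest day last.
grep -E -o '[0-9]{2}/[A-Za-z]{3}/[0-9]{4}' access.log \
  | sort | uniq -c | sort -n | tail -1
```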
 
  



Tags
counting, file, matching, pattern


