Old 03-15-2016, 02:01 AM   #1
pix9
Member
 
Registered: Jan 2010
Location: Mumbai, India
Distribution: ArchLinux, Fedora 24, Centos 7.0
Posts: 177

Rep: Reputation: 19
Logic to count per-minute hits in a web-server log.


Hello friends,

I am trying to figure out some simple logic to pull a per-minute traffic count for a website out of its access logs. I don't want anyone to write the code for me; any simple math that I can use here would be great. In the meantime I will try to scribble some awk for this.

regards
Pix
 
Old 03-15-2016, 05:07 AM   #2
cliffordw
Member
 
Registered: Jan 2012
Location: South Africa
Posts: 509

Rep: Reputation: 203
Hi there,

One crude approach would be to simply cut the timestamp out of the log, and use "uniq -c" to count the lines per timestamp. Something like this:

Code:
cut -d'[' -f2 access_log | cut -d':' -f1-3 |uniq -c
     20 15/Mar/2016:07:35
     20 15/Mar/2016:07:36
      2 15/Mar/2016:07:37
      3 15/Mar/2016:08:34
     36 15/Mar/2016:08:47
     36 15/Mar/2016:08:48
      9 15/Mar/2016:09:03
     72 15/Mar/2016:09:08
     36 15/Mar/2016:09:25
      1 15/Mar/2016:10:19
I'm sure you can achieve the same in awk/perl/whatever, and for timestamps in any format necessary.
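For instance, a rough awk equivalent (just a sketch, assuming the bracketed timestamp is the 4th whitespace-separated field, as in the default combined log format) might look like this:

Code:
awk '{
       split($4, t, ":")                        # t[1]="[dd/Mon/yyyy", t[2]=HH, t[3]=MM
       key = substr(t[1], 2) ":" t[2] ":" t[3]  # dd/Mon/yyyy:HH:MM
       if (!(key in count)) order[++n] = key    # remember first-seen order
       count[key]++
     }
     END { for (i = 1; i <= n; i++) print count[order[i]], order[i] }' access_log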

Hope this helps...
 
Old 03-15-2016, 08:12 PM   #3
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,642
Blog Entries: 4

Rep: Reputation: 3933
The best approach is to treat this as a statistical project. Therefore: gather after-the-fact log files, and treat them strictly as a data set.

You will find that existing stats packages ... SAS®, SPSS®, and (open source ...) "R" ... are ready to help you.

The raw data set provides you with important values, such as by-the-second timestamps, in various well-known binary or ASCII formats. You then perform various summary statistics against these data, having first grouped them by some function applied to them, here, truncating the timestamp to the minute. (It may sound rather strange right now when I say that SQL's GROUP BY capability is, in fact, a simplification(!) of this idea.)
 
Old 03-16-2016, 01:17 AM   #4
pix9
Member
 
Registered: Jan 2010
Location: Mumbai, India
Distribution: ArchLinux, Fedora 24, Centos 7.0
Posts: 177

Original Poster
Rep: Reputation: 19
Quote:
Originally Posted by cliffordw View Post
Hi there,

One crude approach would be to simply cut the timestamp out of the log, and use "uniq -c" to count the lines per timestamp. Something like this:

Code:
cut -d'[' -f2 access_log | cut -d':' -f1-3 |uniq -c
     20 15/Mar/2016:07:35
     20 15/Mar/2016:07:36
      2 15/Mar/2016:07:37
      3 15/Mar/2016:08:34
     36 15/Mar/2016:08:47
     36 15/Mar/2016:08:48
      9 15/Mar/2016:09:03
     72 15/Mar/2016:09:08
     36 15/Mar/2016:09:25
      1 15/Mar/2016:10:19
I'm sure you can achieve the same in awk/perl/whatever, and for timestamps in any format necessary.

Hope this helps...
Sure, this is a simple approach; I will try to build my logic around it.
Thank you, everyone, for the replies.

regards
 
Old 03-16-2016, 02:24 AM   #5
pix9
Member
 
Registered: Jan 2010
Location: Mumbai, India
Distribution: ArchLinux, Fedora 24, Centos 7.0
Posts: 177

Original Poster
Rep: Reputation: 19
Dear cliffordw, your suggestion works well, but it only gives me the count. I would also like to separate out certain bad requests on the servers,
for example:
how many requests got HTTP 200 responses and how many got 500 in a particular minute.

Now the question is which approach would be the optimal one:
trying to achieve both in a single pass (i.e. getting the result from one complex-looking awk query), or
doing the same by executing two different jobs (I mean running two different commands and processing the same file twice) on the same file.
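For the single-pass option, something like this is what I have in mind; just a rough sketch, assuming the default combined log format where the bracketed timestamp is field 4 and the status code is field 9 (field numbers may differ for another LogFormat):

Code:
awk '{
       split($4, t, ":")
       min = substr(t[1], 2) ":" t[2] ":" t[3]   # dd/Mon/yyyy:HH:MM
       total[min]++
       status[min, $9]++                         # per-minute count for each HTTP status code
     }
     END {
       for (m in total)
         printf "%s total=%d ok_200=%d err_500=%d\n", m, total[m], status[m, 200], status[m, 500]
     }' access_log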

regards
 
Old 03-16-2016, 02:32 AM   #6
pix9
Member
 
Registered: Jan 2010
Location: Mumbai, India
Distribution: ArchLinux, Fedora 24, Centos 7.0
Posts: 177

Original Poster
Rep: Reputation: 19
Quote:
Originally Posted by sundialsvcs View Post
The best approach is to treat this as a statistical project. Therefore: gather after-the-fact log files, and treat them strictly as a data set.

You will find that existing stats packages ... SAS®, SPSS®, and (open source ...) "R" ... are ready to help you.

The raw data set provides you with important values, such as by-the-second timestamps, in various well-known binary or ASCII formats. You then perform various summary statistics against these data, having first grouped them by some function applied to them, here, truncating the timestamp to the minute. (It may sound rather strange right now when I say that SQL's GROUP BY capability is, in fact, a simplification(!) of this idea.)
Are you suggesting putting things into SQL?
To be honest, this is a kind of statistical project; I am doing it for my own analysis.
I will look into the readily available packages. I don't know if I really have the authority to use these on prod servers; on the other hand I have the freedom to make use of awk and grep ;-) which is what I am trying to leverage here.

regards
 
Old 03-16-2016, 09:06 AM   #7
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,642
Blog Entries: 4

Rep: Reputation: 3933
No, I'm not suggesting the use of SQL, per se, unless it is otherwise indicated. What I was referring to is simply the notion of grouping, and pointing out that SQL provides that capability albeit in a limited form.

Lots of companies wind up abusing Microsoft Excel and/or sometimes Microsoft Access (SQL ...) for this sort of thing. "R" is a curious, entirely open-source, and very powerful true statistics package.

You're analyzing server log-files after the fact, taking time-stamps and rolling them up to form groups. Then you're interested in statistical measures, particularly the mean (average ...) and standard deviation.
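For instance, feeding the per-minute counts from the cut | uniq -c pipeline shown earlier in this thread into a small awk pass (a quick sketch, not a substitute for a real stats package) will give you the mean and standard deviation of hits per minute:

Code:
# note: minutes with zero hits never appear in the uniq -c output, so they are not counted here
cut -d'[' -f2 access_log | cut -d':' -f1-3 | uniq -c |
awk '{ n++; sum += $1; sumsq += $1 * $1 }
     END { mean = sum / n
           printf "minutes=%d  mean=%.2f  stddev=%.2f\n", n, mean, sqrt(sumsq / n - mean * mean) }'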

Also(!) "google it." A search for apache log file analyzer produces many, many interesting hits.
 
Old 03-18-2016, 03:44 PM   #8
Ramurd
Member
 
Registered: Mar 2009
Location: Rotterdam, the Netherlands
Distribution: Slackwarelinux
Posts: 703

Rep: Reputation: 111
An Apache log line contains handy information, so one easy way to work with it is to read the log file with a (perl, for example) script and, on certain values, increase counters or build hashes/arrays. However, you need to be sure what format the logs actually provide (and if you change that format, you may have to change your script accordingly). Perl was my language of choice due to its good documentation and very good performance on text data.

I had to deal with something similar myself, but the traffic became too high to handle with simple one-liners (and the various kinds of statistics made it hard as well), so I decided to read the log files on the fly (the perl equivalent of 'tail -f'), break each log line down into its parts, and put it in a database. Now I can compute all kinds of statistics with simple SQL queries, and I usually have the statistics I want within a few minutes at most. It's the most flexible approach.

I have simple queries calculating the average hits per hour for the same weekday and hour, based on the data of the past X months. The database now contains 700 million rows of log lines collected over the course of a few years, so that should count as quite some data, but with it I can get some interesting trending data.
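Purely as an illustration (assuming a simplified, hypothetical table access_log with an ISO-8601 ts text column, not my real schema), such an average-per-weekday-and-hour query could look roughly like this in sqlite3:

Code:
sqlite3 weblog.db <<'SQL'
-- hits per (weekday, hour) bucket over the last 3 months,
-- averaged over the number of distinct days seen in each bucket
SELECT strftime('%w', ts) AS weekday,
       strftime('%H', ts) AS hour,
       COUNT(*) * 1.0 / COUNT(DISTINCT date(ts)) AS avg_hits
FROM access_log
WHERE ts >= datetime('now', '-3 months')
GROUP BY weekday, hour
ORDER BY weekday, hour;
SQL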
 
Old 03-18-2016, 08:26 PM   #9
aspire1
Member
 
Registered: Dec 2008
Distribution: Ubuntu
Posts: 62

Rep: Reputation: 23
You could also use the pretty common ELK stack (Elasticsearch/Logstash/Kibana) rather than rolling your own, and google some tutorials.
 
1 members found this post helpful.
Old 03-22-2016, 02:40 AM   #10
pix9
Member
 
Registered: Jan 2010
Location: Mumbai, India
Distribution: ArchLinux, Fedora 24, Centos 7.0
Posts: 177

Original Poster
Rep: Reputation: 19
Hello everyone, currently I am using the following method for this. If there is a better way, kindly let me know.

Code:
# MINUTES = number of minutes to scan
for COUNT in $(seq 1 "${MINUTES}")
do
    ZERODATE=$(date +"%d/%b/%Y:%H:%M" -d "+ ${COUNT} minute")
    # the timestamp contains slashes, so replace them with '-' for the temp file name
    OUTFILE="/tmp/${ZERODATE//\//-}.log"
    grep "${ZERODATE}" "${LOGFILE}" | tee "${OUTFILE}"
    foo-command "${OUTFILE}"
    sleep 1
done

regards
 
Old 03-23-2016, 03:32 AM   #11
Ramurd
Member
 
Registered: Mar 2009
Location: Rotterdam, the Netherlands
Distribution: Slackwarelinux
Posts: 703

Rep: Reputation: 111
Thumbs up

Quote:
Originally Posted by pix9 View Post
Hello everyone, currently I am using the following method for this. If there is a better way, kindly let me know.

Code:
# MINUTES = number of minutes to scan
for COUNT in $(seq 1 "${MINUTES}")
do
    ZERODATE=$(date +"%d/%b/%Y:%H:%M" -d "+ ${COUNT} minute")
    # the timestamp contains slashes, so replace them with '-' for the temp file name
    OUTFILE="/tmp/${ZERODATE//\//-}.log"
    grep "${ZERODATE}" "${LOGFILE}" | tee "${OUTFILE}"
    foo-command "${OUTFILE}"
    sleep 1
done

regards
So, for the metrics of one day you'd have to iterate 1440 times over the same logfile? That hardly sounds efficient; 60 iterations per hour is not that efficient either. I'd rather iterate over the file once and do all the computations in that single pass.

Below is something along the same lines; more efficient, but not a best practice either:

Code:
while read -r line
do
  # field 4 of the default combined log format holds "[dd/Mon/yyyy:HH:MM:SS"; adjust to your log format
  MINUTE="$(echo "${line}" | cut -d ' ' -f 4 | cut -d ':' -f 3)"
  HOUR="$(echo "${line}" | cut -d ' ' -f 4 | cut -d ':' -f 2)"

  echo "${line}" >> "${HOUR}_${MINUTE}.log"

done < "${LOGFILE}"
 
Old 03-23-2016, 05:09 AM   #12
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
http://www.linuxquestions.org/questi...1/#post5076156
 
Old 03-23-2016, 07:38 AM   #13
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,642
Blog Entries: 4

Rep: Reputation: 3933
Another thing that I saw done, which was rather interesting (and with a different sort of log ...) was to take random samples from the log file and analyze them. IIRC, you could configure upper and lower bounds for: the starting position, the number of records to be sampled, how many to select from each range, and how many samples to take. With surprisingly small amounts of data, very useful statistics could be obtained on-the-fly without impeding the other processes which were continuously writing to it.
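On a modern Linux box, a crude stand-in for that kind of sampling (not the tool I was describing, just a quick approximation with coreutils' shuf) would be:

Code:
# take 1000 random lines, then count them per minute as before
shuf -n 1000 access_log | cut -d'[' -f2 | cut -d':' -f1-3 | sort | uniq -c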
 
  

