Logic to count per-minute hits in a web-server log.
Hello friends,
I am trying to figure out some simple logic to pull per-minute traffic counts for a website out of its access logs. I don't want anyone to write the code for me; any simple math that I can use here would be great. In the meantime I will try to scribble some awk for this.
The best approach is to treat this as a statistical project: gather after-the-fact log files and treat them strictly as a data set.
You will find that existing stats packages ... SAS®, SPSS®, and (open source ...) "R" ... are ready to help you.
The raw data set provides you with important values, such as by-the-second timestamps in various well-known binary or ASCII formats. You can then perform various summary statistics against these data, having grouped them by various mathematical functions applied to them. (It may sound rather strange right now, to many of you, when I say that SQL's GROUP BY capability is, in fact, a simplification(!) of this idea.)
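For instance, here is a minimal illustration of "grouping by a function of the timestamp", assuming GNU date; the sample timestamp is a made-up stand-in:

# Integer-dividing the epoch seconds by 60 maps every hit that falls in
# the same clock minute to the same bucket number -- in effect, a GROUP BY key.
epoch=$(date -d "25 Dec 2024 13:05:59" +%s)
echo $(( epoch / 60 ))    # the minute bucket this hit belongs to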
Dear cliffordw, your suggestion works well, but it only gives me a count. I would also like to separate out certain bad requests on the servers,
for example:
how many requests were HTTP 200 responses and how many were 500 in a particular minute.
Now the question here is which approach would be the optimal one:
trying to achieve both in a single instance (getting the result from one complex-looking awk query), or
doing the same by executing two different jobs (I mean running two different commands and processing the same file twice) on the same file.
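A single pass can cover both at once. Here is a sketch, assuming the common combined log format (bracketed timestamp in field 4, status code in field 9) and a stand-in filename access.log:

awk '{
    minute = substr($4, 2, 17)           # e.g. 25/Dec/2024:13:05
    hits[minute]++                       # total hits in that minute
    if ($9 == 200) ok[minute]++          # HTTP 200 responses
    else if ($9 == 500) err[minute]++    # HTTP 500 responses
}
END {
    for (m in hits)
        printf "%s total=%d 200=%d 500=%d\n", m, hits[m], ok[m], err[m]
}' access.log

Reading the file once and keeping separate counters avoids the second pass entirely; pipe the output through sort if you want a stable ordering.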
Quote:
The best approach is to treat this as a statistical project ... SQL's GROUP BY capability is, in fact, a simplification(!) of this idea.
Are you suggesting putting things into SQL?
To be honest this is kind of a statistical project; I am doing it for my own analysis.
I will try to find the readily available packages, though I don't know if I really have the authority to use them on prod servers. On the other hand I have the freedom to make use of awk and grep ;-) which is what I am trying to leverage here.
No, I'm not suggesting the use of SQL, per se, unless it is otherwise indicated. What I was referring to is simply the notion of grouping, and pointing out that SQL provides that capability albeit in a limited form.
Lots of companies wind up abusing Microsoft Excel and/or sometimes Microsoft Access (SQL ...) for this sort of thing. "R" is a curious, entirely open-source, and very powerful "true statistics package."
You're analyzing server log-files after the fact, taking time-stamps and rolling them up to form groups. Then, you're interested in statistical measures, particularly the mean (average ...) and the standard deviation.
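For instance, here is a sketch of those two measures over per-minute hit counts, again assuming a combined-format access.log:

# The first awk emits the per-minute hit counts; the second computes the
# mean and (population) standard deviation of those counts.
awk '{ count[substr($4, 2, 17)]++ } END { for (m in count) print count[m] }' access.log |
awk '{ n++; sum += $1; sumsq += $1 * $1 }
     END { mean = sum / n; print "mean:", mean, "stddev:", sqrt(sumsq / n - mean * mean) }'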
Also(!) "google it." A search for apache log file analyzer produces many, many interesting hits.
An Apache log line contains handy information, so one easy way to work with it is to read the log file with a script (in Perl, for example) and, on certain values, increase counters or build hashes/arrays. However, you need to be sure what format the logs actually provide (and if you change it, you may have to change your script accordingly). Perl was my language of choice due to its good documentation and very good performance on text data.
Myself, I had to deal with something similar, but the traffic became too high to handle in simple one-liners (and the various kinds of statistics made it hard as well), so I decided to read the log files on the fly (the Perl equivalent of 'tail -f'), chunk down each part of the log line, and put it in a database. Now I can compute all kinds of statistics with simple SQL queries, and I usually have the statistics I want within a few minutes at most. It's most flexible.
I have simple queries calculating average hits per hour for the same weekday and hour, based on the data of the past X months. The database now contains 700 million rows of log lines, collected over the course of a few years, so that should count as quite some data; with it I can get some interesting trending data.
Hello everyone, currently I am using the following method for doing this. If there is any better way, kindly let me know.
# MINUTES and LOGFILE are set elsewhere; foo-command stands for the per-minute processing step
for COUNT in $(seq 0 "${MINUTES}")
do
    # Timestamp as it appears in the log, e.g. 25/Dec/2024:13:05
    ZERODATE=$(date +"%d/%b/%Y:%H:%M" -d "+ ${COUNT} minute")
    # The slashes and colon in ZERODATE would break the /tmp path, so flatten them
    OUTFILE="/tmp/$(echo "${ZERODATE}" | tr '/:' '--').log"
    grep "${ZERODATE}" "${LOGFILE}" | tee "${OUTFILE}"
    foo-command "${OUTFILE}"
    sleep 1
done
regards
So, for each day's metrics you'd have to iterate 1440 times over the same logfile? That hardly sounds efficient; 60 iterations per hour is not much better. I'd rather iterate once over the file and do all the computations in that single iteration.
Below is something along the same lines; more efficient, but not a best practice either:
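A sketch of that idea, assuming the bracketed timestamp sits in field 4 of a combined-format access.log:

# One pass to extract the minute of each hit; sort | uniq -c then does the
# counting. Far cheaper than 1440 greps, though the sort is still avoidable.
awk '{ print substr($4, 2, 17) }' access.log | sort | uniq -c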
Another thing that I saw done, which was rather interesting (and with a different sort of log ...) was to take random samples from the log file and analyze them. IIRC, you could configure upper and lower bounds for: the starting position, the number of records to be sampled, how many to select from each range, and how many samples to take. With surprisingly small amounts of data, very useful statistics could be obtained on-the-fly without impeding the other processes which were continuously writing to it.
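For a rough cut at that sampling idea with standard tools, GNU shuf can pull random lines; the sample size and access.log are stand-ins:

# Take 1000 random lines from the log and count per-minute hits in the sample.
shuf -n 1000 access.log |
awk '{ count[substr($4, 2, 17)]++ } END { for (m in count) print count[m], m }' |
sort -rn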