Logic to count per-minute hits in a web-server log.
Hello friends,
I am trying to figure out some simple logic to pull per-minute traffic counts for a website out of its access logs. I don't want anyone to write the code for me; any simple math that I can use here would be great. In the meantime I will try to scribble some awk for this.
The best approach is to treat this as a statistical project: gather after-the-fact log files and treat them strictly as a data set.
You will find that existing stats packages ... SAS®, SPSS®, and (open source ...) "R" ... are ready to help you.
The raw data set provides you with important values, such as by-the-second timestamps in various well-known binary or ASCII formats. You can then perform various summary statistics against these data, having grouped them by various mathematical functions applied to them. (It may sound rather strange right now, to many of you, when I say that SQL's GROUP BY capability is, in fact, a simplification(!) of this idea.)
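For instance, here is a minimal illustration of "grouping by a function of the timestamp", assuming GNU date; the sample timestamp is a made-up stand-in:

# Integer-dividing the epoch seconds by 60 maps every hit that falls in
# the same clock minute to the same bucket number -- in effect, a GROUP BY key.
epoch=$(date -d "25 Dec 2024 13:05:59" +%s)
echo $(( epoch / 60 ))    # the minute bucket this hit belongs to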
Dear cliffordw, your suggestion works well, but it only gives me a count. I would also like to separate out certain bad requests on the servers,
for example:
how many requests were HTTP 200 responses and how many were 500 in a particular minute.
Now the question here is which approach would be the optimal one:
trying to achieve both in a single instance (getting the result from one complex-looking awk query), or
doing the same by executing two different jobs (I mean running two different commands and processing the same file twice) on the same file.
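A single pass can cover both at once. Here is a sketch, assuming the common combined log format (bracketed timestamp in field 4, status code in field 9) and a stand-in filename access.log:

awk '{
    minute = substr($4, 2, 17)           # e.g. 25/Dec/2024:13:05
    hits[minute]++                       # total hits in that minute
    if ($9 == 200) ok[minute]++          # HTTP 200 responses
    else if ($9 == 500) err[minute]++    # HTTP 500 responses
}
END {
    for (m in hits)
        printf "%s total=%d 200=%d 500=%d\n", m, hits[m], ok[m], err[m]
}' access.log

Reading the file once and keeping separate counters avoids the second pass entirely; pipe the output through sort if you want a stable ordering.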
Quote:
The best approach is to treat this as a statistical project ... SQL's GROUP BY capability is, in fact, a simplification(!) of this idea.
Are you suggesting putting things into SQL?
To be honest this is kind of a statistical project; I am doing it for my own analysis.
I will try to find the readily available packages, though I don't know if I really have the authority to use them on prod servers. On the other hand I have the freedom to make use of awk and grep ;-) which is what I am trying to leverage here.
No, I'm not suggesting the use of SQL, per se, unless it is otherwise indicated. What I was referring to is simply the notion of grouping, and pointing out that SQL provides that capability albeit in a limited form.
Lots of companies wind up abusing Microsoft Excel and/or sometimes Microsoft Access (SQL ...) for this sort of thing. "R" is a curious, entirely open-source, and very powerful "true statistics package."
You're analyzing server log-files after the fact, taking time-stamps and rolling them up to form groups. Then, you're interested in statistical measures, particularly the mean (average ...) and the standard deviation.
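For instance, here is a sketch of those two measures over per-minute hit counts, again assuming a combined-format access.log:

# The first awk emits the per-minute hit counts; the second computes the
# mean and (population) standard deviation of those counts.
awk '{ count[substr($4, 2, 17)]++ } END { for (m in count) print count[m] }' access.log |
awk '{ n++; sum += $1; sumsq += $1 * $1 }
     END { mean = sum / n; print "mean:", mean, "stddev:", sqrt(sumsq / n - mean * mean) }'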
Also(!) "google it." A search for apache log file analyzer produces many, many interesting hits.
An Apache log line contains handy information, so one easy way to work with it is to read the log file with a script (in Perl, for example) and, on certain values, increase counters or build hashes/arrays. However, you need to be sure what format the logs actually provide (and if you change it, you may have to change your script accordingly). Perl was my language of choice due to its good documentation and very good performance on text data.
Myself, I had to deal with something similar, but the traffic became too high to handle in simple one-liners (and the various kinds of statistics made it hard as well), so I decided to read the log files on the fly (the Perl equivalent of 'tail -f'), chunk down each part of the log line, and put it in a database. Now I can compute all kinds of statistics with simple SQL queries, and I usually have the statistics I want within a few minutes at most. It's most flexible.
I have simple queries calculating average hits per hour for the same weekday and hour, based on the data of the past X months. The database now contains 700 million rows of log lines, collected over the course of a few years, so that should count as quite some data; with it I can get some interesting trending data.
Hello everyone, currently I am using the following method for doing this. If there is any better way, kindly let me know.
# MINUTES and LOGFILE are set elsewhere; foo-command stands for the per-minute processing step
for COUNT in $(seq 0 "${MINUTES}")
do
    # Timestamp as it appears in the log, e.g. 25/Dec/2024:13:05
    ZERODATE=$(date +"%d/%b/%Y:%H:%M" -d "+ ${COUNT} minute")
    # The slashes and colon in ZERODATE would break the /tmp path, so flatten them
    OUTFILE="/tmp/$(echo "${ZERODATE}" | tr '/:' '--').log"
    grep "${ZERODATE}" "${LOGFILE}" | tee "${OUTFILE}"
    foo-command "${OUTFILE}"
    sleep 1
done
regards
So, for each day's metrics you'd have to iterate 1440 times over the same logfile? That hardly sounds efficient; 60 iterations per hour is not much better. I'd rather iterate once over the file and do all the computations in that single iteration.
Below is something along the same lines; more efficient, but not a best practice either:
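A sketch of that idea, assuming the bracketed timestamp sits in field 4 of a combined-format access.log:

# One pass to extract the minute of each hit; sort | uniq -c then does the
# counting. Far cheaper than 1440 greps, though the sort is still avoidable.
awk '{ print substr($4, 2, 17) }' access.log | sort | uniq -c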
Another thing that I saw done, which was rather interesting (and with a different sort of log ...) was to take random samples from the log file and analyze them. IIRC, you could configure upper and lower bounds for: the starting position, the number of records to be sampled, how many to select from each range, and how many samples to take. With surprisingly small amounts of data, very useful statistics could be obtained on-the-fly without impeding the other processes which were continuously writing to it.
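For a rough cut at that sampling idea with standard tools, GNU shuf can pull random lines; the sample size and access.log are stand-ins:

# Take 1000 random lines from the log and count per-minute hits in the sample.
shuf -n 1000 access.log |
awk '{ count[substr($4, 2, 17)]++ } END { for (m in count) print count[m], m }' |
sort -rn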