median average
I just created a thread about adding pipe to a little perl-script that calculated the median value.
But I want a bit more than that. I would like to have a script that throws away the highest 20% and the lowest 20% of the values and takes the average of what remains. I don't know if 20% is a good value, but that's another subject. It's a sort of median/avg hybrid.
What is the neatest way to do this? I'm thinking of awk / perl.
I often use 'sort | uniq -c | sort -n' on a log file to find out what's happening most often...
Code:
# grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n |
20% of what ? The average ?
|
If the file is 100 lines long it should discard the first 20 lines and the last 20 lines.
|
And the average is the average number of times a host is connected ?
|
I want to use it in a generic way,
but yes in this example it's the average number of times a host is connected. This does it in an ugly way.... Quote:
It's only there to generate the data Quote:
In another thread I'm given a perl-script that does the "median" Quote:
|
avg.pl:
Code:
#!/usr/bin/perl
Code:
./avg.pl maillog.txt |
median.pl
Code:
#!/usr/bin/perl
Code:
./median.pl maillog.txt |
If the sample you posted is representative you should look at the last two entries given that they constitute more than 99% of all connections.
OK |
Thanks...
I would really like to use this code in a more generic way. I've tried to use it as the base for such a script, but my knowledge of Perl is simply insufficient. I do learn a lot from the examples I'm given.
As far as I can gather, this code more or less expects a "Connected". The whole IP thing is just an example. Sometimes I have
Code:
"grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n"
Code:
"cat maillog.txt | grep 'Helo invalid' | egrep -o '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | sort | uniq -c | sort -n"
Code:
medianavg 43 45 76 76 76 345
In the end it should just return the average value, ignoring the highest 20% and the lowest 20% of the values. I promise not to be lazy and ask for code every time. |
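A minimal sketch of what a `medianavg` command taking its values as arguments could look like. This is not code from the thread: the function body is an assumption, it handles integers only, the 20% is hard-coded, and the number of values trimmed from each end is rounded down.

```shell
#!/bin/sh
# medianavg: hypothetical implementation (only the name comes from the post above).
# Sorts its integer arguments, drops the lowest and highest 20% (rounded down),
# and prints the integer average of the values that remain.
medianavg() {
    [ "$#" -gt 0 ] || return 1              # nothing to average
    # Re-load the positional parameters in numeric order.
    set -- $(printf '%s\n' "$@" | sort -n)
    cut=$(( $# * 20 / 100 ))                # values to drop on each side
    keep=$(( $# - 2 * cut ))                # values left in the middle
    shift "$cut"                            # discard the lowest values
    sum=0
    i=0
    while [ "$i" -lt "$keep" ]; do          # add up only the middle values
        sum=$(( sum + $1 ))
        shift
        i=$(( i + 1 ))
    done
    echo $(( sum / keep ))
}
```

With the six values from the post, one value is dropped from each end (43 and 345) and the remaining 45, 76, 76, 76 average to 68 after integer truncation.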
For the Connected match, you want to just match the IPs, no ?
(whether it's Connected or Helo invalid, it seems you get IPs in both cases)
If so, replace:
Code:
next unless /Connected: (\d+\.\d+\.\d+\.\d+)/;
with:
Code:
next unless /(\d+\.\d+\.\d+\.\d+)/; |
@cedrik, no, no
We have a misunderstanding here... I don't want code for parsing this maillog in particular. I want a function that I can use in any situation, so "IP" or "Connected" has nothing to do with it...
Now that I come to think of it: I'm so focused on solving things in bash, but maybe it would be even better to include the "sort | uniq -c | sort -n" inside the Perl code too... But that's another step I don't want to take right now.
So the input of the code is columnar data in which only the first column is of any relevance. This column is numerical. This would have been sufficient if I only needed the plain average:
Code:
.... | awk '{avg += $1}END{printf "%d\n", avg/NR}'
It has to discard a portion of the deviant values, which IMHO often gives a more representative average. This can be done by sorting all the values, cutting a portion off both ends and taking the average of what is left.
You already did the real work (and my thanks for your time). I just don't know how to turn your code into a replacement for that awk one-liner.
BTW... this is fine (no need for the 3rd option) Quote:
|
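The sort, cut and average steps described above could be sketched as a drop-in replacement for the awk one-liner. The name `trimavg` and its optional percentage argument are assumptions for illustration, not something from the thread. Like the one-liner, it reads only the first column and prints a truncated integer (`%d`).

```shell
#!/bin/sh
# trimavg: hypothetical helper name, not from the thread.
# Reads lines on stdin and uses only the first column (assumed numeric).
# Drops the lowest and highest PCT percent of the values (default 20,
# rounded down) and prints the average of the rest.
trimavg() {
    pct=${1:-20}
    sort -n | awk -v pct="$pct" '
        { v[NR] = $1 }                       # remember column 1 in sorted order
        END {
            cut = int(NR * pct / 100)        # values to drop on each side
            n = NR - 2 * cut                 # values left in the middle
            if (n <= 0) exit 1               # empty input or pct too large
            for (i = cut + 1; i <= NR - cut; i++) sum += v[i]
            printf "%d\n", sum / n
        }'
}
```

Usage would then look like `.... | sort | uniq -c | trimavg`, or `.... | trimavg 10` to trim 10% from each end instead. The function sorts numerically itself, so the trailing `sort -n` of the original pipeline is not needed.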
Quote:
(they uniq, count, sort...) |
I think it's because I'm calling the script from bash.
I really don't know enough to convert it. |
I have skimmed the thread, but unless I missed something (entirely possible), doesn't your solution just add a few lines to bash?
For instance: Code:
lineCount=$( grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | wc -l ) Code:
lineCount=$( grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | wc -l )
EDIT: The head-tail approach is flawed... I will correct it momentarily...
EDIT2: I believe the head-tail approach is fixed now.
EDIT3: Well, another problem: the original lineCount counted all the lines in the log rather than the lines coming out of the processed pipeline (which includes a text-modifying 'uniq'). So the pipeline command is now used to compute lineCount, so that it is correct. |
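The code blocks above are cut off after their first line, so the following is only a sketch of how a head-tail approach could work, not the poster's actual commands. It assumes the input is already sorted numerically, and the function name `trim_head_tail` is made up. The line count is taken from the same processed stream that gets trimmed, which is the point made in EDIT3.

```shell
#!/bin/sh
# trim_head_tail: hypothetical name; a guess at the head/tail idea.
# Reads numerically sorted lines on stdin and prints them without
# the first 20% and the last 20% (both rounded down).
trim_head_tail() {
    data=$(cat)                              # buffer the processed pipeline output
    [ -n "$data" ] || return 1               # nothing to trim
    lineCount=$(printf '%s\n' "$data" | wc -l)
    cut=$(( lineCount * 20 / 100 ))          # lines to drop on each side
    keep=$(( lineCount - 2 * cut ))          # lines left in the middle
    # Skip the first $cut lines, then keep only the next $keep lines.
    printf '%s\n' "$data" | tail -n +"$(( cut + 1 ))" | head -n "$keep"
}
```

The trimmed lines can then be fed to the plain-average one-liner from earlier in the thread, for example `.... | sort | uniq -c | sort -n | trim_head_tail | awk '{avg += $1}END{printf "%d\n", avg/NR}'`.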