Programming
This forum is for all programming questions. The question does not have to be directly related to Linux and any language is fair game.
I just created a thread about adding a pipe to a little Perl script that calculated the median value.
But I want a bit more than that.
I would like to have a script that trashes the highest 20% and lowest 20% of the values and averages what remains.
I don't know if 20% is a good value, but that's another subject.
It's a sort of median/average hybrid.
What is the neatest way to do this?
I'm thinking of awk / Perl.
Often I use 'sort | uniq -c | sort -n' on a log file to find out what's happening most often...
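For context, here is a tiny made-up run of that pipeline (the input lines are invented): it groups identical lines, prefixes each with its count, and orders them by frequency.

```shell
# Toy input: each line stands in for a log entry.
# The first 'sort' groups identical lines so 'uniq -c' can collapse them
# and prefix each with its count; 'sort -n' then orders by that count.
printf '%s\n' alpha beta alpha gamma alpha beta | sort | uniq -c | sort -n
# the most frequent line ('alpha', count 3) ends up last
```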
#!/usr/bin/perl
use strict;
use warnings;

# hash of IPs
# $ips{ip number} = sum of connections
my %ips;
while (<>) {
    next unless /Connected: (\d+\.\d+\.\d+\.\d+)/;
    $ips{$1}++;
}

# count of IPs
my $total = keys %ips;
# 20% of total
my $x = int(20 * $total / 100);
# sorted list of connection sums minus the highest 20% and lowest 20%
# (the slice ends at $total - $x - 1 because indices are 0-based)
my @list = (sort { $a <=> $b } values %ips)[$x .. ($total - $x - 1)];
# recompute the count from this list
$total = @list;
die "no values left after trimming\n" unless $total;
# sum of connection sums
my $sum = 0;
$sum += $_ for @list;
# connections average
my $avg = $sum / $total;
print "$avg\n";
Thanks...
I would really like to use this code in a more generic way.
I've tried to use that code as a base for such a script, but my knowledge of Perl is simply insufficient.
I do learn a lot from the examples I'm given.
As far as I can gather, this code more or less expects a "Connected" line.
The whole IP thing is just an example.
The code should just extract the numerical value of the first column, ignoring non-numerical values.
In the end it should return the average value, ignoring the highest 20% and the lowest 20% of the values.
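As a sketch of that generic behaviour (numeric first column only, trim 20% from each end, average the rest), an awk/sort pipeline could look like this; the hard-coded 20 is an assumption and would become a parameter in a real version:

```shell
# Sketch: trimmed mean of column 1, reading stdin.
# Assumptions: trim fraction fixed at 20%; non-numeric first columns are ignored.
awk '$1 ~ /^-?[0-9]+([.][0-9]+)?$/ { print $1 }' |
  sort -n |
  awk '{ v[NR] = $1 }
       END {
         trim = int(NR * 20 / 100)            # values to drop at each end
         for (i = trim + 1; i <= NR - trim; i++) { sum += v[i]; n++ }
         if (n) print sum / n                 # average of the middle 60%
       }'
```

For example, feeding it the numbers 1 through 10 drops 1, 2, 9, and 10 and averages the rest.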
I promise not to be lazy and ask for code every time.
So the input of the code is columned data in which only the first column is of any relevance. This column is numerical.
This would have been sufficient if I only needed the plain average:
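The one-liner itself didn't make it into the post; a typical plain-average-of-column-1 awk one-liner (my guess, not necessarily the poster's exact command) looks like this, shown here on made-up input:

```shell
# Hypothetical stand-in for the missing one-liner: plain average of column 1
printf '%s\n' 4 8 15 16 23 42 | awk '{ sum += $1 } END { if (NR) print sum / NR }'
# prints 18
```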
Now I want a replacement for the awk one-liner.
It has to discard a portion of the deviant values, which IMHO often gives a more representative average.
This can be done by sorting all the values, cutting a portion of both sides and take the average of that.
You already did the real work (and my thanks for your time).
I just don't know how to turn your code into a replacement of that awk oneliner.
@cedrik, no, no
We have a misunderstanding here...
I don't want code for parsing this maillog in particular. I want a function that I can use in any situation.
So, "IP" or "Connected" has nothing to do with it...
Now that I come to think of it:
I'm so focused on solving things in bash, but maybe it's even better to include the "sort | uniq -c | sort -n" inside the Perl code too...
But that's another step I don't want to take right now...
But this is what my perl examples do...
(they uniq, count, sort...)
I have skimmed the thread, but unless I missed something (entirely possible), doesn't your solution just amount to adding a few lines of bash?
For instance:
Code:
lineCount=$( grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | wc -l )
linePct=20 #Percentage -- must specify as whole number (e.g. NOT 0.2)
let numFilteredLines=(lineCount*linePct)/100
let headLines=lineCount-numFilteredLines
let tailLines=headLines-numFilteredLines
grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n | head -n ${headLines} | tail -n ${tailLines}
Or, using sed (instead of head-tail):
Code:
lineCount=$( grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | wc -l )
linePct=20 #Percentage -- must specify as whole number (e.g. NOT 0.2)
let numFilteredLines=(lineCount*linePct)/100
let lineStart=numFilteredLines+1 #sed addresses are 1-based, so start one past the trimmed block
let lineStop=lineCount-numFilteredLines
grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n | sed -n "${lineStart},${lineStop} p"
Caveat: I have not tested either, but the concepts are sound... provided I understood the problem.
EDIT:
The head-tail approach is flawed... I will correct it momentarily...
EDIT2:
I believe the head-tail approach is fixed now.
EDIT3:
Well, another problem: the original lineCount counted all the lines in the raw log, whereas it needs to count the lines coming out of the processed pipeline (which includes a line-collapsing 'uniq -c'). The pipeline command is now used to compute lineCount so that it is correct.
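Note that these pipelines stop at printing the trimmed lines; to finish with the average the thread started from, the count column (field 1 of the 'uniq -c' output) can be fed through one more awk stage, e.g. (shown on made-up 'uniq -c'-style lines):

```shell
# Sketch: averaging stage appended to the trimmed output; printf stands in
# for the earlier grep|sort|uniq -c|sort -n|... pipeline
printf '%s\n' '      2 10.0.0.1' '      3 10.0.0.2' '      7 10.0.0.3' |
  awk '{ sum += $1 } END { if (NR) print sum / NR }'
# average of the counts 2, 3, 7 -> 4
```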
Last edited by Dark_Helmet; 02-03-2012 at 11:54 AM.