median average

frater · 02-02-2012, 04:50 AM

I just created a thread about adding pipe support to a little perl-script that calculates the median value.

But I want a bit more than that.
I would like to have a script that discards the highest 20% and lowest 20% of the values and takes the average of the remaining ones.

I don't know if 20% is a good value, but that's another subject.
It's a sort of median/avg hybrid

What is the neatest way to do this?
I'm thinking of awk / perl.

Often I use this 'sort | uniq -c | sort -n' on a log-file to find out what's happening often...

Code:

# grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n
      1 Connected: 109.169.179.63
      1 Connected: 109.236.86.189
      1 Connected: 119.73.215.163
      1 Connected: 174.133.45.124
      1 Connected: 182.178.139.75
      1 Connected: 186.1.119.12
      1 Connected: 195.173.77.133
      1 Connected: 46.246.131.84
      1 Connected: 62.149.157.187
      1 Connected: 64.131.89.11
      1 Connected: 67.19.250.154
      1 Connected: 74.205.124.9
      1 Connected: 84.47.241.74
      1 Connected: 85.17.180.231
      1 Connected: 87.233.23.21
      2 Connected: 46.105.97.103
      3 Connected: 216.34.181.88
    352 Connected: 45.103.94.104
    702 Connected: 192.168.10.100

I don't want the 352 and 702 to spoil the average value...

Cedrik · 02-02-2012, 06:56 AM

20% of what ? The average ?

frater · 02-02-2012, 07:18 AM

If the file is 100 lines long it should discard the first 20 lines and the last 20 lines.
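
The rule above comes down to a little integer arithmetic. A minimal sketch, assuming a hypothetical helper called trim_bounds (not part of the thread) that takes the line count and an optional percentage and prints the first and last line numbers that are kept:

```shell
# trim_bounds TOTAL [PCT]: hypothetical helper, PCT defaults to 20.
# Prints the first and last line number that survive the trimming.
trim_bounds() {
    cut=$(( $1 * ${2:-20} / 100 ))          # lines dropped at each end (rounded down)
    echo "$(( cut + 1 )) $(( $1 - cut ))"
}

trim_bounds 100   # prints: 21 80
trim_bounds 19    # prints: 4 16
```

For the 19-line sample in the first post this drops 3 lines at each end, so the 352 and 702 entries fall outside the kept range.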

Cedrik · 02-02-2012, 07:36 AM

And the average is the average number of times a host is connected ?

frater · 02-02-2012, 07:49 AM

I want to use it in a generic way,
but yes in this example it's the average number of times a host is connected.

This does it in an ugly way....

Quote:

ftmp1=`mktemp`
grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n >${ftmp1}
discard=`grep -c '' ${ftmp1} | awk '{printf "%d", $1*.2}'`
avg=`tail -n+$((1 + ${discard})) ${ftmp1} | head -n-${discard} | awk '{avg += $1}END{printf "%d\n", avg/NR}'`

The first 2 lines should not be part of the code.
They are only there to generate the data.

Quote:

# grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n | medianavg
2

The script medianavg (or awk-oneliner) is the thing I'm interested in.

In another thread I'm given a perl-script that does the "median"

Quote:

/usr/local/sbin# cat median
#!/usr/bin/perl

my @nums;

do {
while(<>) {
chomp;
push @nums, split;
}
} unless $#ARGV > -1;

@nums = @ARGV if $#ARGV > -1;
@nums = sort {$a <=> $b} @nums;
my $med = $nums[($#nums / 2)];

print $med . "\n";

Cedrik · 02-02-2012, 08:17 AM

avg.pl:

Code:

#!/usr/bin/perl

# hash of IPs
# $ips{ip number} = sum of connections

my %ips;

while(<>) {
	next unless /Connected: (\d+\.\d+\.\d+\.\d+)/;
	$ips{$1}++;
}

# count of IPs
my $total = keys %ips;

# 20% of total
my $x = int(20 * $total / 100);

# list (sorted) of connection sums minus highest 20% and lowest 20%
# (indices start at 0, so the last kept index is $total - $x - 1)
my @list = (sort {$a <=> $b} values %ips)[$x .. ($total - $x - 1)];

# recompute the count from this list
$total = @list;

# sum of connection sums
my $sum = 0;
$sum += $_ for @list;

# connections average
my $avg = $sum / $total;
print "$avg\n";

Code:

./avg.pl maillog.txt
1

Cedrik · 02-02-2012, 08:20 AM

median.pl

Code:

#!/usr/bin/perl

my %ips;

while(<>) {
	next unless /Connected: (\d+\.\d+\.\d+\.\d+)/;
	$ips{$1}++;
}

# count of IPs
my $total = keys %ips;

# 20% of total
my $x = int(20 * $total / 100);

# list (sorted) of connection sums minus highest 20% and lowest 20%
# (indices start at 0, so the last kept index is $total - $x - 1)
my @list = (sort {$a <=> $b} values %ips)[$x .. ($total - $x - 1)];

my $med = $list[($#list / 2)];

print $med . "\n";

Code:

./median.pl maillog.txt
1

AnanthaP · 02-02-2012, 07:27 PM

If the sample you posted is representative you should look at the last two entries given that they constitute more than 99% of all connections.

OK

frater · 02-03-2012, 02:50 AM

Thanks...
I would really like to use this code in a more generic way.
I've tried to use that code as a base for such a script, but my knowledge of perl is simply insufficient.
I do learn a lot from examples I'm given.

As far as I can gather this code is more or less expecting a "Connected".
The whole IP thing is just an example.

Sometimes I have

Code:

"grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n"

, but another time it's

Code:

"cat maillog.txt | grep 'Helo invalid' | egrep -o '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | sort | uniq -c | sort -n"

I also want to use it like this

Code:

medianavg 43 45 76 76 76 345
medianavg <file>
cat <file> | medianavg

The code should just extract the numerical value of the first column, ignoring non-numerical values.
In the end it should just return the average value, ignoring the highest 20% and the lowest 20% of the values.

I promise not to be lazy and ask for code every time.

Cedrik · 02-03-2012, 04:57 AM

For the Connected match, you want to just match the IPs, no ?
(if Connected or Helo invalid, it seems you get IPs in both cases)

If so, replace:

Code:

next unless /Connected: (\d+\.\d+\.\d+\.\d+)/;

With:

Code:

next unless /(\d+\.\d+\.\d+\.\d+)/;

frater · 02-03-2012, 05:34 AM

@cedrik, no, no

We have a misunderstanding here...
I don't want code for parsing this maillog in particular. I want a function that I can use in any situation.
So, "IP" or "Connected" has nothing to do with it....

Now I come to think of it.
I'm so focused on solving things in bash, but maybe it's even better to include the "sort | uniq -c | sort -n" inside the perl code too....
But that's another step I don't want to take right now...

So the input of the code is columned data in which only the first column is of any relevance. This column is numerical.

This would have been sufficient if I only needed the plain average:

Code:

.... | awk '{avg += $1}END{printf "%d\n", avg/NR}'

Now I want a replacement for the awk one-liner.
It has to discard a portion of the deviant values, which IMHO often gives a more representative average.
This can be done by sorting all the values, cutting a portion off both ends and taking the average of what is left.

You already did the real work (and my thanks for your time).
I just don't know how to turn your code into a replacement of that awk oneliner.

BTW... this is fine (no need for the 3rd option)

Quote:

medianavg <file>
cat <file> | medianavg
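
The replacement described in this post (numeric first column, sort, cut a share off both ends, average the rest, read from a file or from a pipe) could be sketched as a small shell function. This is only a sketch under assumptions: the environment variable TRIM_PCT and the two-decimal output are choices made for illustration, not something the thread specifies.

```shell
# medianavg [file ...]: trimmed mean of the first column.
# With no file argument awk reads standard input, so both
# "medianavg <file>" and "cat <file> | medianavg" work.
# TRIM_PCT (assumed name) is the percentage dropped at EACH end; default 20.
medianavg() {
    awk '$1 ~ /^-?[0-9]+(\.[0-9]+)?$/ { print $1 }' "$@" |   # keep numeric first columns only
    sort -n |
    awk -v pct="${TRIM_PCT:-20}" '
        { v[NR] = $1 }                        # values arrive sorted
        END {
            if (NR == 0) exit 1               # nothing numeric in the input
            cut = int(NR * pct / 100)         # values dropped at each end
            for (i = cut + 1; i <= NR - cut; i++) { sum += v[i]; n++ }
            if (n == 0) exit 1                # percentage too large, nothing left
            printf "%.2f\n", sum / n
        }'
}
```

Because the function sorts its own input, it can follow "uniq -c" directly, e.g. grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | medianavg. Changing the printf format to "%d\n" would give a truncated integer like the original awk one-liner.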

Cedrik · 02-03-2012, 06:18 AM

Quote:

Originally Posted by frater

@cedrik, no, no
We have a misunderstanding here...
I don't want code for parsing this maillog in particular. I want a function that I can use in any situation.
So, "IP" or "Connected" has nothing to do with it....
Now I come to think of it.
I'm so focused on solving things in bash, but maybe it's even better to include the "sort | uniq -c | sort -n" inside the perl code too....
But that's another step I don't want to take right now...

But this is what my perl examples do...
(they uniq, count, sort...)

frater · 02-03-2012, 11:13 AM

I think it's because I'm calling the script from bash.
I really don't know enough to convert it.

Dark_Helmet · 02-03-2012, 11:27 AM

I have skimmed the thread, but unless I missed something (entirely possible), doesn't your solution just add a few lines to bash?

For instance:

Code:

lineCount=$( grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | wc -l )
linePct=20 #Percentage -- must specify as whole number (e.g. NOT 0.2)
# $(( )) is used instead of an unquoted 'let', which bash rejects when the expression contains parentheses
numFilteredLines=$(( lineCount * linePct / 100 ))
headLines=$(( lineCount - numFilteredLines ))
tailLines=$(( headLines - numFilteredLines ))
grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n | head -n ${headLines} | tail -n ${tailLines}

Or, using sed (instead of head-tail):

Code:

lineCount=$( grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | wc -l )
linePct=20 #Percentage -- must specify as whole number (e.g. NOT 0.2)
numFilteredLines=$(( lineCount * linePct / 100 ))
lineStart=$(( numFilteredLines + 1 )) #First line to keep (sed addresses start at 1)
lineStop=$(( lineCount - numFilteredLines ))
grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n | sed -n "${lineStart},${lineStop} p"

Caveat: I have not tested either, but the concepts are sound... provided I understood the problem.

EDIT:
The head-tail approach is flawed... I will correct it momentarily...

EDIT2:
I believe the head-tail approach is fixed now.

EDIT3:
Well, another problem: the original lineCount was counting all the lines in the log versus the processed pipeline (which included a text-modifying 'uniq'). So the pipeline command has been used for the lineCount to be correct.
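
One way to avoid running the pipeline twice just to get the line count would be to let awk buffer the sorted lines and print the middle portion once it knows how many there are. A minimal sketch, assuming a hypothetical filter named trim_lines that takes the percentage as an optional argument:

```shell
# trim_lines [PCT]: hypothetical filter, PCT defaults to 20.
# Reads already-sorted lines on stdin and prints them without the
# first and last PCT percent, keeping each line intact.
trim_lines() {
    awk -v pct="${1:-20}" '
        { line[NR] = $0 }                     # buffer every line
        END {
            cut = int(NR * pct / 100)         # lines dropped at each end
            for (i = cut + 1; i <= NR - cut; i++) print line[i]
        }'
}
```

It would go at the end of the existing pipeline, e.g. ... | sort | uniq -c | sort -n | trim_lines 20, at the cost of holding the whole input in memory.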