LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   median average (https://www.linuxquestions.org/questions/programming-9/median-average-927125/)

frater 02-02-2012 04:50 AM

median average
 
I just created a thread about adding pipe to a little perl-script that calculated the median value.

But I want a bit more than that.
I would like to have a script that discards the highest 20% and the lowest 20% of the values and averages the rest.

I don't know if 20% is a good value, but that's another subject.
It's a sort of median/average hybrid.

What is the neatest way to do this?
I'm thinking of awk / perl.

I often run 'sort | uniq -c | sort -n' on a log file to find out what is happening most often...

Code:

# grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n
      1 Connected: 109.169.179.63
      1 Connected: 109.236.86.189
      1 Connected: 119.73.215.163
      1 Connected: 174.133.45.124
      1 Connected: 182.178.139.75
      1 Connected: 186.1.119.12
      1 Connected: 195.173.77.133
      1 Connected: 46.246.131.84
      1 Connected: 62.149.157.187
      1 Connected: 64.131.89.11
      1 Connected: 67.19.250.154
      1 Connected: 74.205.124.9
      1 Connected: 84.47.241.74
      1 Connected: 85.17.180.231
      1 Connected: 87.233.23.21
      2 Connected: 46.105.97.103
      3 Connected: 216.34.181.88
    352 Connected: 45.103.94.104
    702 Connected: 192.168.10.100

I don't want the 352 and 702 to spoil the average value...

Cedrik 02-02-2012 06:56 AM

20% of what? The average?

frater 02-02-2012 07:18 AM

If the file is 100 lines long it should discard the first 20 lines and the last 20 lines.

Cedrik 02-02-2012 07:36 AM

And the average is the average number of times a host is connected?

frater 02-02-2012 07:49 AM

I want to use it in a generic way,
but yes in this example it's the average number of times a host is connected.

This does it in an ugly way....

Quote:

ftmp1=`mktemp`
grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n >${ftmp1}
discard=`grep -c '' ${ftmp1} | awk '{printf "%d", $1*.2}'`
avg=`tail -n+$((1 + ${discard})) ${ftmp1} | head -n-${discard} | awk '{avg += $1}END{printf "%d\n", avg/NR}'`
The first two lines should not be part of the code;
they're only there to generate the data.
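For what it's worth, the temp-file dance can be collapsed into a single awk pass once the data is sorted. This is only a sketch; the helper name `trimmed_avg` is mine, not from the thread, and the 20% is hard-coded like in the snippet above:

```shell
# trimmed_avg: average of the first column after dropping 20% of the
# lines at each end; expects input already sorted by `sort -n`
# (hypothetical helper name; 20% hard-coded as in the snippet above)
trimmed_avg() {
  awk '{ v[NR] = $1 }                      # remember the count column
       END {
         d = int(NR * 0.2)                 # lines to drop at each end
         for (i = d + 1; i <= NR - d; i++) sum += v[i]
         printf "%d\n", sum / (NR - 2 * d)
       }'
}

# usage, replacing the mktemp/tail/head version:
#   grep -o "Connected: [0-9.]*" maillog.txt | sort | uniq -c | sort -n | trimmed_avg
```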

Quote:

# grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n | medianavg
2
The script medianavg (or awk-oneliner) is the thing I'm interested in.

In another thread I was given a perl script that computes the "median":
Quote:

/usr/local/sbin# cat median
#!/usr/bin/perl

my @nums;

do {
    while (<>) {
        chomp;
        push @nums, split;
    }
} unless $#ARGV > -1;

@nums = @ARGV if $#ARGV > -1;
@nums = sort {$a <=> $b} @nums;
my $med = $nums[($#nums / 2)];

print $med . "\n";

Cedrik 02-02-2012 08:17 AM

avg.pl:
Code:

#!/usr/bin/perl

# hash of IPs
# $ips{ip number} = sum of connections

my %ips;

while(<>) {
        next unless /Connected: (\d+\.\d+\.\d+\.\d+)/;
        $ips{$1}++;
}

# count of IPs
my $total = keys %ips;

# 20% of total
my $x = int(20 * $total / 100);

# list (sorted) of connection sums minus the highest 20% and lowest 20%
my @list = (sort {$a <=> $b} values %ips)[$x .. ($total - $x - 1)];

# recompute the count from this list
$total = @list;

# sum of connection sums
my $sum = 0;
map { $sum+=$_ } @list;

# connections average
my $avg = $sum / $total;
print "$avg\n";

Code:

./avg.pl maillog.txt
1


Cedrik 02-02-2012 08:20 AM

median.pl
Code:

#!/usr/bin/perl

my %ips;

while(<>) {
        next unless /Connected: (\d+\.\d+\.\d+\.\d+)/;
        $ips{$1}++;
}

# count of IPs
my $total = keys %ips;

# 20% of total
my $x = int(20 * $total / 100);

# list (sorted) of connection sums minus the highest 20% and lowest 20%
my @list = (sort {$a <=> $b} values %ips)[$x .. ($total - $x - 1)];

my $med = $list[($#list / 2)];

print $med . "\n";

Code:

./median.pl maillog.txt
1


AnanthaP 02-02-2012 07:27 PM

If the sample you posted is representative, you should look at the last two entries, given that they constitute more than 99% of all connections.

OK

frater 02-03-2012 02:50 AM

Thanks...
I would really like to use this code in a more generic way.
I've tried to use that code as a base for such a script, but my knowledge of perl is simply insufficient.
I do learn a lot from examples I'm given.

As far as I can gather this code is more or less expecting a "Connected".
The whole IP thing is just an example.

Sometimes I have
Code:

"grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n"
, but other times it's
Code:

"cat maillog.txt | grep 'Helo invalid' | egrep -o '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | sort | uniq -c | sort -n"
I also want to use it like this

Code:

medianavg 43 45 76 76 76 345
medianavg <file>
cat <file> | medianavg

The code should just extract the numerical value of the first column, ignoring non-numerical values.
In the end it should return the average after discarding the highest 20% and the lowest 20% of the values.

I promise not to be lazy and ask for code every time.
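One possible shape for such a medianavg is sketched below. It is only a guess at the wanted interface: the arguments-vs-stdin switch checks whether the first argument is an existing file, the 20% is hard-coded, and nothing here is tested beyond the examples in the comments:

```shell
#!/bin/sh
# medianavg: trimmed mean of the first column of its input.
# Accepts numbers as arguments, a file argument, or lines on stdin;
# non-numeric first columns are ignored; 20% is trimmed at each end.
medianavg() {
  { if [ $# -gt 0 ] && [ ! -e "$1" ]; then
      printf '%s\n' "$@"                  # numbers given as arguments
    else
      cat "$@"                            # file argument(s), or stdin if none
    fi
  } |
  awk '$1 + 0 == $1 { print $1 + 0 }' |   # keep numeric first fields only
  sort -n |
  awk '{ v[NR] = $1 }
       END {
         d = int(NR * 0.2)                # values to drop at each end
         for (i = d + 1; i <= NR - d; i++) sum += v[i]
         printf "%d\n", sum / (NR - 2 * d)
       }'
}

# medianavg 43 45 76 76 76 345    -> 68 (average of 45 76 76 76)
# some-pipeline | medianavg       -> trimmed average of column 1
```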

Cedrik 02-03-2012 04:57 AM

For the Connected match, you want to just match the IPs, no?
(if Connected or Helo invalid, it seems you get IPs in both cases)

If so, replace:
Code:

next unless /Connected: (\d+\.\d+\.\d+\.\d+)/;
With:
Code:

next unless /(\d+\.\d+\.\d+\.\d+)/;

frater 02-03-2012 05:34 AM

@cedrik, no, no

We have a misunderstanding here...
I don't want code for parsing this maillog in particular. I want a function that I can use in any situation.
So, "IP" or "Connected" has nothing to do with it....

Now that I come to think of it,
I'm so focused on solving things in bash, but maybe it would be even better to include the "sort | uniq -c | sort -n" inside the perl code too....
But that's another step I don't want to take right now...

So the input of the code is columned data in which only the first column is of any relevance. This column is numerical.

This would have been sufficient if I only needed the plain average:
Code:

.... | awk '{avg += $1}END{printf "%d\n", avg/NR}'
Now I want a replacement for that awk one-liner.
It has to discard a portion of the deviant values, which IMHO often gives a more representative average.
This can be done by sorting all the values, cutting a portion off both ends, and taking the average of what's left.

You already did the real work (and my thanks for your time).
I just don't know how to turn your code into a replacement of that awk oneliner.

BTW... this is fine (no need for the 3rd option)
Quote:

medianavg <file>
cat <file> | medianavg

Cedrik 02-03-2012 06:18 AM

Quote:

Originally Posted by frater (Post 4592744)
@cedrik, no, no
We have a misunderstanding here...
I don't want code for parsing this maillog in particular. I want a function that I can use in any situation.
So, "IP" or "Connected" has nothing to do with it....
Now I come to think of it.
I'm so focused on solving things in bash, but maybe it's even better to include the "sort | uniq -c | sort -n" inside the perl code too....
But that's another step I don't want to take right now...

But this is what my perl examples do...
(they uniq, count, sort...)

frater 02-03-2012 11:13 AM

I think it's because I'm calling the script from bash.
I really don't know enough to convert it.

Dark_Helmet 02-03-2012 11:27 AM

I have skimmed the thread, but unless I missed something (entirely possible), doesn't your solution just need a few more lines of bash?

For instance:
Code:

lineCount=$( grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | wc -l )
linePct=20 #Percentage -- must specify as whole number (e.g. NOT 0.2)
let numFilteredLines=(lineCount*linePct)/100
let headLines=lineCount-numFilteredLines
let tailLines=headLines-numFilteredLines
grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n | head -n ${headLines} | tail -n ${tailLines}

Or, using sed (instead of head-tail):
Code:

lineCount=$( grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | wc -l )
linePct=20 #Percentage -- must specify as whole number (e.g. NOT 0.2)
let lineStart=(lineCount*linePct)/100
let lineStop=lineCount-lineStart
grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n | sed -n "$(( lineStart + 1 )),${lineStop} p"

Caveat: I have not tested either, but the concepts are sound... provided I understood the problem.

EDIT:
The head-tail approach is flawed... I will correct it momentarily...

EDIT2:
I believe the head-tail approach is fixed now.

EDIT3:
Well, another problem: the original lineCount counted all the lines in the log instead of the output of the processed pipeline (which includes a line-collapsing 'uniq'). The pipeline command is now used for lineCount so that the count is correct.
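Put together with the averaging step, the sed variant might end up as something like this sketch. The wrapper name is mine, the grep pattern is the one from the thread, and the only change to the arithmetic is starting the sed range one line past the drop count:

```shell
# medianavg_pipeline FILE: count 'Connected:' IPs in FILE, drop 20% of
# the sorted count lines at each end, then average the remaining counts
# (hypothetical wrapper combining the sed variant with an awk average)
medianavg_pipeline() {
  counts=$(grep -o 'Connected: [0-9.]*' "$1" | sort | uniq -c | sort -n)
  lineCount=$(printf '%s\n' "$counts" | wc -l)
  linePct=20                                  # whole-number percentage
  lineStart=$(( lineCount * linePct / 100 ))  # lines to drop at each end
  lineStop=$(( lineCount - lineStart ))
  printf '%s\n' "$counts" |
    sed -n "$(( lineStart + 1 )),${lineStop} p" |
    awk '{ sum += $1 } END { printf "%d\n", sum / NR }'
}
```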

