LinuxQuestions.org
Old 02-02-2012, 04:50 AM   #1
frater
Member
 
Registered: Jul 2008
Posts: 121

Rep: Reputation: 23
median average


I just created a thread about adding pipe support to a little perl-script that calculates the median value.

But I want a bit more than that.
I would like to have a script that discards the highest 20% and the lowest 20% of the values and takes the average of the values that remain.

I don't know if 20% is a good value, but that's another subject.
It's a sort of median/avg hybrid

What is the neatest way to do this?
I'm thinking of awk / perl.

Often I use this 'sort | uniq -c | sort -n' on a log-file to find out what happens most often...

Code:
# grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n
      1 Connected: 109.169.179.63
      1 Connected: 109.236.86.189
      1 Connected: 119.73.215.163
      1 Connected: 174.133.45.124
      1 Connected: 182.178.139.75
      1 Connected: 186.1.119.12
      1 Connected: 195.173.77.133
      1 Connected: 46.246.131.84
      1 Connected: 62.149.157.187
      1 Connected: 64.131.89.11
      1 Connected: 67.19.250.154
      1 Connected: 74.205.124.9
      1 Connected: 84.47.241.74
      1 Connected: 85.17.180.231
      1 Connected: 87.233.23.21
      2 Connected: 46.105.97.103
      3 Connected: 216.34.181.88
    352 Connected: 45.103.94.104
    702 Connected: 192.168.10.100
I don't want the 352 and 702 to spoil the average value...
 
Old 02-02-2012, 06:56 AM   #2
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 244
20% of what ? The average ?
 
Old 02-02-2012, 07:18 AM   #3
frater
Member
 
Registered: Jul 2008
Posts: 121

Original Poster
Rep: Reputation: 23
If the file is 100 lines long it should discard the first 20 lines and the last 20 lines.
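To make the rounding concrete when the line count is not a round number, here is a minimal sketch. It assumes integer truncation is acceptable, and `trim_count` is only a made-up helper name, not an existing tool.

```shell
# trim_count: hypothetical helper; how many lines to drop from EACH end.
# $1 = total number of lines, $2 = percentage per side (default 20).
trim_count() {
    echo $(( $1 * ${2:-20} / 100 ))
}

trim_count 100      # prints 20
trim_count 19       # prints 3 (19 * 20 / 100 = 3.8, truncated)
```

So for the 19 lines in the sample from post #1, 3 lines go at each end and 13 remain.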
 
Old 02-02-2012, 07:36 AM   #4
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 244
And the average is the average number of times a host is connected ?
 
Old 02-02-2012, 07:49 AM   #5
frater
Member
 
Registered: Jul 2008
Posts: 121

Original Poster
Rep: Reputation: 23
I want to use it in a generic way,
but yes in this example it's the average number of times a host is connected.

This does it in an ugly way....

Quote:
ftmp1=`mktemp`
grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n >${ftmp1}
discard=`grep -c '' ${ftmp1} | awk '{printf "%d", $1*.2}'`
avg=`tail -n+$((1 + ${discard})) ${ftmp1} | head -n-${discard} | awk '{avg += $1}END{printf "%d\n", avg/NR}'`
The first two lines are not really part of the code.
They are only there to generate the data.

Quote:
# grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n | medianavg
2
The script medianavg (or awk-oneliner) is the thing I'm interested in.

In another thread I was given a perl-script that does the "median":
Quote:
/usr/local/sbin# cat median
#!/usr/bin/perl

my @nums;

do {
	while(<>) {
		chomp;
		push @nums, split;
	}
} unless $#ARGV > -1;

@nums = @ARGV if $#ARGV > -1;
@nums = sort {$a <=> $b} @nums;
my $med = $nums[($#nums / 2)];

print $med . "\n";

Last edited by frater; 02-02-2012 at 07:56 AM.
 
Old 02-02-2012, 08:17 AM   #6
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 244
avg.pl:
Code:
#!/usr/bin/perl

# hash of IPs
# $ips{ip number} = sum of connections

my %ips;

while(<>) {
	next unless /Connected: (\d+\.\d+\.\d+\.\d+)/;
	$ips{$1}++;
}

# count of IPs
my $total = keys %ips;

# 20% of total
my $x = int(20 * $total / 100);

# list (sorted) of connection sums minus highest 20% and lowest 20%
# (indices start at 0, so the last index kept is $total - $x - 1)
my @list = (sort {$a <=> $b} values %ips)[$x .. ($total - $x - 1)];

# recompute the count from this list
$total = @list;

# sum of connection sums
my $sum = 0;
map { $sum+=$_ } @list;

# connections average
my $avg = $sum / $total;
print "$avg\n";
Code:
./avg.pl maillog.txt
1

Last edited by Cedrik; 02-02-2012 at 08:36 AM.
 
Old 02-02-2012, 08:20 AM   #7
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 244
median.pl
Code:
#!/usr/bin/perl

my %ips;

while(<>) {
	next unless /Connected: (\d+\.\d+\.\d+\.\d+)/;
	$ips{$1}++;
}

# count of IPs
my $total = keys %ips;

# 20% of total
my $x = int(20 * $total / 100);

# list (sorted) of connection sums minus highest 20% and lowest 20%
# (indices start at 0, so the last index kept is $total - $x - 1)
my @list = (sort {$a <=> $b} values %ips)[$x .. ($total - $x - 1)];

my $med = $list[($#list / 2)];

print $med . "\n";
Code:
./median.pl maillog.txt
1

Last edited by Cedrik; 02-02-2012 at 08:37 AM.
 
Old 02-02-2012, 07:27 PM   #8
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Posts: 952

Rep: Reputation: 217
If the sample you posted is representative you should look at the last two entries given that they constitute more than 99% of all connections.

OK
 
Old 02-03-2012, 02:50 AM   #9
frater
Member
 
Registered: Jul 2008
Posts: 121

Original Poster
Rep: Reputation: 23
Thanks...
I would really like to use this code in a more generic way.
I've tried to use that code as a base for such a script, but my knowledge of perl is simply insufficient.
I do learn a lot from examples I'm given.

As far as I can gather this code is more or less expecting a "Connected".
The whole IP thing is just an example.

Sometimes I have
Code:
"grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n"
, but another time it's
Code:
"cat maillog.txt | grep 'Helo invalid' | egrep -o '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | sort | uniq -c | sort -n"
I also want to use it like this

Code:
medianavg 43 45 76 76 76 345
medianavg <file>
cat <file> | medianavg
The code should just extract the numerical value of the first column, ignoring non-numerical values.
In the end it should just return the average value, ignoring the highest 20% and the lowest 20% of the values.

I promise not to be lazy and ask for code every time.
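The input side of that wish list could be sketched roughly as below. This is only an illustration under assumptions: `numcol` is a made-up name, and "numerical" is taken to mean an optionally signed integer or decimal in the first column. It covers the three calling styles (values as arguments, a file name, or a pipe) and drops every line whose first field is not a number.

```shell
# numcol: hypothetical helper; prints the numeric first field of each input line.
# Values may come from arguments, from named files, or from stdin.
numcol() {
    if [ "$#" -gt 0 ] && [ ! -f "$1" ]; then
        printf '%s\n' "$@"        # arguments that are not files are the values themselves
    else
        cat -- "$@"               # named files, or stdin when no arguments are given
    fi | awk '$1 ~ /^-?[0-9]+([.][0-9]+)?$/ { print $1 }'
}
```

Its output is one number per line, so it can be piped into `sort -n` and whatever averaging step is used afterwards. Note the ambiguity in the first branch: an argument that happens to name an existing file is read as a file, not as a value.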
 
Old 02-03-2012, 04:57 AM   #10
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 244
For the Connected match, you want to just match the IPs, no ?
(if Connected or Helo invalid, it seems you get IPs in both cases)

If so, replace:
Code:
next unless /Connected: (\d+\.\d+\.\d+\.\d+)/;
With:
Code:
next unless /(\d+\.\d+\.\d+\.\d+)/;
 
Old 02-03-2012, 05:34 AM   #11
frater
Member
 
Registered: Jul 2008
Posts: 121

Original Poster
Rep: Reputation: 23
@cedrik, no, no

We have a misunderstanding here...
I don't want code for parsing this maillog in particular. I want a function that I can use in any situation.
So, "IP" or "Connected" has nothing to do with it....

Now I come to think of it.
I'm so focused on solving things in bash, but maybe it's even better to include the "sort | uniq -c | sort -n" inside the perl code too....
But that's another step I don't want to take right now...

So the input of the code is columnar data in which only the first column is of any relevance. This column is numerical.

This would have been sufficient if I only needed the plain average:
Code:
.... | awk '{avg += $1}END{printf "%d\n", avg/NR}'
Now I want a replacement for the awk one-liner.
It has to discard a portion of the deviant values, which IMHO often gives a more representative average.
This can be done by sorting all the values, cutting a portion off both ends and taking the average of what is left.

You already did the real work (and my thanks for your time).
I just don't know how to turn your code into a replacement for that awk one-liner.
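For reference, a drop-in for the plain-average one-liner might look like the sketch below. It is not taken from Cedrik's Perl; it is a separate awk version resting on two assumptions: the input is already sorted numerically on the first column (as it is after `sort -n`), and `trimavg` is just a name picked for illustration. This kind of average is commonly called a trimmed mean.

```shell
# trimavg: hypothetical name; expects input already sorted numerically on column 1.
# $1 = percentage to cut from each end (default 20).
trimavg() {
    awk -v pct="${1:-20}" '
        { v[NR] = $1 }                       # remember every first-column value
        END {
            k = int(NR * pct / 100)          # values to skip at each end
            for (i = k + 1; i <= NR - k; i++) { sum += v[i]; n++ }
            if (n) printf "%d\n", sum / n    # integer output, like the plain-average one-liner
        }'
}

printf '%s\n' 43 45 76 76 76 345 | trimavg    # prints 68
```

With six values and 20%, one value is cut from each end, leaving 45 76 76 76, whose average 68.25 is printed as 68. Usage in a pipeline would be `... | sort | uniq -c | sort -n | trimavg`.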

BTW... this is fine (no need for the 3rd option)
Quote:
medianavg <file>
cat <file> | medianavg

Last edited by frater; 02-03-2012 at 05:52 AM.
 
Old 02-03-2012, 06:18 AM   #12
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 244
Quote:
Originally Posted by frater
@cedrik, no, no
We have a misunderstanding here...
I don't want code for parsing this maillog in particular. I want a function that I can use in any situation.
So, "IP" or "Connected" has nothing to do with it....
Now I come to think of it.
I'm so focused on solving things in bash, but maybe it's even better to include the "sort | uniq -c | sort -n" inside the perl code too....
But that's another step I don't want to take right now...
But this is what my perl examples do...
(they uniq, count, sort...)
 
Old 02-03-2012, 11:13 AM   #13
frater
Member
 
Registered: Jul 2008
Posts: 121

Original Poster
Rep: Reputation: 23
I think it's because I'm calling the script from bash.
I really don't know enough to convert it.
 
Old 02-03-2012, 11:27 AM   #14
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 374
I have skimmed the thread, but unless I missed something (entirely possible), doesn't your solution just add a few lines to bash?

For instance:
Code:
lineCount=$( grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | wc -l )
linePct=20 #Percentage -- must specify as whole number (e.g. NOT 0.2)
numFilteredLines=$(( lineCount * linePct / 100 ))
headLines=$(( lineCount - numFilteredLines ))
tailLines=$(( headLines - numFilteredLines ))
grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n | head -n ${headLines} | tail -n ${tailLines}
Or, using sed (instead of head-tail):
Code:
lineCount=$( grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | wc -l )
linePct=20 #Percentage -- must specify as whole number (e.g. NOT 0.2)
numFilteredLines=$(( lineCount * linePct / 100 ))
lineStart=$(( numFilteredLines + 1 ))
lineStop=$(( lineCount - numFilteredLines ))
grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n | sed -n "${lineStart},${lineStop}p"
Caveat: I have not tested either, but the concepts are sound... provided I understood the problem.

EDIT:
The head-tail approach is flawed... I will correct it momentarily...

EDIT2:
I believe the head-tail approach is fixed now.

EDIT3:
Well, another problem: the original lineCount was counting all the lines in the log versus the processed pipeline (which included a text-modifying 'uniq'). So the pipeline command has been used for the lineCount to be correct.
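If running the pipeline twice is a concern, the head-tail idea could be wrapped in a filter that buffers its input once. The following is a rough sketch under assumptions: `trim_lines` is a made-up name, the input is small enough to hold in a shell variable, and it is already sorted. It only uses `tail -n +N` and `head -n N`, so it should not depend on the negative count for `head -n`, which as far as I know is a GNU extension.

```shell
# trim_lines: hypothetical helper; prints stdin minus PCT percent of lines from each end.
# $1 = percentage to drop per side (default 20). Variables are global (plain sh has no local).
trim_lines() {
    pct=${1:-20}
    data=$(cat)                       # buffer the input so the pipeline only runs once
    [ -n "$data" ] || return 0        # nothing to do on empty input
    count=$(printf '%s\n' "$data" | wc -l)
    drop=$(( count * pct / 100 ))     # lines removed at each end, truncated
    printf '%s\n' "$data" | tail -n "+$(( drop + 1 ))" | head -n "$(( count - 2 * drop ))"
}
```

Example: `grep -o 'Connected: [0-9.]*' maillog.txt | sort | uniq -c | sort -n | trim_lines 20`. Because command substitution strips trailing newlines, trailing blank lines in the input are not counted.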

Last edited by Dark_Helmet; 02-03-2012 at 11:54 AM.
 
  


All times are GMT -5. The time now is 12:15 PM.
