Shell script to remove outliers with IQR technique
Hi,
I wrote a highly unoptimized script that reads a one-column file of values, passed as its one and only argument.
As far as I've tested it, it does the job correctly, but slowly.
while IFS= read -r line; do
    # (( )) compares numerically; < and > inside [[ ]] compare as
    # strings, which gives wrong answers for numbers of different widths.
    # (Assumes integer data -- use bc for floating point.)
    if (( line < IQR_X15LOW || line > IQR_X15HIGH )); then
        # Anchor the pattern so e.g. "5" can't match inside "15",
        # and delete only the first occurrence.
        sed "0,/^$line\$/{/^$line\$/d;}" "$filename" > "$tmpfile"
        IS_DATA_OK=0
        break
    fi
done < "$filename"
I didn't try timing it, but you can drop the sed call entirely: call your function once per pass and just write out the numbers that fall inside the range.
Code:
while IFS= read -r line; do
    # Numeric test via (( )); > and < inside [[ ]] compare as strings.
    # (Assumes integer data -- use bc for floating point.)
    if (( line > IQR_X15LOW && line < IQR_X15HIGH )); then
        echo "$line" >> "$tmpfile"
    fi
done < "$filename"
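For a one-column numeric file, the same filter can also be done in a single awk pass with no shell loop at all. This is a sketch, not the original script: the bounds and file names here are hypothetical stand-ins, and in the real script `IQR_X15LOW` and `IQR_X15HIGH` would come from the IQR calculation.

```shell
#!/bin/bash
# Hypothetical bounds and file names for illustration only.
IQR_X15LOW=10
IQR_X15HIGH=90
filename=data.txt
tmpfile=data.tmp

printf '%s\n' 5 20 50 80 95 > "$filename"   # sample data

# awk compares numerically and reads the whole file in one pass,
# so there is no per-line subshell or pipe.
awk -v lo="$IQR_X15LOW" -v hi="$IQR_X15HIGH" \
    '$1 > lo && $1 < hi' "$filename" > "$tmpfile"

cat "$tmpfile"
```

With the sample data above, only 20, 50, and 80 survive the filter; awk also handles floating-point values, which bash's `(( ))` does not.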
You can go even further by holding the file contents in an array, to avoid processing the file repeatedly.
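A minimal sketch of that idea (assuming bash 4+, where `mapfile` is available; the file name is just an example):

```shell
#!/bin/bash
filename=data.txt
printf '%s\n' 3 7 42 > "$filename"    # sample data

# Read the whole file into an array once ...
mapfile -t numbers < "$filename"

# ... then every later pass works on the in-memory array,
# never touching the file again.
for value in "${numbers[@]}"; do
    echo "got: $value"
done
echo "count: ${#numbers[@]}"
```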
I've included a simple example bash script. It expects a file of integer values whose name is passed as the first command line argument, and deletes the two numbers given as the second and third arguments. It then adds the remaining numbers and calculates the integer value of the average. It's just a simple example of how to use an array to reprocess a list of numbers without having to reprocess an external file time and again. I preferred to keep it simple in case you weren't familiar with the bash features I was using, and because studying statistics only up to the level of things like standard deviation and two-tailed hypothesis testing was good enough for me. If I wanted to learn about midhinge stuff, I'd study mechanics.
calc_fast.bash:
Code:
#!/bin/bash
debug=1
if [[ $debug -eq 1 ]]
then
    set -x    # "-x" turns tracing on; "+x" would turn it off
fi
BOOBOO=99    # exit status must be 0-255, so no negative values
if [[ $# -ne 3 ]]    # $# is the argument count ($BASH_ARGC needs extdebug)
then
    echo "Usage: $0 data_file_name first_num_to_delete second_num_to_delete"
    exit $BOOBOO
fi
if [ -r "$1" ]
then
    data_file_name="${1}"
    if [[ $debug -eq 1 ]]
    then
        echo "data_file_name=${data_file_name}"
    fi
else
    exit $BOOBOO
fi
first_num_to_delete=$2
second_num_to_delete=$3
declare -a numbers
if [[ $debug -eq 1 ]]
then
    cat "${data_file_name}"
fi
# It seems that readarray and mapfile aren't working for me.
# So let's use a simple minded loop, instead. :-O
while read -r value
do
    if [[ $debug -eq 1 ]]
    then
        echo "value=${value}"
    fi
    if [[ -z ${value} ]]
    then
        break
    fi
    count=${#numbers[@]}
    if [[ $debug -eq 1 ]]
    then
        echo "count=${count}"
    fi
    numbers[$count]=${value}
done < "${data_file_name}"
if [[ $debug -eq 1 ]]
then
    echo "Array numbers=${numbers[@]}"
fi
# Loop over the array INDICES ("${!numbers[@]}"), not the values,
# so the subscripts passed to unset point at the right elements.
for index_num in "${!numbers[@]}"
do
    if [[ $debug -eq 1 ]]
    then
        echo "index_num=$index_num"
    fi
    if [[ ${numbers[$index_num]} -eq $first_num_to_delete ]]
    then
        unset 'numbers[index_num]'
    elif [[ ${numbers[$index_num]} -eq $second_num_to_delete ]]
    then
        unset 'numbers[index_num]'
    fi
done
sum=0
if [[ $debug -eq 1 ]]
then
    echo "Count of numbers=${#numbers[*]}"
fi
for value in "${numbers[@]}"
do
    if [[ $debug -eq 1 ]]
    then
        echo "$value"
    fi
    sum=$(( sum + value ))
done
echo "Sum=$sum,average=$(( sum / ${#numbers[*]} ))"
There is an equally simple data file attached, which I used to make sure the script works, like so:
Code:
./calc_fast.bash input_numbers.txt 9 10 |& less
Naturally the calculations you are doing could be expected to take a few times as long as what this simple script does. But to give you an idea of the potential speedup from not repeatedly reprocessing a file external to the script: when I fed this script your data file, it took between 2 and 10 seconds on my machine, depending on what else was happening at the time. My machine is NOT brand new; it's technology from a few years ago.
in general:
1. bash is not really good at mathematics
2. if you still want to calculate, use $(( expression ))
3. avoid backticks, and $( ) if possible, because they are slow
4. keep data in variables, not in files
So actually I would try to read the content into an array and work on that. All the head, tail and bc calls can be replaced with built-in functionality (probably that sed can be replaced too).
Use bash builtins instead of external commands: even a complex shuffling of strings and variables is MUCH faster than something like "head -$q1 $filename | tail -1".
Every pipe '|' and every command substitution (backticks or $()) creates a subshell, and is to be avoided in an often-repeated function.
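For example, the quoted `head -$q1 $filename | tail -1` pattern (two external processes and a pipe on every call) collapses to a single array subscript once the data is in memory. A sketch, with made-up data and a made-up `q1`; note that `head` counts lines from 1 while bash arrays index from 0:

```shell
#!/bin/bash
filename=data.txt
printf '%s\n' 11 22 33 44 55 > "$filename"   # sample data
q1=2                                          # hypothetical line number

# Slow way: two processes and a pipe per lookup.
slow=$(head -"$q1" "$filename" | tail -1)

# Fast way: load the file once, then index the array.
mapfile -t numbers < "$filename"
fast=${numbers[q1-1]}

echo "slow=$slow fast=$fast"    # both print 22
```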
In addition, in your sorted data the outliers are always below Q1 and above Q3, so you can write loops that check just those end ranges instead of scanning the entire file.
I edited a bit the first post because I didn't mention the outlier removal procedure, which is the purpose of the script.
A loop is needed for the removal procedure.
To determine whether data contains an outlier:
1. Identify the point furthest from the mean of the data.
2. Determine whether that point is further than 1.5*IQR away from the mean.
3. If so, that point is an outlier and should be eliminated from the data, resulting in a new set of data.
4. Repeat these steps on the new data set until it no longer contains an outlier.
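The repeat-until-clean loop above could be sketched like this, working on a sorted in-memory array. This is only an illustration, not the original script: it assumes integer data (bash arithmetic is integer-only), uses a crude index rule for the quartile positions, and removes values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] rather than measuring distance from the mean. Because the array is sorted, only the two ends need checking on each pass.

```shell
#!/bin/bash
filename=data.txt
printf '%s\n' 1 2 3 4 5 6 7 8 9 100 > "$filename"   # 100 is the outlier

# Sort once, hold everything in memory from then on.
mapfile -t numbers < <(sort -n "$filename")

while :; do
    n=${#numbers[@]}
    # Crude quartile picks: elements at n/4 and 3n/4 (0-based).
    # Good enough to illustrate the loop structure.
    q1=${numbers[n/4]}
    q3=${numbers[3*n/4]}
    iqr=$(( q3 - q1 ))
    low=$((  q1 - 3*iqr/2 ))    # Q1 - 1.5*IQR, in integer arithmetic
    high=$(( q3 + 3*iqr/2 ))    # Q3 + 1.5*IQR

    # Sorted data: an outlier can only sit at the first or last slot.
    if (( numbers[n-1] > high )); then
        unset 'numbers[n-1]'
    elif (( numbers[0] < low )); then
        numbers=("${numbers[@]:1}")
    else
        break                   # no outlier left: done
    fi
    numbers=("${numbers[@]}")   # re-pack indices after the removal
done

printf '%s\n' "${numbers[@]}"
```

With the sample data, the first pass drops 100 and the second pass finds nothing outside the fences, so the loop stops with 1 through 9 remaining.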