Shell script to remove outliers with IQR technique
Hi,
I wrote a highly unoptimized script that reads a one-column file of values, passed as its one and only argument.
As far as I've tested it, it does the job correctly, but slowly.
while IFS= read -r line; do
    # (( )) compares numerically; < and > inside [[ ]] compare as
    # strings, which gives wrong answers for numbers of different widths.
    # (Assumes integer data -- use bc for floating point.)
    if (( line < IQR_X15LOW || line > IQR_X15HIGH )); then
        # Anchor the pattern so e.g. "5" can't match inside "15",
        # and delete only the first occurrence.
        sed "0,/^$line\$/{/^$line\$/d;}" "$filename" > "$tmpfile"
        IS_DATA_OK=0
        break
    fi
done < "$filename"
I didn't try timing it, but you can drop the sed call entirely: call your function once per pass and just write out the numbers that fall inside the range.
Code:
while IFS= read -r line; do
    # Numeric test via (( )); > and < inside [[ ]] compare as strings.
    # (Assumes integer data -- use bc for floating point.)
    if (( line > IQR_X15LOW && line < IQR_X15HIGH )); then
        echo "$line" >> "$tmpfile"
    fi
done < "$filename"
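For a one-column numeric file, the same filter can also be done in a single awk pass with no shell loop at all. This is a sketch, not the original script: the bounds and file names here are hypothetical stand-ins, and in the real script `IQR_X15LOW` and `IQR_X15HIGH` would come from the IQR calculation.

```shell
#!/bin/bash
# Hypothetical bounds and file names for illustration only.
IQR_X15LOW=10
IQR_X15HIGH=90
filename=data.txt
tmpfile=data.tmp

printf '%s\n' 5 20 50 80 95 > "$filename"   # sample data

# awk compares numerically and reads the whole file in one pass,
# so there is no per-line subshell or pipe.
awk -v lo="$IQR_X15LOW" -v hi="$IQR_X15HIGH" \
    '$1 > lo && $1 < hi' "$filename" > "$tmpfile"

cat "$tmpfile"
```

With the sample data above, only 20, 50, and 80 survive the filter; awk also handles floating-point values, which bash's `(( ))` does not.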
You can go even further by holding the file contents in an array, to avoid processing the file repeatedly.
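A minimal sketch of that idea (assuming bash 4+, where `mapfile` is available; the file name is just an example):

```shell
#!/bin/bash
filename=data.txt
printf '%s\n' 3 7 42 > "$filename"    # sample data

# Read the whole file into an array once ...
mapfile -t numbers < "$filename"

# ... then every later pass works on the in-memory array,
# never touching the file again.
for value in "${numbers[@]}"; do
    echo "got: $value"
done
echo "count: ${#numbers[@]}"
```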
I've included a simple example bash script. It expects a file of integer values whose name is passed as the first command line argument, and deletes the two numbers given as the second and third arguments. It then adds the remaining numbers and calculates the integer value of the average. It's just a simple example of how to use an array to reprocess a list of numbers without having to reprocess an external file time and again. I preferred to keep it simple in case you weren't familiar with the bash features I was using, and because studying statistics only up to the level of things like standard deviation and two-tailed hypothesis testing was good enough for me. If I wanted to learn about midhinge stuff, I'd study mechanics.
calc_fast.bash:
Code:
#!/bin/bash
debug=1
if [[ $debug -eq 1 ]]
then
    set -x    # "-x" turns tracing on; "+x" would turn it off
fi
BOOBOO=99    # exit status must be 0-255, so no negative values
if [[ $# -ne 3 ]]    # $# is the argument count ($BASH_ARGC needs extdebug)
then
    echo "Usage: $0 data_file_name first_num_to_delete second_num_to_delete"
    exit $BOOBOO
fi
if [ -r "$1" ]
then
    data_file_name="${1}"
    if [[ $debug -eq 1 ]]
    then
        echo "data_file_name=${data_file_name}"
    fi
else
    exit $BOOBOO
fi
first_num_to_delete=$2
second_num_to_delete=$3
declare -a numbers
if [[ $debug -eq 1 ]]
then
    cat "${data_file_name}"
fi
# It seems that readarray and mapfile aren't working for me.
# So let's use a simple minded loop, instead. :-O
while read -r value
do
    if [[ $debug -eq 1 ]]
    then
        echo "value=${value}"
    fi
    if [[ -z ${value} ]]
    then
        break
    fi
    count=${#numbers[@]}
    if [[ $debug -eq 1 ]]
    then
        echo "count=${count}"
    fi
    numbers[$count]=${value}
done < "${data_file_name}"
if [[ $debug -eq 1 ]]
then
    echo "Array numbers=${numbers[@]}"
fi
# Loop over the array INDICES ("${!numbers[@]}"), not the values,
# so the subscripts passed to unset point at the right elements.
for index_num in "${!numbers[@]}"
do
    if [[ $debug -eq 1 ]]
    then
        echo "index_num=$index_num"
    fi
    if [[ ${numbers[$index_num]} -eq $first_num_to_delete ]]
    then
        unset 'numbers[index_num]'
    elif [[ ${numbers[$index_num]} -eq $second_num_to_delete ]]
    then
        unset 'numbers[index_num]'
    fi
done
sum=0
if [[ $debug -eq 1 ]]
then
    echo "Count of numbers=${#numbers[*]}"
fi
for value in "${numbers[@]}"
do
    if [[ $debug -eq 1 ]]
    then
        echo "$value"
    fi
    sum=$(( sum + value ))
done
echo "Sum=$sum,average=$(( sum / ${#numbers[*]} ))"
There is an equally simple data file attached, which I used to make sure the script works, like so:
Code:
./calc_fast.bash input_numbers.txt 9 10 |& less
Naturally the calculations you are doing could be expected to take a few times as long as what this simple script does. But to give you an idea of the potential speedup from not repeatedly reprocessing a file external to the script: when I fed this script your data file, it took between 2 and 10 seconds on my machine, depending on what else was happening at the time. My machine is NOT brand new; it's technology from a few years ago.
in general:
1. bash is not really good at mathematics
2. if you still want to calculate, use $(( expression ))
3. avoid backticks, and $( ) if possible, because they are slow
4. keep data in variables, not in files
So actually I would try to read the content into an array and work on that. All the head, tail and bc calls can be replaced with built-in functionality (probably that sed can be replaced too).
Use bash builtins instead of external commands: even a complex shuffling of strings and variables is MUCH faster than something like "head -$q1 $filename | tail -1".
Every pipe '|' and every command substitution (backticks or $()) creates a subshell, and is to be avoided in an often-repeated function.
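For example, the quoted `head -$q1 $filename | tail -1` pattern (two external processes and a pipe on every call) collapses to a single array subscript once the data is in memory. A sketch, with made-up data and a made-up `q1`; note that `head` counts lines from 1 while bash arrays index from 0:

```shell
#!/bin/bash
filename=data.txt
printf '%s\n' 11 22 33 44 55 > "$filename"   # sample data
q1=2                                          # hypothetical line number

# Slow way: two processes and a pipe per lookup.
slow=$(head -"$q1" "$filename" | tail -1)

# Fast way: load the file once, then index the array.
mapfile -t numbers < "$filename"
fast=${numbers[q1-1]}

echo "slow=$slow fast=$fast"    # both print 22
```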
In addition, in your sorted data the outliers are always below Q1 and above Q3, so you can write loops that check just those end ranges instead of scanning the entire file.
I edited a bit the first post because I didn't mention the outlier removal procedure, which is the purpose of the script.
A loop is needed for the removal procedure.
To determine whether data contains an outlier:
1. Identify the point furthest from the mean of the data.
2. Determine whether that point is further than 1.5*IQR away from the mean.
3. If so, that point is an outlier and should be eliminated from the data, resulting in a new set of data.
4. Repeat these steps on the new data set until it no longer contains an outlier.
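The repeat-until-clean loop above could be sketched like this, working on a sorted in-memory array. This is only an illustration, not the original script: it assumes integer data (bash arithmetic is integer-only), uses a crude index rule for the quartile positions, and removes values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] rather than measuring distance from the mean. Because the array is sorted, only the two ends need checking on each pass.

```shell
#!/bin/bash
filename=data.txt
printf '%s\n' 1 2 3 4 5 6 7 8 9 100 > "$filename"   # 100 is the outlier

# Sort once, hold everything in memory from then on.
mapfile -t numbers < <(sort -n "$filename")

while :; do
    n=${#numbers[@]}
    # Crude quartile picks: elements at n/4 and 3n/4 (0-based).
    # Good enough to illustrate the loop structure.
    q1=${numbers[n/4]}
    q3=${numbers[3*n/4]}
    iqr=$(( q3 - q1 ))
    low=$((  q1 - 3*iqr/2 ))    # Q1 - 1.5*IQR, in integer arithmetic
    high=$(( q3 + 3*iqr/2 ))    # Q3 + 1.5*IQR

    # Sorted data: an outlier can only sit at the first or last slot.
    if (( numbers[n-1] > high )); then
        unset 'numbers[n-1]'
    elif (( numbers[0] < low )); then
        numbers=("${numbers[@]:1}")
    else
        break                   # no outlier left: done
    fi
    numbers=("${numbers[@]}")   # re-pack indices after the removal
done

printf '%s\n' "${numbers[@]}"
```

With the sample data, the first pass drops 100 and the second pass finds nothing outside the fences, so the loop stops with 1 through 9 remaining.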