LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 09-25-2019, 03:23 PM   #1
Linux.tar.gz
Senior Member
 
Registered: Dec 2003
Location: Paris
Distribution: Slackware forever.
Posts: 2,426

Rep: Reputation: 95
Shell script to remove outliers with IQR technique


Hi,

I wrote a highly unoptimized script that takes a single argument: a one-column file of values.
As far as I've tested it, it does the job correctly, but slowly.

IQR stands for InterQuartile Range:
https://en.wikipedia.org/wiki/Interquartile_range

The IQR outliers detection is explained here:
https://en.wikipedia.org/wiki/Interq...range#Outliers

The IQR outliers removal is explained here:
https://www.mathworks.com/matlabcent...quartile-range

I guess there are a gazillion ways to improve it, which is why I'm posting it.
Also because I couldn't find a bash script that does this.

The script:
Code:
#!/bin/bash

filename="/dev/shm/tmp.txt"
sort -n $1 > $filename

tmpfile="/dev/shm/tmp2.txt"

IS_DATA_OK=0

IQRfilter () {

    rows=`wc -l $filename | cut -d' ' -f1`

    q2=`echo "($rows+1)/2" | bc`
    q1=`echo "$q2 / 2" | bc`
    q3=`echo "3 * $q1" | bc`

    Q1=`head -$q1 $filename | tail -1`
    #echo $Q1

    Q2=`head -$q2 $filename | tail -1`
    #echo $Q2

    Q3=`head -$q3 $filename | tail -1`
    #echo $Q3

    IQR=`echo "$Q3-$Q1" | bc`
    #echo $IQR

    IQR_X15LOW=`echo "$Q1-($IQR*1.5)" | bc`
    #echo $IQR_X15LOW

    IQR_X15HIGH=`echo "$Q3+($IQR*1.5)" | bc`
    #echo $IQR_X15HIGH
    
    while IFS= read -r line; do
        if [[ $line < $IQR_X15LOW || $line > $IQR_X15HIGH ]]; then
        sed "0,/$line/{/$line/d;}" $filename > $tmpfile
        IS_DATA_OK=0
        break
        fi
    done < "$filename"
    
}

while ((IS_DATA_OK!=1))
do

   IS_DATA_OK=1
   IQRfilter
   mv $tmpfile $filename

done

mv $filename ./output.txt

rm /dev/shm/tmp.txt /dev/shm/tmp2.txt

exit 0
A sample data file is attached.
Attached Files
File Type: txt input.txt (39.1 KB, 9 views)

Last edited by Linux.tar.gz; 09-26-2019 at 02:37 PM.
 
Old 09-25-2019, 10:40 PM   #2
michaelk
Moderator
 
Registered: Aug 2002
Posts: 19,028

Rep: Reputation: 2890
Code:
    while IFS= read -r line; do
        if [[ $line < $IQR_X15LOW || $line > $IQR_X15HIGH ]]; then
        sed "0,/$line/{/$line/d;}" $filename > $tmpfile
        IS_DATA_OK=0
        break
        fi
    done < "$filename"
I didn't time it, but you can eliminate the outer while loop, the repeated function calls, and sed by just writing out the values that fall inside the bounds.
Code:
    while IFS= read -r line; do
        if [[ $line > $IQR_X15LOW && $line < $IQR_X15HIGH ]]; then
        echo "$line" >> $tmpfile
        fi
    done < "$filename"

Last edited by michaelk; 09-25-2019 at 10:42 PM.
 
1 members found this post helpful.
Old 09-25-2019, 11:56 PM   #3
rigor
Member
 
Registered: Sep 2011
Posts: 282

Rep: Reputation: Disabled
You can go even further by holding the file contents in an array, to avoid processing the file repeatedly.

I've included a simple example bash script. It reads a list of integer values from a file whose name is passed as the first command-line argument, deletes the two numbers given as the second and third arguments, then sums the remaining numbers and calculates the integer value of their average. It's just a simple example of how to use an array to reprocess a list of numbers without having to reprocess a file external to the script time and again.

I preferred to keep it simple in case you weren't familiar with the bash features I was using, and because I found that a study of statistics only up to the level of such things as standard deviation and two-tailed hypothesis testing was good enough for me. If I wanted to learn about mid hinge stuff, I'd study mechanics.

calc_fast.bash:
Code:
#!/bin/bash


debug=1

if [[  $debug  -eq  1  ]]
    then
        set +x ;
fi

BOOBOO=-99

if [[  $#  -ne  3  ]]
    then
        echo "Usage:  $0  data_file_name  first_num_to_delete  second_num_to_delete"
        
        exit $BOOBOO
fi

if [  -r  $1  ]
    then
        data_file_name="${1}"
        
        if [[  $debug  -eq  1  ]]
            then
                echo "data_file_name=${data_file_name}"
        fi

    else
        exit $BOOBOO
fi


first_num_to_delete=$2
second_num_to_delete=$3

declare -a  numbers

        
if [[  $debug  -eq  1  ]]
    then
        cat ${data_file_name}
fi

# It seems that readarray and mapfile aren't working for me.
# So let's use a simple minded loop, instead. :-O

while [[  0  -eq  0  ]]
    do
        read value ;

        if [[  $debug  -eq  1  ]]
            then
                echo 'value='${value} ;
        fi

       
        if [[  ${value}  ==  ""  ]]
            then
                break ;
        fi
 
        count=${#numbers[@]} ;

        if [[  $debug  -eq  1  ]]
            then
                echo 'count='${count} ;
        fi

        numbers[$count]=${value} ;
    done  <  ${data_file_name}


if [[  $debug  -eq  1  ]]
    then
        echo 'Arrays numbers='${numbers[@]}
fi

# Iterate over the array indices (not the values) so the lookups below are valid.
for  index_num in "${!numbers[@]}"
    do
        if [[  $debug  -eq  1  ]]
            then
                echo "index_num=$index_num"
        fi

   
        if [[  ${numbers[$index_num]}  -eq  $first_num_to_delete  ]]
            then
                unset numbers[$index_num] ;
        fi
 
        if [[  ${numbers[$index_num]}  -eq  $second_num_to_delete  ]]
            then
                unset numbers[$index_num] ;
        fi
    done

sum=0

if [[  $debug  -eq  1  ]]
    then
        echo 'Count of numbers='${#numbers[*]}
fi


for  value in "${numbers[@]}"
    do
        if [[  $debug  -eq  1  ]]
            then
                echo $value
        fi

        sum=$(( sum + value  )) ;
    done

echo "Sum=$sum,average=$(($sum/${#numbers[*]}))"
There is an equally simple data file attached, which I used to make sure the script works ala:
Code:
./calc_fast.bash input_numbers.txt 9 10 |& less
Naturally the calculations you are doing could be expected to take a few times as long as what this simple script is doing. But to give you an idea of the potential improvement in processing speed if not repeatedly reprocessing a file external to the script, when I fed this script your data file, the script took between 2 and 10 seconds to handle it on my machine, depending on what else was happening on the machine. My machine is NOT brand new, it's technology from a few years ago.

HTH.
Attached Files
File Type: txt input_numbers.txt (21 Bytes, 1 views)

Last edited by rigor; 09-26-2019 at 04:06 AM.
 
1 members found this post helpful.
Old 09-26-2019, 01:19 AM   #4
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 13,070

Rep: Reputation: 4132
in general:
1. bash is not really good at mathematics
2. if you still want to calculate use $(( expression ))
3. avoid backticks - and $( ) if possible because they are slow
4. keep data in variables, not in files

So I would read the content into an array and work on that. All the head, tail and bc calls can be replaced with built-in features (and the sed call can probably be replaced too).
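
To illustrate the array approach, here is a minimal sketch, not a drop-in replacement. It assumes one integer per line, at least a handful of rows, and bash 4 or later for mapfile; the function name quartiles_from_file is made up for this example. The position formulas are the ones from the original script.

```shell
#!/bin/bash
# Sketch: sort once, load the values into an array, then index it directly.
# quartiles_from_file is a hypothetical helper name; it sets the globals Q1, Q3 and IQR.
quartiles_from_file () {
    local file=$1
    local -a sorted
    # One sort process; mapfile (bash 4+) reads its output into the array.
    mapfile -t sorted < <(sort -n -- "$file")

    local rows=${#sorted[@]}          # replaces: wc -l | cut
    local q2=$(( (rows + 1) / 2 ))    # same 1-based positions as the original script
    local q1=$(( q2 / 2 ))
    local q3=$(( 3 * q1 ))

    # Arrays are 0-based, so line N of the sorted file is element N-1.
    Q1=${sorted[q1 - 1]}              # replaces: head -$q1 | tail -1
    Q3=${sorted[q3 - 1]}
    IQR=$(( Q3 - Q1 ))
}

# Only run when a readable file is given as the first argument.
if [[ -r ${1:-} ]]; then
    quartiles_from_file "$1"
    echo "Q1=$Q1 Q3=$Q3 IQR=$IQR"
fi
```

After the single sort, nothing here forks another process: the row count and both quartiles come from array operations.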
 
1 members found this post helpful.
Old 09-26-2019, 01:47 AM   #5
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 12,463
Blog Entries: 9

Rep: Reputation: 3372
...more in general:
  • use bash builtins instead of external commands: even a complex shuffling of strings and variables is MUCH faster than something like "head -$q1 $filename | tail -1"
  • every pipe '|', every command substitution (backticks or $()) creates a sub-shell and is to be avoided in an often-repeated function
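
The sub-shell point can be checked with a small sketch. It relies on BASHPID, which holds the process ID of the current bash process and therefore changes inside a sub-shell; the variable names are only for illustration.

```shell
#!/bin/bash
# BASHPID is the process ID of the current bash process; it differs inside a sub-shell.
main_pid=$BASHPID

# Command substitution: the echo runs in a forked sub-shell.
substitution_pid=$(echo "$BASHPID")

# printf -v assigns to a variable directly, without a fork.
printf -v builtin_pid '%s' "$BASHPID"

echo "main shell:           $main_pid"
echo "command substitution: $substitution_pid"
echo "printf -v:            $builtin_pid"
```

The first and third lines should show the same PID and the second a different one. That extra process is the cost paid on every call inside a frequently repeated function.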
 
1 members found this post helpful.
Old 09-26-2019, 06:47 AM   #6
michaelk
Moderator
 
Registered: Aug 2002
Posts: 19,028

Rep: Reputation: 2890
In addition, the outliers in your sorted data are always either below Q1 or above Q3. You can write loops that check only those two ranges instead of scanning the entire file.
 
1 members found this post helpful.
Old 09-26-2019, 02:39 PM   #7
Linux.tar.gz
Senior Member
 
Registered: Dec 2003
Location: Paris
Distribution: Slackware forever.
Posts: 2,426

Original Poster
Rep: Reputation: 95
Thanks, all!
I'm testing the code.

I edited the first post a bit because I hadn't mentioned the outlier removal procedure, which is the purpose of the script.
A loop is needed for the removal procedure.
 
Old 09-26-2019, 03:28 PM   #8
Linux.tar.gz
Senior Member
 
Registered: Dec 2003
Location: Paris
Distribution: Slackware forever.
Posts: 2,426

Original Poster
Rep: Reputation: 95
This one goes so much faster !
0,778s vs 46,127s

Code:
#!/bin/bash

tmpfile="/dev/shm/tmp.txt"
sort -n $1 > $tmpfile

tmpfile2="/dev/shm/tmp2.txt"

IS_DATA_OK=0

IQRfilter () {

    rm $tmpfile2 2>/dev/null
    FOUND_OUTLIER=0
        
    rows=`wc -l $tmpfile | cut -d' ' -f1`

    q2=$(( ($rows+1)/2 ))
    q1=$(( $q2 / 2 ))
    q3=$(( 3 * $q1 ))

    Q1=`head -$q1 $tmpfile | tail -1`
    Q2=`head -$q2 $tmpfile | tail -1`
    Q3=`head -$q3 $tmpfile | tail -1`

    IQR=$(( $Q3-$Q1 ))

    IQR_X15LOW=`echo "$Q1-($IQR*1.5)" | bc`
    IQR_X15HIGH=`echo "$Q3+($IQR*1.5)" | bc`

    while IFS= read -r line; do
    
    # [[ $FOUND_OUTLIER=0 ]] is a single non-empty word and so always true; compare numerically.
    if (( FOUND_OUTLIER == 0 )); then
    
        if [[ ($line < $IQR_X15LOW || $line > $IQR_X15HIGH) ]]; then
            FOUND_OUTLIER=1
            IS_DATA_OK=0
        else
            echo $line >> $tmpfile2
        fi
        
    else
        echo $line >> $tmpfile2
    fi
        
    done < "$tmpfile"
    
}

while ((IS_DATA_OK!=1)); do
   IS_DATA_OK=1
   IQRfilter
   mv $tmpfile2 $tmpfile
done

mv $tmpfile ./output.txt

exit 0
I can't use
Code:
$(( ))
for
Code:
IQR_X15LOW=`echo "$Q1-($IQR*1.5)" | bc`
Probably because of the float: bash's $(( )) arithmetic only handles integers.
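
One possible workaround, assuming the data are integers: write 1.5 * IQR as 3 * IQR / 2, or compare doubled quantities so that nothing gets truncated. The quartile values below are made up for illustration, and is_outlier is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical quartile values, for illustration only.
Q1=10
Q3=15
IQR=$(( Q3 - Q1 ))            # 5

# 1.5 * IQR written as 3 * IQR / 2; integer division truncates 7.5 to 7.
low=$(( Q1 - 3 * IQR / 2 ))   # 10 - 7 = 3
high=$(( Q3 + 3 * IQR / 2 ))  # 15 + 7 = 22

# To avoid the truncation, compare doubled quantities instead:
#   x < Q1 - 1.5*IQR   is equivalent to   2*x < 2*Q1 - 3*IQR
is_outlier () {
    local x=$1
    (( 2 * x < 2 * Q1 - 3 * IQR || 2 * x > 2 * Q3 + 3 * IQR ))
}

echo "truncated fences: $low .. $high"
for x in 2 3 22 23; do
    if is_outlier "$x"; then echo "$x: outlier"; else echo "$x: kept"; fi
done
```

With these values the exact fences are 2.5 and 22.5, so 2 and 23 are reported as outliers while 3 and 22 are kept. No call to bc is needed.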

Last edited by Linux.tar.gz; 09-26-2019 at 03:34 PM.
 
Old 09-26-2019, 03:41 PM   #9
michaelk
Moderator
 
Registered: Aug 2002
Posts: 19,028

Rep: Reputation: 2890
Code:
while ((IS_DATA_OK!=1)); do
   IS_DATA_OK=1
   IQRfilter
   mv $tmpfile2 $tmpfile
done
The while loop is not necessary. All you need to do is call your IQR filter function.
Code:
    while IFS= read -r line; do
        
        if [[ ($line > $IQR_X15LOW && $line < $IQR_X15HIGH) ]]; then
            echo $line >> $tmpfile2
        fi               
    done < "$tmpfile"
This loops through your data once and eliminates the outliers. My time was 0.557s. Not a big improvement but more efficient.

Last edited by michaelk; 09-26-2019 at 05:48 PM. Reason: timing
 
1 members found this post helpful.
Old 09-26-2019, 06:24 PM   #10
Linux.tar.gz
Senior Member
 
Registered: Dec 2003
Location: Paris
Distribution: Slackware forever.
Posts: 2,426

Original Poster
Rep: Reputation: 95
But the loop is needed !
Quote:
To determine whether data contains an outlier:
1 Identify the point furthest from the mean of the data.
2 Determine whether that point is further than 1.5*IQR away from the mean.
3 If so, that point is an outlier and should be eliminated from the data resulting in a new set of data.
4 Repeat steps to determine if new data set contains an outlier until dataset no longer contains outlier.
https://www.mathworks.com/matlabcent...quartile-range
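
The repeat-until-clean idea in the quote could be sketched as below. This is only a sketch under stated assumptions: the values are integers and already sorted, the fences are Q1 - 1.5*IQR and Q3 + 1.5*IQR as in the scripts in this thread (not the distance from the mean mentioned in the quote), and iqr_trim and its variables are made-up names. Because the data are sorted, the most extreme point is always at one of the two ends, so the sketch drops one end point per pass (checking the high end first) and then recomputes the quartiles.

```shell
#!/bin/bash
# Sketch of one-at-a-time outlier removal on a sorted list of integers.
# iqr_trim is a hypothetical name; it leaves the surviving values in the global array "kept".
iqr_trim () {
    local -a s=( "$@" )               # sorted input values
    local lo=0 hi=$(( $# - 1 ))       # current window [lo, hi] of kept values
    local n q1 q3 iqr count
    while (( hi - lo + 1 >= 4 )); do
        n=$(( hi - lo + 1 ))
        q1=${s[lo + n / 4]}
        q3=${s[lo + 3 * n / 4]}
        iqr=$(( q3 - q1 ))
        # Doubled comparisons avoid the fractional 1.5 factor.
        if (( 2 * ${s[hi]} > 2 * q3 + 3 * iqr )); then
            hi=$(( hi - 1 ))          # drop one high outlier, then recompute
        elif (( 2 * ${s[lo]} < 2 * q1 - 3 * iqr )); then
            lo=$(( lo + 1 ))          # drop one low outlier, then recompute
        else
            break                     # both ends are inside the fences
        fi
    done
    count=$(( hi - lo + 1 ))
    kept=( "${s[@]:lo:count}" )
}

# Made-up data: 1 and 200 are the outliers.
iqr_trim 1 50 51 52 53 54 55 56 200
printf '%s\n' "${kept[@]}"
```

Moving two indices instead of rewriting a file means each pass costs only a few arithmetic operations.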
 
Old 09-26-2019, 06:44 PM   #11
Firerat
Senior Member
 
Registered: Oct 2008
Distribution: Debian sid
Posts: 2,320

Rep: Reputation: 635
much faster

Code:
#!/bin/bash
IQRfilter () {
Sorted=( ${Sorted[@]} )
IQR=$(( ${Sorted[ 3 * ${#Sorted[@]} / 4 ]} - ${Sorted[ ${#Sorted[@]} / 4 ]} ))
IQR_X15LOW=$(( ${Sorted[ ${#Sorted[@]} / 4 ]} - $(( 3 * $IQR / 2 )) ))
IQR_X15HIGH=$(( ${Sorted[ 3 * ${#Sorted[@]} / 4 ]} + $(( 3 * $IQR / 2 )) ))
}
CheckOL () {
check=${#Sorted[@]}
for i in ${!Sorted[@]}
do
  [[ ${Sorted[i]} -lt $IQR_X15LOW || ${Sorted[i]} -gt $IQR_X15HIGH ]] \
    && unset Sorted[$i] 
done
set +x
[[ ${#Sorted[@]} == ${check} ]] && return 0
return 1
}
Output () {
for i in "${Sorted[@]}"
do
    echo $i 
done
}
trap Output EXIT
Sorted=( $(sort -n "$1") )
while :
do
    IQRfilter
    CheckOL && ((c++))
    [[ $c -gt 1 ]] && exit
done
exit
 
1 members found this post helpful.
Old 09-26-2019, 06:55 PM   #12
Firerat
Senior Member
 
Registered: Oct 2008
Distribution: Debian sid
Posts: 2,320

Rep: Reputation: 635
I forgot to move my check


change last bit to
Code:
while :
do
    IQRfilter
    CheckOL || continue
exit
done
won't make that much difference
 
1 members found this post helpful.
Old 09-26-2019, 08:01 PM   #13
michaelk
Moderator
 
Registered: Aug 2002
Posts: 19,028

Rep: Reputation: 2890
Ok, I got it...
 
1 members found this post helpful.
Old 09-27-2019, 01:18 PM   #14
Firerat
Senior Member
 
Registered: Oct 2008
Distribution: Debian sid
Posts: 2,320

Rep: Reputation: 635
faster still

Code:
#!/bin/bash
IQRfilter () {
Sorted=( ${Sorted[@]} )
IQR=$(( ${Sorted[ 3 * ${#Sorted[@]} / 4 ]} - ${Sorted[ ${#Sorted[@]} / 4 ]} ))
IQR_X15LOW=$(( ${Sorted[ ${#Sorted[@]} / 4 ]} - $(( 3 * $IQR / 2 )) ))
IQR_X15HIGH=$(( ${Sorted[ 3 * ${#Sorted[@]} / 4 ]} + $(( 3 * $IQR / 2 )) ))
check=${#Sorted[@]}
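# Walk an index up from the front: once Sorted[0] is unset, ${Sorted[0]}
# expands to empty, which -lt reads as 0, so re-testing it can loop forever.
lo=0
while (( lo < check )) && [[ ${Sorted[lo]} -lt $IQR_X15LOW ]]
do
    unset "Sorted[$lo]"
    lo=$(( lo + 1 ))
done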
while :
do
    [[ ${Sorted[(-1)]} -gt $IQR_X15HIGH ]] \
        && unset Sorted[\(-1\)] || break
done
[[ ${#Sorted[@]} == ${check} ]] && return 0
return 1
}
Output () {
for i in "${Sorted[@]}"
do
    echo $i 
done
}
trap Output EXIT
Sorted=( $(sort -n "$1") )
while :
do
    IQRfilter || continue
exit
done
 
1 members found this post helpful.
Old 09-28-2019, 09:23 PM   #15
Linux.tar.gz
Senior Member
 
Registered: Dec 2003
Location: Paris
Distribution: Slackware forever.
Posts: 2,426

Original Poster
Rep: Reputation: 95
Thanks, I'll check your code !

For some reason, the
Code:
if [[ ($line < $IQR_X15LOW || $line > $IQR_X15HIGH) ]]; then
can't filter all outliers.
I had to use
Code:
if (( $line < $IQR_X15LOW )) || (( $line > $IQR_X15HIGH )); then
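
A likely explanation, sketched here with made-up values: inside [[ ]], the < and > operators compare strings character by character, whereas (( )) compares integers. That would also be why the bounds have to be cut down to integers before using (( )).

```shell
#!/bin/bash
# Illustrative values only, not taken from the sample file.
a=9
b=10

# [[ ... < ... ]] compares strings: "9" sorts after "1", so 9 is not "less than" 10.
if [[ $a < $b ]]; then string_result="less"; else string_result="not less"; fi

# (( ... )) evaluates both sides as integers.
if (( a < b )); then numeric_result="less"; else numeric_result="not less"; fi

echo "string compare:  $a is $string_result than $b"
echo "numeric compare: $a is $numeric_result than $b"
```

The string comparison reports "not less" and the numeric one "less", so values with a different number of digits than the bounds can slip through the [[ ]] test.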
My whole corrected code, which is slightly faster than Firerat's one:
Code:
#!/bin/bash

if [[  $#  -ne  1  ]]; then
    echo "Usage: $0 data_file"
    exit 1
fi

if ! [[ -f "$1" ]]; then
    echo "Error: $1 not found"
    exit 1
fi

CWD=$(pwd)

tmpfile="/dev/shm/tmp.txt"
sort -n $1 > $tmpfile

tmpfile2="/dev/shm/tmp2.txt"
outliers="/dev/shm/outliers.txt"

IS_DATA_OK=0

IQRfilter () {

    rm $tmpfile2 2>/dev/null
    FOUND_OUTLIER=0
        
    rows=`wc -l $tmpfile | cut -d' ' -f1`

    q2=$(( ($rows+1)/2 ))
    q1=$(( $q2 / 2 ))
    q3=$(( 3 * $q1 ))

    Q1=`head -$q1 $tmpfile | tail -1`
    Q2=`head -$q2 $tmpfile | tail -1`
    Q3=`head -$q3 $tmpfile | tail -1`

    IQR=$(( $Q3-$Q1 ))

    IQR_X15LOW=`echo "$Q1-($IQR*1.5)" | bc | cut -d'.' -f1`
    IQR_X15HIGH=`echo "$Q3+($IQR*1.5)" | bc | cut -d'.' -f1`
    
    while IFS= read -r line; do
    
    # [[ $FOUND_OUTLIER=0 ]] is a single non-empty word and so always true; compare numerically.
    if (( FOUND_OUTLIER == 0 )); then
    
        if (( $line < $IQR_X15LOW )) || (( $line > $IQR_X15HIGH )); then
            FOUND_OUTLIER=1
            IS_DATA_OK=0
            echo $line >> $outliers
        else
            echo $line >> $tmpfile2
        fi
        
    else
        echo $line >> $tmpfile2
    fi
        
    done < "$tmpfile"
    
}

while ((IS_DATA_OK!=1)); do
   IS_DATA_OK=1
   IQRfilter
   mv $tmpfile2 $tmpfile
done

NAME=$(echo $1 | rev | cut -d'.' -f2 | rev)
mv $tmpfile $CWD/$NAME-filtered.dat

if [[ -f "$outliers" ]]; then
    mv $outliers $CWD/$NAME-outliers.dat
else
    echo "$1: No outlier found"
fi

exit 0

Last edited by Linux.tar.gz; 09-29-2019 at 05:07 PM.
 
  

