LinuxQuestions.org

tweed08 03-19-2012 12:58 PM

Speeding up a script to count number of repeat characters in each column
 
Hi!

Long time lurker, first time I haven't been able to easily search for my answer!

I have a text file in this format:

Code:

AAABCDBBCD...D
AAABDDBCCD...A
AAABCDACCD...B
AAABCCDBCD...C
AA--CCCBCD...-
AAA-CC---D...-

Each character can only be A, B, C, D or -.

For each column (not row), I would like to find how many times the most frequent character occurs (counting A, B, C and D only, not -).

An output for the above example would be:

Code:

6
6
5
4
5
3
2
3
5
6
...
1

I have written this very clunky script, but I am unhappy with its speed.
Could anyone suggest a faster way of doing this?

Code:

# Loop over every column, from 1 to RowLength

for (( n=1; n<=RowLength; n++ ))
do
        # Extract column n from every line of the file
        INPUT=$(cut -c "$n" "$TargetFile")

        # Count the number of A, B, C and D in this column
        A=$(echo "$INPUT" | tr -dc 'A' | wc -c)
        B=$(echo "$INPUT" | tr -dc 'B' | wc -c)
        C=$(echo "$INPUT" | tr -dc 'C' | wc -c)
        D=$(echo "$INPUT" | tr -dc 'D' | wc -c)

        # Keep the largest of the four counts
        ABCD=$(printf '%s\n' "$A" "$B" "$C" "$D" | sort -n | tail -1)

        echo "$ABCD"
done

Many thanks for any help!

grail 03-20-2012 03:29 AM

How about:
Code:

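#!/usr/bin/awk -f

# An empty FS makes every character its own field
# (supported by gawk; not required by POSIX)
BEGIN{
    FS = ""
    split("ABCD",letters)    # letters[1]="A" ... letters[4]="D"
}

# Tally every character by column: count[column, character]
{
    for( i=1; i<=NF; i++ )
        count[i,$i]++
}

# For each column, print the highest count among A, B, C and D
END{
    for( x=1; x<=NF; x++ )
    {
        out = 0
        for( y=1; y<=4; y++ )
            if( count[x,letters[y]] > out )
                out = count[x,letters[y]]
        print out
    }
}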
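
As a quick check, here is a minimal sketch that runs the same approach on the first ten columns of the example from the first post. It is not the script exactly as posted: the names sample, result and cols are made up for illustration, cols remembers the widest line instead of relying on NF inside END, and split gets an explicit separator. It still assumes an awk that treats an empty FS as one character per field, as gawk does.

```shell
# First ten columns of the example (whatever the "..." stands for is left out).
sample='AAABCDBBCD
AAABDDBCCD
AAABCDACCD
AAABCCDBCD
AA--CCCBCD
AAA-CC---D'

result=$(printf '%s\n' "$sample" | awk '
BEGIN {
    FS = ""                            # one character per field (not POSIX)
    split("A B C D", letters, " ")     # explicit separator, independent of FS
}
{
    if (NF > cols) cols = NF           # cols: hypothetical name, widest line seen
    for (i = 1; i <= NF; i++)
        count[i, $i]++                 # tally this character for column i
}
END {
    for (x = 1; x <= cols; x++) {
        out = 0
        for (y = 1; y <= 4; y++)
            if (count[x, letters[y]] > out)
                out = count[x, letters[y]]
        print out
    }
}')

echo "$result"
```

With those six lines as input this should print 6 6 5 4 5 3 2 3 5 6, one value per line, which matches the first ten values of the expected output in the first post.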


tweed08 03-20-2012 04:19 PM

Thank you very much - that's much, much faster!

I get most of the code, but I don't understand this part - any chance of an explanation?

Code:

count[i,$i]++

Code:

if( count[x,letters[y]] > out )

Thanks again!

grail 03-21-2012 01:30 AM

count[i,$i]++ - Arrays in awk are associative, so any string can be used as an index. For the 'A' in the first column of the first line, this is the same as count[1,"A"]++. The plus plus increases the value associated with this index by 1.
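
To make the indexing concrete: awk stores count[i,$i] under a single string key, built by joining the two subscripts with the built-in SUBSEP variable. The sketch below uses a made-up three-line input (AB, AB, CB) and a hypothetical variable named keys; it tallies the characters the same way and then splits each key back into column and character. As before, FS = "" assumes gawk or a compatible awk.

```shell
# Hypothetical input: three lines of two characters each.
keys=$(printf 'AB\nAB\nCB\n' | awk '
BEGIN { FS = "" }                       # one character per field (not POSIX)
{ for (i = 1; i <= NF; i++) count[i, $i]++ }
END {
    for (key in count) {
        split(key, part, SUBSEP)        # part[1] = column, part[2] = character
        print "count[" part[1] ",\"" part[2] "\"] = " count[key]
    }
}' | LC_ALL=C sort)                     # for-in order is unspecified, so sort it

echo "$keys"
```

Column 1 holds A, A, C and column 2 holds B, B, B, so the entries listed are count[1,"A"] = 2, count[1,"C"] = 1 and count[2,"B"] = 3.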

if( count[x,letters[y]] > out ) - as per the explanation above, this retrieves the value stored at this array index and compares it with the value of 'out'. The 'letters' array is:
Code:

letters[1] = "A"
letters[2] = "B"
letters[3] = "C"
letters[4] = "D"

So again it is a check against 'out', which starts at 0 for every column, so the first iteration will be:
Code:

x=1
y=1
count[1, letters[1]] > 0

# which from above would be:

count[1, "A"] > 0
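
To see the whole inner loop at work, here is a rough trace for column 7 of the example (B, B, A, D, C, -), fed in as one character per line so that it becomes column 1. The variable names trace and c are made up for this sketch, and the + 0 only forces a letter that was never seen to print as 0.

```shell
# Column 7 of the example, one character per line.
trace=$(printf 'B\nB\nA\nD\nC\n-\n' | awk '
{ count[1, $1]++ }                      # each line is a single character
END {
    split("A B C D", letters, " ")
    out = 0
    for (y = 1; y <= 4; y++) {
        c = count[1, letters[y]] + 0    # an unseen letter reads as 0
        if (c > out)
            out = c
        print letters[y] ": count=" c " out=" out
    }
}')

echo "$trace"
```

The trace ends with out=2, the count for B. The '-' is tallied in count as well, but it is never compared because 'letters' only holds A, B, C and D.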

Finally, here is a good resource for awk that I use whenever stuck: http://www.gnu.org/software/gawk/man...ode/index.html

tweed08 03-21-2012 09:15 AM

Great - thank you again for the help!


All times are GMT -5.