LinuxQuestions.org - [SOLVED] Speeding up a script to count number of repeat characters in each column

Speeding up a script to count number of repeat characters in each column

Hi!

Long time lurker, first time I haven't been able to easily search for my answer!

I have a text file in this format:

Code:

AAABCDBBCD...D
AAABDDBCCD...A
AAABCDACCD...B
AAABCCDBCD...C
AA--CCCBCD...-
AAA-CC---D...-

Where any character value can only be A,B,C,D or -

For each column (not row), I would like to count how many times the most frequent character appears (counting A,B,C,D only, not -).

An output for the above example would be:

Code:

6 6 5 4 5 3 2 3 5 6 ... 1

I have written this very clunky script, but am unhappy with the speed.
Could anyone suggest a faster way of doing this?

Code:

# begin loop here from 1 to RowLength

for (( n=1; n<=$RowLength; n++ ))
do
        A=0
        B=0
        C=0
        D=0

        INPUT=`cut -c $n $TargetFile` # Cut input to a single character, starting column n

        A=$(echo $INPUT | tr -dc 'A' | wc -c) # count number of A,B,C,D in this column
        B=$(echo $INPUT | tr -dc 'B' | wc -c)
        C=$(echo $INPUT | tr -dc 'C' | wc -c)
        D=$(echo $INPUT | tr -dc 'D' | wc -c)

        ABCD=`echo -e "$A\n$B\n$C\n$D" | sort -n | tail -1`

        echo $ABCD
done

Many thanks for any help!

How about:

Code:

#!/usr/bin/awk -f

BEGIN{
    FS = ""
    split("ABCD",letters)
}

{
    for( i=1; i<=NF; i++ )
        count[i,$i]++
}

END{
    for( x=1; x<=NF; x++ )
    {
        out = 0
        for( y=1; y<=4; y++ )
            if( count[x,letters[y]] > out )
                out = count[x,letters[y]]
        print out
    }
}
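
To try the approach on the sample data, here is a minimal sketch that feeds only the ten visible columns of the example (the part elided by "..." is left out) through the same counting logic. It is a variant under stated assumptions rather than the script as posted: it walks each line with substr() instead of relying on FS = "" to split a line into characters, since that behaviour is not guaranteed by every awk. The names width and column_max are made up for the illustration.

```shell
#!/bin/sh
# Ten visible columns of the sample input; the elided "..." part is omitted.
column_max=$(printf '%s\n' \
    'AAABCDBBCD' \
    'AAABDDBCCD' \
    'AAABCDACCD' \
    'AAABCCDBCD' \
    'AA--CCCBCD' \
    'AAA-CC---D' |
awk '
{
    # Remember the longest line so END knows how many columns to report.
    if (length($0) > width) width = length($0)
    # count[column, character] is incremented once per occurrence.
    for (i = 1; i <= length($0); i++)
        count[i, substr($0, i, 1)]++
}
END {
    n = split("A B C D", letters, " ")
    for (x = 1; x <= width; x++) {
        out = 0
        # Only A, B, C and D are checked, so "-" never contributes.
        for (y = 1; y <= n; y++)
            if (count[x, letters[y]] > out)
                out = count[x, letters[y]]
        printf "%s%s", out, (x < width ? " " : "\n")
    }
}')

echo "$column_max"   # prints: 6 6 5 4 5 3 2 3 5 6
```

The whole file is read once and all columns are counted in the same pass, whereas the original loop starts cut, tr, wc and sort again for every column.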

Thank you very much - that's much, much faster!

I get most of the code, but I don't understand this part - any chance of an explanation?

Code:

count[i,$i]++

Code:

if( count[x,letters[y]] > out )

Thanks again!

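count[i,$i]++ - Arrays in awk are associative, so the index can be any string and not just a number. For the first character of the first line, which is 'A', this is equivalent to count[1,"A"]++. The ++ increases the value stored under that index by 1 (an entry that has not been set yet counts as 0).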
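
As a small sketch of that indexing, the following runs the same increment over two made-up lines (AB and AC, not taken from the file in the question) and then reads a few entries back. It uses substr() to pick out each character, which is an assumption made for portability, and keys_demo is just an illustrative variable name.

```shell
#!/bin/sh
keys_demo=$(printf 'AB\nAC\n' | awk '
{
    for (i = 1; i <= length($0); i++)
        count[i, substr($0, i, 1)]++
}
END {
    # Column 1 held "A" twice; column 2 held "B" once and "C" once.
    print count[1, "A"], count[2, "B"], count[2, "C"]
    # awk joins the two subscripts into one string key using SUBSEP.
    if ((1 SUBSEP "A") in count) print "joined key exists"
}')

echo "$keys_demo"
# prints:
# 2 1 1
# joined key exists
```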

if( count[x,letters[y]] > out ) - as per the explanation above, this retrieves the value stored under this array index and compares it with the value of 'out'. The 'letters' array is:

Code:

letters[1] = "A"
letters[2] = "B"
letters[3] = "C"
letters[4] = "D"

So again it is a check against 'out', which is reset to 0 at the start of each column, so the first iteration will be:

Code:

x=1
y=1
count[1, letters[1]] > 0

# which from above would be:

count[1, "A"] > 0
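
To see how 'out' ends up holding the largest count, here is a short sketch that traces the inner loop for a single column. The counts are filled in by hand and correspond to column 7 of the ten visible sample columns (B,B,A,D,C,-). The variable name trace and the extra print are only there for illustration.

```shell
#!/bin/sh
trace=$(awk 'BEGIN {
    # Hand-entered counts for one column (column 7 of the visible sample).
    count[7, "A"] = 1; count[7, "B"] = 2; count[7, "C"] = 1; count[7, "D"] = 1
    n = split("A B C D", letters, " ")
    x = 7
    out = 0
    for (y = 1; y <= n; y++) {
        if (count[x, letters[y]] > out)
            out = count[x, letters[y]]
        # letter, its count in this column, running maximum so far
        print letters[y], count[x, letters[y]], out
    }
}')

echo "$trace"
# prints:
# A 1 1
# B 2 2
# C 1 2
# D 1 2
```

After the last letter has been checked, 'out' is 2, which is the value the END block would print for that column.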

Finally, here is a good resource for awk that I use whenever stuck: http://www.gnu.org/software/gawk/man...ode/index.html

Great - thank you again for help!