LinuxQuestions.org - [SOLVED] awk pattern matching between huge files?

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - awk pattern matching between huge files? (https://www.linuxquestions.org/questions/programming-9/awk-pattern-matching-between-huge-files-944445/)

awk pattern matching between huge files?

Hi there,

I've been trawling the forums to try and find a solution to this problem, but haven't seen anything. I got an awesome response last time I posted here, so without trying to overstay my welcome, I was wondering if you could help again!

I have a large (~100 million line) file containing DNA sequencing data. The data is composed of 21 tab-delimited columns. I using the first and 10th to filter my data, but need the line in its entirety to run my analysis. The first column is an identifier (which comes as a pair), and the tenth column is 75 characters that make up the DNA. For the sake of space, here's an example of just those two columns for 4 records

HS7_194:8:1201:20342:162100 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
HS7_194:8:1201:20342:162100 ACCCTAACCCTGACCCAAACCCTAACCGTAACCCTAACCCTAACAAAAACCTTATTCTTAGCCCACCCCAACCCT
HS7_194:8:1206:18615:155222 TAGGGTTAGGGTTAGGGGGTTAGGGTTAGGGTTAGGGGGTTAGGGTTAGGGTTAGGGGGTTAGGGTTAGGGTTAG
HS7_194:8:1206:18615:155222 CCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCC

Now, I've written a script in bash to allow users to count certain patterns ($PATTERN). To do this I've been counting instances in awk, cheating a little by saving a bash variable called $AWK_REPEATER that allows me to select how many instance I want to look for ($NUMBER):

> AWK_REPEATER=`printf " && (sub (/"$PATTERN"/,\",\"))"%.s $(eval echo "{2..$NUMBER}")`
> awk '{print $10}' | awk '{while ((sub (/'"$PATTERN"'/,","))'"$AWK_REPEATER"') t++}END{print t}'

So if I wanted to look for lines that had 4x CCCTAA, I'd get:

> awk '{print $10}' | awk '{while ((sub (/CCCTAA/,",")) && (sub (/CCCTAA/,",")) && (sub (/CCCTAA/,",")) && (sub (/CCCTAA/,","))) t++}END{print t}'

This part works fine, and for the above 4 lines gives me 6 (3 sets of 4xCCCTAA in line 1, 1 set in line 2, 0 in line 3 and 2 in line 4). What I want to do now is to pull out those lines so I can look at them. This poses two problems; the first is that I have to print column 10 prior to the search (as it's possible that my pattern is present in one of the 20 other columns, and wouldn't want those counted). The second is that I've obviously told awk to substitute my patterns for commas, so the string in that column has been changed.

So I have two main questions;

1. Is it possible to specifically sub a column in awk? (something like '{while (sub[$10] (/PATTERN/,",")}' )

2. To convert the commas back to the pattern I was going to use sed:
sed 's/,/PATTERN/g'

but again, is there a way to specifically do this on the 10th column? There are definitely commas elsewhere on my line that I don't want to be converted. The only work around I can think of is to use a unique character/string when I do my sub, so I can sed it back without the worry of it interfering with my other columns.

Is this even a particularly efficient way of doing this? Is there a way of counting patterns within columns that doesn't convert the pattern to something else?

I even tried printing the identifier column ($1) and cross-referencing it to the original file to pull out those lines, but the files are too big and the ram gets crapped out.

Any help that you can offer would be greatly appreciated!!

Thanks,

Mark

Please use [CODE][/CODE] tags around the code and data, it makes it much easier to read:

Quote:

Originally Posted by rare_aquatic_badger (Post 4676148)

Code:

HS7_194:8:1201:20342:162100        ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC

HS7_194:8:1201:20342:162100        ACCCTAACCCTGACCCAAACCCTAACCGTAACCCTAACCCTAACAAAAACCTTATTCTTAGCCCACCCCAACCCT

HS7_194:8:1206:18615:155222        TAGGGTTAGGGTTAGGGGGTTAGGGTTAGGGTTAGGGGGTTAGGGTTAGGGTTAGGGGGTTAGGGTTAGGGTTAG

HS7_194:8:1206:18615:155222        CCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCC

I would personally first convert the patterns to regular expressions. For example, four occurrences of CCCTAA can be simply written as

Code:

CCCTAA.*CCCTAA.*CCCTAA.*CCCTAA

You can then use the pattern matching operator, string ~ pattern to check if a string contains the pattern.

Quote:

Originally Posted by rare_aquatic_badger (Post 4676148)

What I want to do now is to pull out those lines so I can look at them.

So check only the tenth field, $10 . Here is an example:

Code:

awk -v sequence=CCCTAA -v occurrences=4 '#

    BEGIN {

        # Accept any newline convention in input

        RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"



        # Data is separated by single TAB characters

        FS = "[\t]"



        # Construct the search pattern

        pattern = sequence

        for (i = 2; i <= occurrences; i++)

            pattern = pattern ".*" sequence

    }



    # Output all records (entire lines) with matching sequences

    ($10 ~ pattern) {

        printf("%s\n", $0)

    }

' input-file > output-file

Actually, I'd personally change the requirements a bit. If you describe the POSIX Basic Regular Expression (without backreferences) rules to your users, then you can allow any number of parallel expressions to capture the source data. The following one accepts one or more patterns, separated by whitespace, and outputs each line with the matching pattern appended to the line. If a line matches multiple patterns, it will be output multiple times.

Code:

awk -v patterns='CCCTAA.*CCCTAA.*CCCTAA.*CCCTAA' '#

    BEGIN {

        # Accept any newline convention in input

        RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"



        # Data is separated by single TAB characters

        FS = "[\t]"



        # Construct the search pattern array

        gsub(/[\t\n\v\f\r ]+/, "\t", patterns)

        split(patterns, temp)

        for (i in temp)

            if (length(i) > 0)

                pattern[temp[i]] = temp[i]

    }



    {

        for (p in pattern)

            if ($10 ~ p)

                printf("%s\t%s\n", $0, p)

    }

' input-file > output-file

In both of the above scripts, you can use ^ to refer to the start of the sequence (so ^T only matches if the sequence starts with T), and $ to refer to the end of the sequence (so [GC]$ only matches if the sequence ends with G or C).

As your dataset is very large, you might only wish to limit the results to say ONLY four matches of CCCTAA. To do this, you need to first exclude extended matches -- those with five or more. Applied to the first script, this is

Code:

awk -v sequence=CCCTAA -v occurrences=4 '#

    BEGIN {

        # Accept any newline convention in input

        RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"



        # Data is separated by single TAB characters

        FS = "[\t]"



        # Construct the search pattern

        pattern = sequence

        for (i = 2; i <= occurrences; i++)

            pattern = pattern ".*" sequence



        # Construct the rejection pattern

        reject = pattern ".*" sequence

    }



    ($10 ~ reject) { next }



    ($10 ~ pattern) { printf("%s\n", $0) }

' input-file > output-file

In the general case, I'd use a separate rejection list that matches the patterns I don't want to see, e.g. FIVE matches of CCCTAA. Remember that any sequence with six or more occurrences will always have five occurrences! So, to eliminate the ones that have too many matches, you only need to reject those that have one more than you want.) This would be e.g.

Code:

awk -v accept='CCCTAA.*CCCTAA.*CCCTAA.*CCCTAA' \

    -v reject='CCCTAA.*CCCTAA.*CCCTAA.*CCCTAA.*CCCTAA' '#

    BEGIN {

        # Accept any newline convention in input

        RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"



        # Data is separated by single TAB characters

        FS = "[\t]"



        # Construct the search pattern array

        gsub(/[\t\n\v\f\r ]+/, "\t", accept)

        split(accept, temp)

        for (i in temp)

            if (length(temp[i]) > 0)

                pattern_accept[temp[i]] = temp[i]



        # Construct the reject pattern array

        gsub(/[\t\n\v\f\r ]+/, "\t", reject)

        split(reject, temp)

        for (i in temp)

            if (length(temp[i]) > 0)

                pattern_reject[temp[i]] = temp[i]

    }



    {

        for (p in pattern_reject)

            if ($10 ~ p)

                next



        for (p in pattern_accept)

            if ($10 ~ p)

                printf("%s\t%s\n", $0, p)

    }

' input-file > output-file

The maximum benefit from this type of awk approach is that awk variants stream the data, instead of reading it into memory. Typically, awk will only keep a few lines in memory (or a few tens of kilobytes for caching), whichever is larger.

Also note that if you have it installed, mawk is quite a bit faster for this than gawk. I recommend using

Code:

LANG=C LC_ALL=C mawk ...

if it works for you, to minimise overhead (you don't need locale support for the above scripts) and maximize speed.

I did have a bit of difficulty in parsing exactly what it is you wanted to do, instead of what you do, so I'd be very interested to know if this is anything like what you need.

Not sure I understand completely, but my thinking is, you want to print the matching lines and have a final count at the end?

Code:

awk -vlook="CCCTAA" -vnum=4 'n=split($2,_,look){t+=n-1}END{print int(t/num)}' file

Sorry for not using the CODE tags; I didn't realize I could do that...

Nominal Animal:

Going the regex route is an excellent idea! I did start with something similar, but at the time wanted to switch to count instances rather than lines. The difference between:

Code:

awk '{if ($10 ~ /CCCTAA.*CCCTAA.*CCCTAA.*CCCTAA/) print $0}'

and

Code:

awk '{print $10}' | awk '{while ((sub (/CCCTAA/,",")) && (sub (/CCCTAA/,",")) && (sub (/CCCTAA/,",")) && (sub (/CCCTAA/,","))) t++}END{print t}'

can be seen on the first line example:

Code:

HS7_194:8:1201:20342:162100 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC

The former will just count the line (=1)
The latter will count how many times the pattern is found (12 copies of CCCTAA, therefore for four repeats this line will be counted 3 times

Having said that, I think for the interests of speed, your solution is way better than what I was coming up with, and lines should be as quantitative as instances. Thanks!

I tried your code but got a syntax error at ($10 ~ pattern) bit. Was I doing something wrong? Anyway my workaround was to use a similar format to my previous code in my BASH script:

Code:

AWK_REPEATER=`printf ".*"$PATTERN_FWD""%.s $(eval echo "{2..$NUMBER}")`

awk '{if ($10 ~ /'"$PATTERN_FWD"''"$AWK_REPEATER"'/) print $0}'

and that seems to work fine now.

My program essentially asks the user to input a pattern and a number of instances at a prompt, then analyzes the data on ~100 files, each with ~100 million lines. The idea is if you have a protein that you believe binds DNA at a particular sequence, you can purify the DNA from that protein, and see if there's an enrichment in the number of lines/sequences containing that pattern. As such, I don't want to limit the analysis to an exact number of repeats. The data is stored as a binary file (so I have to pipe it through awk), and I convert back to binary, so the file sizes aren't too bad.

Thanks for your help!!

grail:

Sorry; I don't think I was explaining very well. I wanted to 1) count the number of times 'x' instances of a pattern occur in my data, and 2) save those complete lines to another file for further analysis. I should have come at it the other way round; create the file first, then just do a wc -l to count the lines, which is now what I'm doing. Thanks though!

Quote:

Originally Posted by rare_aquatic_badger (Post 4676324)

I wanted to 1) count the number of times 'x' instances of a pattern occur in my data, and 2) save those complete lines to another file for further analysis.

Well, I think this one does exactly that:

Code:

#!/usr/bin/mawk -f

BEGIN {

    RS = "[\v\f ]*(\r\n|\n\r|\r|\n)[\v\f ]*"

    FS = "[\v\f ]*[\t][\v\f ]*"



    if (ARGC < 3) {

        printf("\nUsage: %s PATTERN COUNT [ FILE(s)... ]\n\n", ARGV[0]) > "/dev/stderr"

        exit(1)

    }



    pattern = toupper(ARGV[1])

    if (length(pattern) < 1) {

        printf("%s: Invalid PATTERN.\n", ARGV[1]) > "/dev/stderr"

        exit(1)

    }



    count = int(ARGV[2])

    if (count < 1) {

        printf("%s: Invalid COUNT.\n", ARGV[2]) > "/dev/stderr"

        exit(1)

    }



    for (i = 3; i < ARGC; i++)

        ARGV[i-2] = ARGV[i]

    delete ARGV[--ARGC]

    delete ARGV[--ARGC]

}



($10 contains pattern) {

    hits = split($10, temp, pattern) - 1

    split("", temp)



    if (hits >= count)

        printf("%s\t%d\n", $0, hits)

}

This one takes two parameters: PATTERN and COUNT. You can also supply input file name or names. If none are specified, the script will read from standard input. For example:

Code:

xzcat data.xz | ./this.awk CCCTAA 4 > save

will write all lines where CCCTAA occurs four or more times in the tenth field, with the number of occurrences added to the end of the lines, to file save.

I shamelessly stole the counting mechanism from Grail's code above. The idea is that splitting the field using the desired pattern will always yield (occurrences+1) elements; since split() returns the number of elements, the number of occurrences of PATTERN is split(string,array,PATTERN)-1. Very nice, Grail!

Quote:

Originally Posted by Nominal Animal

Very nice, Grail!

Well I figure it must almost be my turn as you always have very comprehensive answers :)

Quote:

Originally Posted by rare_aquatic_badger

Sorry; I don't think I was explaining very well. I wanted to 1) count the number of times 'x' instances of a pattern occur in my data, and 2) save those complete lines to another file for further analysis. I should have come at it the other way round; create the file first, then just do a wc -l to count the lines, which is now what I'm doing. Thanks though!

So when you ran my code (assuming you gave it a go on a small sample), did you not get what you wanted?

As you have asked for additional (wc -l) detail, this is rather simple too:

Code:

awk -vlook="CCCTAA" -vnum=4 'n=split($2,_,look){t+=n-1;count++;print > "save"}END{print "pattern appeared",int(t/num),"times on",count"lines"}' file

Great solution! Thanks guys!

grail:

Quote:

So when you ran my code (assuming you gave it a go on a small sample), did you not get what you wanted?

I wasn't able get your code working. It generated an identical file to the one I started off with (my files are binary and need to first be parsed by a program called samtools).

Code:

samtools view FILE | wc -l

22868464 FILE



wc -l save 

22868464 save

It does print appear to be counting a pattern (I changed $2 to $10 as the string I want to pattern match is the 10th column):

Code:

samtools view FILE | awk -vlook="CCCTAA" -vnum=4 'n=split($10,_,look){t+=n-1;count++;print > "save"}END{print "pattern appeared",int(t/num),"times on",count"lines"}'

"pattern appeared 77843 times on 22868464lines"

If I run what I was originally doing I get a different value though:

Code:

samtools view FILE | awk '{print $10}' | awk '{while ((sub (/CCCTAA/,",")) && (sub (/CCCTAA/,",")) && (sub (/CCCTAA/,",")) && (sub (/CCCTAA/,","))) t++}END{print t}'

gives me 3001 hits where the pattern matches. I can't work out where this discrepancy happens. Could you explain what the 'split' portion is doing? I've not used that in an awk one-liner before.

Nominal Animal:

This is a small function of a much larger shell script. I've used getopts to specify the user options, and then a simple

Code:

echo -en "-> Please input sequence motif to search:  "

read PATTERN

echo -en "-> Please input number of repeats:  "

read NUMBER

for the user to input the motif and number of instances. If I run your script:

Code:

samtools view FILE | ./nascript.awk CCCTAA 4 > save



wc -l save

1386

I get 1386 matches, which I think is because it's counting lines rather than instances (if there are 12xCCCTAA on a single line, using my awk script will count that line 3 times, while this counts it only 1 time). So this is functionally equivalent to your earlier solution; CCCTAA.*CCCTAA.*CCCTAA.*CCCTAA

is there any advantage of one way over the other?

Thanks again to both of you for all your help!

Could you clarify a bit, please?

You have PATTERN, CCCTAA in the example, and a minimum count COUNT, 4 in the example.

You want the awk script to output only those lines where PATTERN occurs COUNT or more times in the tenth field. This seems clear.

But, what count do you want in addition? The total number of occurrences of PATTERN in all the tenth fields, including those not output? The total number of occurrences of PATTERN in the files that are output (i.e. have at least COUNT occurrences)? The total number of lines where PATTERN has occurred? The total number of lines output?

Your last message indicates you want to count only sets of COUNT occurrences; what about rounding? What if you have 2COUNT-1 occurrences on a line; is that counted as one or two sets?

Assuming you want the total number of occurrences in the tenth fields that are output:

Code:

read -p 'Sequence motif to search: ' PATTERN

read -p 'Number of occurrences: ' NUMBER



if RESULTS=($( samtools view FILE | mawk -v pattern="$PATTERN" -v count="$NUMBER" '#

    BEGIN {

        RS = "[\v\f ]*(\r\n|\n\r|\r|\n)[\v\f ]*"

        FS = "[\v\f ]*[\t][\v\f ]*"

        if (length(pattern) < 1) {

            printf("%s: Invalid PATTERN.\n", pattern)

            exit(1)

        }

        pattern = toupper(pattern)

        if (int(count) < 1) {

            printf("%s: Invalid COUNT.\n", count)

            exit(1)

        }

        count = int(count)

    }

    ($10 contains pattern) {

        hits = split($10, temp, pattern) - 1

        split("", temp)

        if (hits >= count) {

            printf("%s\t%d\n", $0, hits) > "/dev/stderr"

            total += hits

            lines++

            if (hits > max)

                max = hits

        }

    }

    END {

        printf("%d %d %d\n", lines, total, max)

    }' "$PATTERN" "$NUMBER" 2>SAVE ) ) ; then



    # Got a result.

    echo "Found ${TOTAL[1]} hits total, in ${TOTAL[0]} input records; up to ${TOTAL[2]} hits in a single record."



else



    # Invalid PATTERN or NUMBER. Reconstruct the error message.

    echo "${RESULTS[*]}" >&2



fi

Again, you can use awk or gawk instead of mawk; mawk is just faster. The results should be exactly the same.

The above snippet only considers records with at least NUMBER occurrences, but it will count individual occurrences, not sets of NUMBER occurrences.

It would not be difficult at all to create a histogram of the number of hits, if you wanted to draw a bar graph of that. (The number of occurrences in the horizontal axis, the number of records in the vertical axis.)

At this point, the main problem is to understand what information, exactly, you want out of your data.

Finally, if you use this a lot, it would be certainly a lot faster to do this all with a native program. If your sequences contain only four bases, you could keep up to 92 base pairs per record, variable length, with location of the corresponding line in a text file, in 32 bytes. That would mean the binary index for 100 million records would be a tad over 3 gigabytes, small enough to fit in RAM on a good workstation; the searches then would be near-instantaneous. It's the I/O that limits your data rate. If you're seriously interested, I think I could whip up a program with similar interface to the scriptlets above, with the pattern and count queries built-in to the program.

Quote:

Could you explain what the 'split' portion is doing?

Actually NA already explained it in post #6:

Quote:

The idea is that splitting the field using the desired pattern will always yield (occurrences+1) elements; since split() returns the number of elements, the number of occurrences of PATTERN is split(string,array,PATTERN)-1.

You can look up more on split here