awk pattern matching between huge files?
Hi there,
I've been trawling the forums to try and find a solution to this problem, but haven't seen anything. I got an awesome response last time I posted here, so without trying to overstay my welcome, I was wondering if you could help again!

I have a large (~100 million line) file containing DNA sequencing data. The data is composed of 21 tab-delimited columns. I'm using the first and the tenth to filter my data, but need the line in its entirety to run my analysis. The first column is an identifier (which comes as a pair), and the tenth column is the 75 characters that make up the DNA read. For the sake of space, here's an example of just those two columns for 4 records:

Code:
HS7_194:8:1201:20342:162100 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
HS7_194:8:1201:20342:162100 ACCCTAACCCTGACCCAAACCCTAACCGTAACCCTAACCCTAACAAAAACCTTATTCTTAGCCCACCCCAACCCT
HS7_194:8:1206:18615:155222 TAGGGTTAGGGTTAGGGGGTTAGGGTTAGGGTTAGGGGGTTAGGGTTAGGGTTAGGGGGTTAGGGTTAGGGTTAG
HS7_194:8:1206:18615:155222 CCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCC

Now, I've written a bash script to allow users to count a certain pattern ($PATTERN). To do this I've been counting instances in awk, cheating a little by saving a bash variable called $AWK_REPEATER that allows me to select how many instances I want to look for ($NUMBER):

Code:
AWK_REPEATER=`printf " && (sub (/"$PATTERN"/,\",\"))"%.s $(eval echo "{2..$NUMBER}")`
awk '{print $10}' | awk '{while ((sub (/'"$PATTERN"'/,","))'"$AWK_REPEATER"') t++}END{print t}'

So if I wanted to look for lines that had 4x CCCTAA, I'd get:

Code:
awk '{print $10}' | awk '{while ((sub (/CCCTAA/,",")) && (sub (/CCCTAA/,",")) && (sub (/CCCTAA/,",")) && (sub (/CCCTAA/,","))) t++}END{print t}'

This part works fine, and for the above 4 lines gives me 6 (3 sets of 4x CCCTAA in line 1, 1 set in line 2, 0 in line 3 and 2 in line 4). What I want to do now is to pull out those lines so I can look at them.
This poses two problems. The first is that I have to print column 10 prior to the search (it's possible that my pattern is present in one of the 20 other columns, and I wouldn't want those counted). The second is that I've obviously told awk to substitute my pattern with commas, so the string in that column has been changed.

So I have two main questions:

1. Is it possible to sub on a specific column in awk? (something like '{while (sub[$10] (/PATTERN/,","))}' )
2. To convert the commas back to the pattern I was going to use sed ('s/,/PATTERN/g'), but again, is there a way to do this specifically on the 10th column? There are definitely commas elsewhere on my lines that I don't want to be converted.

The only workaround I can think of is to use a unique character/string when I do my sub, so I can sed it back without worrying about it interfering with my other columns. Is this even a particularly efficient way of doing this? Is there a way of counting patterns within columns that doesn't convert the pattern to something else? I even tried printing the identifier column ($1) and cross-referencing it against the original file to pull out those lines, but the files are too big and the RAM gets crapped out.

Any help that you can offer would be greatly appreciated!!

Thanks,
Mark |
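Regarding question 1: awk's sub() and gsub() accept an optional third argument naming the target of the substitution, so the search can be confined to $10 directly. A minimal sketch (the sample record below is made up, with CCCTAA planted in both field 1 and field 10):

```shell
# gsub(regexp, replacement, target) confines the substitution to `target`;
# here only field 10 is searched, so the CCCTAA in field 1 is not counted.
printf 'CCCTAAxx\tb\tc\td\te\tf\tg\th\ti\tACCCTAACCCTAAC\tk\n' |
awk '{ t += gsub(/CCCTAA/, ",", $10) } END { print t }'
# prints 2
```

Since gsub() returns the number of substitutions it made, substituting the match with itself, gsub(/CCCTAA/, "CCCTAA", $10), counts the copies without altering the field at all, which would also sidestep question 2.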
Please use [CODE][/CODE] tags around the code and data, it makes it much easier to read:
If the goal is to output the lines whose tenth field contains at least four copies of the sequence, the simplest filter is a regular expression:

Code:
CCCTAA.*CCCTAA.*CCCTAA.*CCCTAA

Parameterized, you can count the occurrences instead; gsub() returns the number of substitutions it makes, so substituting the sequence with itself counts the copies without altering the record:

Code:
awk -v sequence=CCCTAA -v occurrences=4 '
    gsub(sequence, sequence, $10) >= occurrences
'

or, using the regular expression directly:

Code:
awk -v pattern='CCCTAA.*CCCTAA.*CCCTAA.*CCCTAA' '($10 ~ pattern)'

As your dataset is very large, you might only wish to limit the results to say ONLY four matches of CCCTAA. To do this, you need to first exclude extended matches -- those with five or more. Applied to the first script, this is

Code:
awk -v sequence=CCCTAA -v occurrences=4 '
    gsub(sequence, sequence, $10) == occurrences
'

and applied to the second,

Code:
awk -v accept='CCCTAA.*CCCTAA.*CCCTAA.*CCCTAA' \
    -v reject='CCCTAA.*CCCTAA.*CCCTAA.*CCCTAA.*CCCTAA' \
    '($10 ~ accept) && ($10 !~ reject)'

Also note that if you have it installed, mawk is quite a bit faster for this than gawk. I recommend using

Code:
LANG=C LC_ALL=C mawk ...

I did have a bit of difficulty in parsing exactly what it is you wanted to do, instead of what you do, so I'd be very interested to know if this is anything like what you need. |
Not sure I understand completely, but my thinking is, you want to print the matching lines and have a final count at the end?
Code:
awk -vlook="CCCTAA" -vnum=4 'n=split($2,_,look){t+=n-1}END{print int(t/num)}' file |
Sorry for not using the CODE tags; I didn't realize I could do that...
Nominal Animal: Going the regex route is an excellent idea! I did start with something similar, but at the time wanted to switch to counting instances rather than lines. The difference between:

Code:
awk '{if ($10 ~ /CCCTAA.*CCCTAA.*CCCTAA.*CCCTAA/) print $0}'

and

Code:
awk '{print $10}' | awk '{while ((sub (/CCCTAA/,",")) && (sub (/CCCTAA/,",")) && (sub (/CCCTAA/,",")) && (sub (/CCCTAA/,","))) t++}END{print t}'

is that for a line like

Code:
HS7_194:8:1201:20342:162100 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC

the former counts the line once, while the latter counts how many times the pattern is found (12 copies of CCCTAA, so for four repeats this line will be counted 3 times). Having said that, I think in the interests of speed your solution is way better than what I was coming up with, and lines should be as quantitative as instances. Thanks!

I tried your code but got a syntax error at the ($10 ~ pattern) bit. Was I doing something wrong? Anyway, my workaround was to use a similar format to my previous code in my bash script:

Code:
AWK_REPEATER=`printf ".*"$PATTERN_FWD""%.s $(eval echo "{2..$NUMBER}")`

My program essentially asks the user to input a pattern and a number of instances at a prompt, then analyzes the data in ~100 files, each with ~100 million lines. The idea is that if you have a protein that you believe binds DNA at a particular sequence, you can purify the DNA from that protein and see if there's an enrichment in the number of lines/sequences containing that pattern. As such, I don't want to limit the analysis to an exact number of repeats. The data is stored as a binary file (so I have to pipe it through awk), and I convert back to binary afterwards, so the file sizes aren't too bad. Thanks for your help!!

grail: Sorry; I don't think I was explaining very well. I wanted to 1) count the number of times 'x' instances of a pattern occur in my data, and 2) save those complete lines to another file for further analysis. I should have come at it the other way round; create the file first, then just do a wc -l to count the lines, which is what I'm doing now. Thanks though! |
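The printf trick in $AWK_REPEATER generalizes: printf reuses its format string once per extra argument, and %.s prints each argument with zero precision (i.e. nothing), so each argument contributes one ".*PATTERN" repetition. A sketch with assumed values, using seq in place of the eval'd brace expansion:

```shell
# printf repeats its format once per argument; %.s swallows the argument
# itself (precision 0), so each repetition contributes only ".*CCCTAA".
PATTERN=CCCTAA
NUMBER=4
REGEX="$PATTERN$(printf ".*$PATTERN%.s" $(seq 2 "$NUMBER"))"
echo "$REGEX"   # CCCTAA.*CCCTAA.*CCCTAA.*CCCTAA
```

seq 2 4 supplies three arguments, yielding three ".*CCCTAA" repetitions after the leading copy, i.e. a regex requiring four occurrences.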
If you save the following as an executable script, you can give the pattern and the minimum count as command-line arguments:

Code:
#!/usr/bin/mawk -f
BEGIN {
    pattern = ARGV[1]
    count   = ARGV[2]
    ARGV[1] = ARGV[2] = ""
}
split($10, parts, pattern) > count

and run it as, for example:

Code:
xzcat data.xz | ./this.awk CCCTAA 4 > save

I shamelessly stole the counting mechanism from Grail's code above. The idea is that splitting the field using the desired pattern will always yield (occurrences+1) elements; since split() returns the number of elements, the number of occurrences of PATTERN is split(string,array,PATTERN)-1. Very nice, Grail! |
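The split() arithmetic is easy to check on a toy string; a minimal sketch:

```shell
# "ACCCTAACCCTAAC" contains CCCTAA twice; split() yields the three
# surrounding pieces ("A", "", "C"), so pieces - 1 = occurrences.
echo 'ACCCTAACCCTAAC' | awk '{ print split($1, parts, /CCCTAA/) - 1 }'
# prints 2
```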
As you have asked for the additional (wc -l) detail, this is rather simple too:

Code:
awk -vlook="CCCTAA" -vnum=4 'n=split($2,_,look){t+=n-1;count++;print > "save"}END{print "pattern appeared",int(t/num),"times on",count,"lines"}' file |
Great solution! Thanks guys!
grail: Since the data is binary, I pipe it through samtools view. To count the lines I use:

Code:
samtools view FILE | wc -l

and to run your solution:

Code:
samtools view FILE | awk -vlook="CCCTAA" -vnum=4 'n=split($10,_,look){t+=n-1;count++;print > "save"}END{print "pattern appeared",int(t/num),"times on",count,"lines"}'

If I run what I was originally doing I get a different value though:

Code:
samtools view FILE | awk '{print $10}' | awk '{while ((sub (/CCCTAA/,",")) && (sub (/CCCTAA/,",")) && (sub (/CCCTAA/,",")) && (sub (/CCCTAA/,","))) t++}END{print t}'

Nominal Animal: This is a small function of a much larger shell script. I've used getopts to specify the user options, and then a simple prompt:

Code:
echo -en "-> Please input sequence motif to search: "

so compared with passing the pattern and count as arguments:

Code:
samtools view FILE | ./nascript.awk CCCTAA 4 > save

is there any advantage of one way over the other? Thanks again to both of you for all your help! |
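On the prompt-versus-arguments question, the two styles differ only in how the values reach awk; either way the pattern ends up in an ordinary awk variable. A sketch of both, with assumed values:

```shell
PATTERN=CCCTAA
# 1) -v assignments, as in the one-liners above:
echo 'ACCCTAAC' | awk -v look="$PATTERN" '{ print split($0, _, look) - 1 }'
# 2) positional arguments consumed in BEGIN, as in a standalone awk script;
#    blanking ARGV[1] stops awk from treating the pattern as an input filename.
echo 'ACCCTAAC' | awk 'BEGIN { look = ARGV[1]; ARGV[1] = "" }
                       { print split($0, _, look) - 1 }' "$PATTERN"
# both print 1
```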
Could you clarify a bit, please?
You have PATTERN, CCCTAA in the example, and a minimum count COUNT, 4 in the example. You want the awk script to output only those lines where PATTERN occurs COUNT or more times in the tenth field. This seems clear.

But what count do you want in addition? The total number of occurrences of PATTERN in all the tenth fields, including those not output? The total number of occurrences of PATTERN in the lines that are output (i.e. have at least COUNT occurrences)? The total number of lines where PATTERN has occurred? The total number of lines output?

Your last message indicates you want to count only sets of COUNT occurrences; what about rounding? If you have 2*COUNT-1 occurrences on a line, is that counted as one set or two?

Assuming you want the total number of occurrences in the tenth fields that are output:

Code:
read -p 'Sequence motif to search: ' PATTERN
read -p 'Minimum number of occurrences: ' NUMBER
samtools view FILE | awk -v pattern="$PATTERN" -v number="$NUMBER" '
    (n = split($10, parts, pattern) - 1) >= number {
        total += n
        print
    }
    END { print "pattern appeared", total, "times in the saved records" > "/dev/stderr" }
' > save

The above snippet only considers records with at least NUMBER occurrences, but it will count individual occurrences, not sets of NUMBER occurrences. It would not be difficult at all to create a histogram of the number of hits, if you wanted to draw a bar graph of that (the number of occurrences on the horizontal axis, the number of records on the vertical axis).

At this point, the main problem is to understand what information, exactly, you want out of your data.

Finally, if you use this a lot, it would certainly be a lot faster to do this all with a native program. If your sequences contain only four bases, you could keep up to 92 base pairs per record, variable length, with the location of the corresponding line in a text file, in 32 bytes. That would mean the binary index for 100 million records would be a tad over 3 gigabytes, small enough to fit in RAM on a good workstation; the searches then would be near-instantaneous. It's the I/O that limits your data rate. If you're seriously interested, I think I could whip up a program with a similar interface to the scriptlets above, with the pattern and count queries built in to the program. |
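The histogram mentioned above needs only an associative array keyed by the occurrence count; a sketch on three made-up reads:

```shell
# Tally records by how many times CCCTAA occurs in each; iteration order
# of `for (n in hist)` is unspecified in awk, hence the sort.
printf '%s\n' 'ACCCTAACCCTAAC' 'TTTT' 'CCCTAACCCTAACCCTAA' |
awk '{ hist[split($0, p, /CCCTAA/) - 1]++ }
     END { for (n in hist) print n, hist[n] }' | sort -n
```

For these three reads (2, 0, and 3 occurrences) this prints one record each at counts 0, 2, and 3.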