Can you help speed up my script? (cut/awk one liner)
Hi,
Im trying to learn quicker ways to run these commands as they are taking a LONG time with my huge genotype files. Ive tried using the merge file method without luck. I am calling the file2 column number as a variable from file1. PHP Code:
:) Lel |
Maybe there are possibilities to re-arrange the program flow...
I mean, snp can take 800000 different values ? If no, maybe it will be more efficient to store all different snp values in an array in a first pass |
Quote:
Code:
awk '{print "cut -f1,"$2" -d\\ genefile.text"}' snpnumbers.txt | sh |
Looks like you've got file1 snpnumbers.txt with 800,000 recs and you want the 2nd field value (X) from each rec.
You then want to output the 1st & X'th value from file2 genefile.text, which is just 1 very(!) long rec with space separated field vals. If that's correct (& even if it isn't) I'd definitely use Perl. It's ideal for that and in fact a lot of Perl is used in bio-related work http://www.bioperl.org/wiki/Main_Page HTH http://perldoc.perl.org/ http://www.perlmonks.org/?node=Tutorials http://www.tizag.com/perlT/index.php |
Thanks jthill: Your method is nice and easy but it still takes a few minutes to run 1 snp. I have 2500 to loop through! This is why im trying to avoid using cut.
I like the idea of calling snp in an array but im having a little trouble executing it. Im not overly familiar with while loops! PHP Code:
|
Again could we see some data before and after? Only has to be a few lines of each file and its associated output.
|
Sorry of course you can,
genefile.txt (2500 lines) A00000020 2 1 2 0 2 -> n=800,000 A00000022 2 2 2 2 2 ... A00000027 2 2 2 1 2 ... A00000029 1 2 2 2 2 ... snpnumbers.txt (2000 lines) BovineHD1000008269 270834 10 25336507 A C T G BovineHD1000008270 270835 10 25337462 A G A G BovineHD1000008288 270836 10 25396995 A G A G BovineHD1000008294 270837 10 25404716 A G A G BovineHD1000008297 270838 10 25421854 A G T C So the first line will print column 270835 of genefile In the script i will also print a 'phenotype score' column that matches to the ID. phen.txt A00001975 2118 12.2593 A00001978 1734 17.8009 A00002186 1527 28.8607 A00000279 5620 18.449 A00000327 9368 26.8168 A00002002 1367 10.6606 The output I need is $ID, $GENOTYPE, $PHENOTYPE A00000020 2 29.4022 A00000022 1 23.3695 A00000027 1 6.0783 A00000029 0 16.3565 A00000378 0 15.7392 |
Hmmmm, but this 'snpnumbers.txt (2000 lines)' (your post #7) does NOT match up with 'snpnumbers.txt with 800,000 recs' (my post #4) even though in your post#5 you say it does.
Also '29.4022' in your wanted o/p does not appear in any of your other 'example' data.... Can you check post #7 and provide correct actual example? |
Yes this data I have provided is definitely accurate, i was just trying to provide a simple example above which is why i left out phenotype, but yes that for loop should be to 2000.
I can do this part.. ##Take snp number from pathway and writes to array. Loop through array But am stuck on the while loop... ## For each animal ID print ($ID $geno file 1) and ($3 (phenotype) file 2) awk '{ a[NR]=$2 } END { for(i=1;i<=NR;i++) print a[i] }' tmp/snpdata | while read a; #works do awk 'FNR==NR{ID[$1]=$1;next}(ID[$1]) {print $1,$3,$ID[$1]}' OFS='\t' genefile.txt phen.txt; done #doesnt work |
Okay, the extra info helps. Genefile.txt has 2500 ~1.5MB records, you're generating what looks like tagged permutations of each, generating about 40G of output. You're starting to get into data-processing territory here.
I'm assuming you don't have hundreds of gigs RAM to play with. This code adds additional fields to ease resequencing to match your original. It also sorts phen.txt and puts a ":" between the ID and GENOTYPE in phen.txt to make `join` usable. Code:
$ awk '{print $1":"$2" "$3;}' phen.txt \ |
Hmmm ... I am still missing something :(
Let me see if this is what you are saying? 1. Based on second column of snpnumbers.txt you want to retrieve this column + 1 from genefile.txt 2. Using the first column of genefile.txt, from line being used in step 1, you wish to match this with the first column in phen.txt and return column ??? (I do not know here as none of the output matches with any from this file in your output data) 3. Form output of the form id, genotype, phenotype If the above is correct then are we simply looking for, based on current examples, output of 2000 lines, ie the number of lines in snpnumbers.txt? |
Grail, Your points 1 and 3 are right, but the output will have 2500 lines (1 for each animal ID). The genotype line is the column of snpnumbers.txt which will change each time I loop through.
As for the phenotype file you mentioned in point 2. Any column is fine, for the example 3. In reality I have >20 columns. |
Hi jthill, thanks again for your help. I do have quite a bit or RAM to use. 100GB is fine, 500GB is pushing it. I tried this script you suggested but it took 32 seconds to run one SNP.
PHP Code:
|
I think I will just have to accept that this will take a long time to run. Ive tried a while loop and even a perl script.
It seems paste is the quickest. Thanks anyway for your help i really appreciate it :) |
You're welcome. Yeah, at this point you're down to about a one-day job to get it done, right? And you're going to be I/O bound soon if not already, not much benefit trying to improve it unless you'll be doing this repeatedly. It definitely could be improved if so.
Hey, try one thing: pipe the output through gzip and see if that helps. If you're I/O bound I expect it will help a lot, 'cause you're sure to have an idle core or three lying around to run the compression. |
All times are GMT -5. The time now is 04:37 AM. |