LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   Can you help speed up my script? (cut/awk one liner) (https://www.linuxquestions.org/questions/linux-newbie-8/can-you-help-speed-up-my-script-cut-awk-one-liner-925527/)

uni_lel7 01-24-2012 05:23 PM

Can you help speed up my script? (cut/awk one liner)
 
Hi,

Im trying to learn quicker ways to run these commands as they are taking a LONG time with my huge genotype files.

Ive tried using the merge file method without luck. I am calling the file2 column number as a variable from file1.

PHP Code:

for i in `seq 1 800000`
do
snp=`head -$i snpnumbers.txt | tail -1 | awk '{print $2}'`

cut -f1,$snp -' ' genefile.txt 
....
done

#OR the second option

awk '{print $1,$snp}' genefile.txt
....
done 

Thanks in advance for any suggestions.
:) Lel

Cedrik 01-24-2012 06:26 PM

Maybe there are possibilities to re-arrange the program flow...

I mean, snp can take 800000 different values ?

If no, maybe it will be more efficient to store all different snp values in an array in a first pass

jthill 01-24-2012 07:04 PM

Quote:

Originally Posted by uni_lel7 (Post 4583528)
Hi,
PHP Code:

for i in `seq 1 800000`
  do
    
snp=`head -$i snpnumbers.txt | tail -1 | awk '{print $2}'`
    
cut -f1,$snp -' ' genefile.txt 
    
....
  
done 


Yow. That's going to hurt.

Code:

awk '{print "cut -f1,"$2" -d\\  genefile.text"}' snpnumbers.txt | sh
will cut out an awful lot of the inefficiency, if the result is still dreadful there's lots more to get out of it. Does snpnumbers.txt select one field or multiple fields? It looks like field one might be a sort key, is the result going to be sorted on that?

chrism01 01-24-2012 07:42 PM

Looks like you've got file1 snpnumbers.txt with 800,000 recs and you want the 2nd field value (X) from each rec.
You then want to output the 1st & X'th value from file2 genefile.text, which is just 1 very(!) long rec with space separated field vals.
If that's correct (& even if it isn't) I'd definitely use Perl.
It's ideal for that and in fact a lot of Perl is used in bio-related work http://www.bioperl.org/wiki/Main_Page

HTH
http://perldoc.perl.org/
http://www.perlmonks.org/?node=Tutorials
http://www.tizag.com/perlT/index.php

uni_lel7 01-24-2012 09:15 PM

Thanks jthill: Your method is nice and easy but it still takes a few minutes to run 1 snp. I have 2500 to loop through! This is why im trying to avoid using cut.

I like the idea of calling snp in an array but im having a little trouble executing it.
Im not overly familiar with while loops!

PHP Code:

awk '{ a[NR]=$2 }  END { for(i=1;i<=NR;i++) print a[i] }' snpnumbers.txt 
| while read a; do awk '{print $1,$i}' genefile.txtdone 

But yes chrism01 - That is exactly what im trying to do. Perhaps I will have to have a try of perl.

grail 01-24-2012 09:55 PM

Again could we see some data before and after? Only has to be a few lines of each file and its associated output.

uni_lel7 01-24-2012 10:17 PM

Sorry of course you can,

genefile.txt (2500 lines)

A00000020 2 1 2 0 2 -> n=800,000
A00000022 2 2 2 2 2 ...
A00000027 2 2 2 1 2 ...
A00000029 1 2 2 2 2 ...

snpnumbers.txt (2000 lines)
BovineHD1000008269 270834 10 25336507 A C T G
BovineHD1000008270 270835 10 25337462 A G A G
BovineHD1000008288 270836 10 25396995 A G A G
BovineHD1000008294 270837 10 25404716 A G A G
BovineHD1000008297 270838 10 25421854 A G T C


So the first line will print column 270835 of genefile
In the script i will also print a 'phenotype score' column that matches to the ID.

phen.txt
A00001975 2118 12.2593
A00001978 1734 17.8009
A00002186 1527 28.8607
A00000279 5620 18.449
A00000327 9368 26.8168
A00002002 1367 10.6606

The output I need is $ID, $GENOTYPE, $PHENOTYPE
A00000020 2 29.4022
A00000022 1 23.3695
A00000027 1 6.0783
A00000029 0 16.3565
A00000378 0 15.7392

chrism01 01-24-2012 10:30 PM

Hmmmm, but this 'snpnumbers.txt (2000 lines)' (your post #7) does NOT match up with 'snpnumbers.txt with 800,000 recs' (my post #4) even though in your post#5 you say it does.
Also '29.4022' in your wanted o/p does not appear in any of your other 'example' data....

Can you check post #7 and provide correct actual example?

uni_lel7 01-24-2012 10:37 PM

Yes this data I have provided is definitely accurate, i was just trying to provide a simple example above which is why i left out phenotype, but yes that for loop should be to 2000.

I can do this part..
##Take snp number from pathway and writes to array. Loop through array

But am stuck on the while loop...
## For each animal ID print ($ID $geno file 1) and ($3 (phenotype) file 2)

awk '{ a[NR]=$2 } END { for(i=1;i<=NR;i++) print a[i] }' tmp/snpdata | while read a;
#works
do awk 'FNR==NR{ID[$1]=$1;next}(ID[$1]) {print $1,$3,$ID[$1]}' OFS='\t' genefile.txt phen.txt; done
#doesnt work

jthill 01-25-2012 01:22 AM

Okay, the extra info helps. Genefile.txt has 2500 ~1.5MB records, you're generating what looks like tagged permutations of each, generating about 40G of output. You're starting to get into data-processing territory here.

I'm assuming you don't have hundreds of gigs RAM to play with. This code adds additional fields to ease resequencing to match your original. It also sorts phen.txt and puts a ":" between the ID and GENOTYPE in phen.txt to make `join` usable.

Code:

$ awk '{print $1":"$2" "$3;}' phen.txt \
    | sort > phen-processed.txt
$ awk '
  ARGIND==1 { snp[nrsnps++]=$2; next; }

  {
          for ( i = 0 ; i < nrsnps ; ++i )
                  print $1":"$snp[i],FNR,i+1;
  }
' snpnumbers.txt genefile.txt | sort | join phen-processed.txt - >payload.txt
$


grail 01-25-2012 03:08 AM

Hmmm ... I am still missing something :(

Let me see if this is what you are saying?

1. Based on second column of snpnumbers.txt you want to retrieve this column + 1 from genefile.txt

2. Using the first column of genefile.txt, from line being used in step 1, you wish to match this with the first column in phen.txt and return column ??? (I do not know here as none of the output
matches with any from this file in your output data)

3. Form output of the form id, genotype, phenotype

If the above is correct then are we simply looking for, based on current examples, output of 2000 lines, ie the number of lines in snpnumbers.txt?

uni_lel7 01-26-2012 05:27 PM

Grail, Your points 1 and 3 are right, but the output will have 2500 lines (1 for each animal ID). The genotype line is the column of snpnumbers.txt which will change each time I loop through.

As for the phenotype file you mentioned in point 2. Any column is fine, for the example 3. In reality I have >20 columns.

uni_lel7 01-26-2012 06:19 PM

Hi jthill, thanks again for your help. I do have quite a bit or RAM to use. 100GB is fine, 500GB is pushing it. I tried this script you suggested but it took 32 seconds to run one SNP.

PHP Code:

time awk 'ARGIND==1 { snp[nrsnps++]=$2; next; }{ for ( i = 0 ; i < 1 ; ++i ) print $1, $snp[i]}' snpnumbers.txt genefile.txt >tmpfile.txt 


uni_lel7 01-26-2012 07:45 PM

I think I will just have to accept that this will take a long time to run. Ive tried a while loop and even a perl script.
It seems paste is the quickest. Thanks anyway for your help i really appreciate it :)

jthill 01-26-2012 09:19 PM

You're welcome. Yeah, at this point you're down to about a one-day job to get it done, right? And you're going to be I/O bound soon if not already, not much benefit trying to improve it unless you'll be doing this repeatedly. It definitely could be improved if so.

Hey, try one thing: pipe the output through gzip and see if that helps. If you're I/O bound I expect it will help a lot, 'cause you're sure to have an idle core or three lying around to run the compression.


All times are GMT -5. The time now is 04:37 AM.