LinuxQuestions.org
Help answer threads with 0 replies.
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
User Name
Password
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices


Reply
  Search this Thread
Old 01-24-2012, 06:23 PM   #1
uni_lel7
LQ Newbie
 
Registered: Jan 2012
Location: Melbourne
Posts: 12

Rep: Reputation: Disabled
Can you help speed up my script? (cut/awk one liner)


Hi,

Im trying to learn quicker ways to run these commands as they are taking a LONG time with my huge genotype files.

Ive tried using the merge file method without luck. I am calling the file2 column number as a variable from file1.

PHP Code:
for i in `seq 1 800000`
do
snp=`head -$i snpnumbers.txt | tail -1 | awk '{print $2}'`

cut -f1,$snp -' ' genefile.txt 
....
done

#OR the second option

awk '{print $1,$snp}' genefile.txt
....
done 
Thanks in advance for any suggestions.
Lel
 
Old 01-24-2012, 07:26 PM   #2
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 243Reputation: 243Reputation: 243
Maybe there are possibilities to re-arrange the program flow...

I mean, snp can take 800000 different values ?

If no, maybe it will be more efficient to store all different snp values in an array in a first pass
 
Old 01-24-2012, 08:04 PM   #3
jthill
Member
 
Registered: Mar 2010
Distribution: Arch
Posts: 211

Rep: Reputation: 67
Quote:
Originally Posted by uni_lel7 View Post
Hi,
PHP Code:
for i in `seq 1 800000`
  do
    
snp=`head -$i snpnumbers.txt | tail -1 | awk '{print $2}'`
    
cut -f1,$snp -' ' genefile.txt 
    
....
  
done 
Yow. That's going to hurt.

Code:
awk '{print "cut -f1,"$2" -d\\  genefile.text"}' snpnumbers.txt | sh
will cut out an awful lot of the inefficiency, if the result is still dreadful there's lots more to get out of it. Does snpnumbers.txt select one field or multiple fields? It looks like field one might be a sort key, is the result going to be sorted on that?

Last edited by jthill; 01-24-2012 at 08:08 PM. Reason: Grrr. quoting.
 
Old 01-24-2012, 08:42 PM   #4
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.9, Centos 7.3
Posts: 17,417

Rep: Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397
Looks like you've got file1 snpnumbers.txt with 800,000 recs and you want the 2nd field value (X) from each rec.
You then want to output the 1st & X'th value from file2 genefile.text, which is just 1 very(!) long rec with space separated field vals.
If that's correct (& even if it isn't) I'd definitely use Perl.
It's ideal for that and in fact a lot of Perl is used in bio-related work http://www.bioperl.org/wiki/Main_Page

HTH
http://perldoc.perl.org/
http://www.perlmonks.org/?node=Tutorials
http://www.tizag.com/perlT/index.php
 
Old 01-24-2012, 10:15 PM   #5
uni_lel7
LQ Newbie
 
Registered: Jan 2012
Location: Melbourne
Posts: 12

Original Poster
Rep: Reputation: Disabled
Thanks jthill: Your method is nice and easy but it still takes a few minutes to run 1 snp. I have 2500 to loop through! This is why im trying to avoid using cut.

I like the idea of calling snp in an array but im having a little trouble executing it.
Im not overly familiar with while loops!

PHP Code:
awk '{ a[NR]=$2 }  END { for(i=1;i<=NR;i++) print a[i] }' snpnumbers.txt 
| while read a; do awk '{print $1,$i}' genefile.txtdone 
But yes chrism01 - That is exactly what im trying to do. Perhaps I will have to have a try of perl.

Last edited by uni_lel7; 01-24-2012 at 10:17 PM.
 
Old 01-24-2012, 10:55 PM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,564

Rep: Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901
Again could we see some data before and after? Only has to be a few lines of each file and its associated output.
 
Old 01-24-2012, 11:17 PM   #7
uni_lel7
LQ Newbie
 
Registered: Jan 2012
Location: Melbourne
Posts: 12

Original Poster
Rep: Reputation: Disabled
Sorry of course you can,

genefile.txt (2500 lines)

A00000020 2 1 2 0 2 -> n=800,000
A00000022 2 2 2 2 2 ...
A00000027 2 2 2 1 2 ...
A00000029 1 2 2 2 2 ...

snpnumbers.txt (2000 lines)
BovineHD1000008269 270834 10 25336507 A C T G
BovineHD1000008270 270835 10 25337462 A G A G
BovineHD1000008288 270836 10 25396995 A G A G
BovineHD1000008294 270837 10 25404716 A G A G
BovineHD1000008297 270838 10 25421854 A G T C


So the first line will print column 270835 of genefile
In the script i will also print a 'phenotype score' column that matches to the ID.

phen.txt
A00001975 2118 12.2593
A00001978 1734 17.8009
A00002186 1527 28.8607
A00000279 5620 18.449
A00000327 9368 26.8168
A00002002 1367 10.6606

The output I need is $ID, $GENOTYPE, $PHENOTYPE
A00000020 2 29.4022
A00000022 1 23.3695
A00000027 1 6.0783
A00000029 0 16.3565
A00000378 0 15.7392
 
Old 01-24-2012, 11:30 PM   #8
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.9, Centos 7.3
Posts: 17,417

Rep: Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397Reputation: 2397
Hmmmm, but this 'snpnumbers.txt (2000 lines)' (your post #7) does NOT match up with 'snpnumbers.txt with 800,000 recs' (my post #4) even though in your post#5 you say it does.
Also '29.4022' in your wanted o/p does not appear in any of your other 'example' data....

Can you check post #7 and provide correct actual example?
 
Old 01-24-2012, 11:37 PM   #9
uni_lel7
LQ Newbie
 
Registered: Jan 2012
Location: Melbourne
Posts: 12

Original Poster
Rep: Reputation: Disabled
Yes this data I have provided is definitely accurate, i was just trying to provide a simple example above which is why i left out phenotype, but yes that for loop should be to 2000.

I can do this part..
##Take snp number from pathway and writes to array. Loop through array

But am stuck on the while loop...
## For each animal ID print ($ID $geno file 1) and ($3 (phenotype) file 2)

awk '{ a[NR]=$2 } END { for(i=1;i<=NR;i++) print a[i] }' tmp/snpdata | while read a;
#works
do awk 'FNR==NR{ID[$1]=$1;next}(ID[$1]) {print $1,$3,$ID[$1]}' OFS='\t' genefile.txt phen.txt; done
#doesnt work
 
Old 01-25-2012, 02:22 AM   #10
jthill
Member
 
Registered: Mar 2010
Distribution: Arch
Posts: 211

Rep: Reputation: 67
Okay, the extra info helps. Genefile.txt has 2500 ~1.5MB records, you're generating what looks like tagged permutations of each, generating about 40G of output. You're starting to get into data-processing territory here.

I'm assuming you don't have hundreds of gigs RAM to play with. This code adds additional fields to ease resequencing to match your original. It also sorts phen.txt and puts a ":" between the ID and GENOTYPE in phen.txt to make `join` usable.

Code:
$ awk '{print $1":"$2" "$3;}' phen.txt \
    | sort > phen-processed.txt
$ awk ' 
  ARGIND==1 { snp[nrsnps++]=$2; next; }

  { 
          for ( i = 0 ; i < nrsnps ; ++i )
                  print $1":"$snp[i],FNR,i+1;
  }
' snpnumbers.txt genefile.txt | sort | join phen-processed.txt - >payload.txt
$

Last edited by jthill; 01-25-2012 at 02:27 AM.
 
Old 01-25-2012, 04:08 AM   #11
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,564

Rep: Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901Reputation: 2901
Hmmm ... I am still missing something

Let me see if this is what you are saying?

1. Based on second column of snpnumbers.txt you want to retrieve this column + 1 from genefile.txt

2. Using the first column of genefile.txt, from line being used in step 1, you wish to match this with the first column in phen.txt and return column ??? (I do not know here as none of the output
matches with any from this file in your output data)

3. Form output of the form id, genotype, phenotype

If the above is correct then are we simply looking for, based on current examples, output of 2000 lines, ie the number of lines in snpnumbers.txt?
 
Old 01-26-2012, 06:27 PM   #12
uni_lel7
LQ Newbie
 
Registered: Jan 2012
Location: Melbourne
Posts: 12

Original Poster
Rep: Reputation: Disabled
Grail, Your points 1 and 3 are right, but the output will have 2500 lines (1 for each animal ID). The genotype line is the column of snpnumbers.txt which will change each time I loop through.

As for the phenotype file you mentioned in point 2. Any column is fine, for the example 3. In reality I have >20 columns.
 
Old 01-26-2012, 07:19 PM   #13
uni_lel7
LQ Newbie
 
Registered: Jan 2012
Location: Melbourne
Posts: 12

Original Poster
Rep: Reputation: Disabled
Hi jthill, thanks again for your help. I do have quite a bit or RAM to use. 100GB is fine, 500GB is pushing it. I tried this script you suggested but it took 32 seconds to run one SNP.

PHP Code:
time awk 'ARGIND==1 { snp[nrsnps++]=$2; next; }{ for ( i = 0 ; i < 1 ; ++i ) print $1, $snp[i]}' snpnumbers.txt genefile.txt >tmpfile.txt 
 
Old 01-26-2012, 08:45 PM   #14
uni_lel7
LQ Newbie
 
Registered: Jan 2012
Location: Melbourne
Posts: 12

Original Poster
Rep: Reputation: Disabled
I think I will just have to accept that this will take a long time to run. Ive tried a while loop and even a perl script.
It seems paste is the quickest. Thanks anyway for your help i really appreciate it
 
Old 01-26-2012, 10:19 PM   #15
jthill
Member
 
Registered: Mar 2010
Distribution: Arch
Posts: 211

Rep: Reputation: 67
You're welcome. Yeah, at this point you're down to about a one-day job to get it done, right? And you're going to be I/O bound soon if not already, not much benefit trying to improve it unless you'll be doing this repeatedly. It definitely could be improved if so.

Hey, try one thing: pipe the output through gzip and see if that helps. If you're I/O bound I expect it will help a lot, 'cause you're sure to have an idle core or three lying around to run the compression.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] awk nested if, command liner patolfo Programming 37 05-14-2010 01:19 AM
paste, cut one liner conversation from bash to csh bioinformatics_guy Linux - Newbie 2 09-02-2009 11:36 AM
awk one liner help niknak Linux - Newbie 1 05-07-2009 05:52 AM
How to use command grep,cut,awk to cut a data from a file? hocheetiong Linux - Newbie 7 09-11-2008 08:16 PM
cut / awk command?? Sammy2ooo Linux - Newbie 1 05-27-2003 06:46 PM

LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie

All times are GMT -5. The time now is 02:31 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration