LinuxQuestions.org
Old 01-18-2012, 11:28 PM   #1
uni_lel7
LQ Newbie
 
Registered: Jan 2012
Location: Melbourne
Posts: 12

Rep: Reputation: Disabled
looping through a variable with awk


Hi,

For each line in random.txt I want to subset the girls from girls.txt which fit the criteria (age between min and max and same hair colour). Here is an example of my file structure:

random.txt

Min_age Max_age Hair_colour
14 15 Brown
9 11 Red
5 7 Blonde
2 3 Brown
89 91 Red

And girls.txt

Age Haircolour
10 Brown
11 Brown
3 Red
2 Blonde
90 Red

I have a script that does what I want, but it's not quick enough. This loop takes 30 seconds to run, but I have to run it 10,000 times.

Code:
for i in `seq 1 84`
do
    min=`head -$i tmp/random.txt | tail -1 | awk '{print $1-500000}'`
    max=`head -$i tmp/random.txt | tail -1 | awk '{print $2+500000}'`
    colour=`head -$i tmp/random.txt | tail -1 | awk '{print $3}'`
    awk '{ if($3=='$colour' && $4>'$min' && $4<'$max') print $0}' girls.txt >> tmp/filter.txt
done
I have tried to only call the random.txt file once...

Code:
for i in `seq 1 84`
do
    awk '{if(NR=="'$i'") print $1-500000,$2+500000,$3}' tmp/random.txt >> info.txt
done
min=`awk '{print $1}' info.txt`
max=`awk '{print $2}' info.txt`
colour=`awk '{print $3}' info.txt`
But I am unsure how to then pick each value in turn from "min, max, colour".

I hope this makes sense and I would appreciate any help. Cheers.
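One possible way to step through min, max and colour one line at a time is to let the shell's read split each line into three variables. The following is only a sketch on the toy data from this post: the temporary directory, the -v names lo, hi and c, and the inclusive age comparison are assumptions made for illustration. It still starts one awk per line of random.txt, so it removes the head/tail calls but not the repeated scans of girls.txt.

```shell
#!/bin/sh
# Toy copies of the two files from the post (header lines omitted).
workdir=$(mktemp -d)
cat > "$workdir/random.txt" <<'EOF'
14 15 Brown
9 11 Red
5 7 Blonde
2 3 Brown
89 91 Red
EOF
cat > "$workdir/girls.txt" <<'EOF'
10 Brown
11 Brown
3 Red
2 Blonde
90 Red
EOF

# read splits each line of random.txt into three shell variables,
# so no head/tail/awk call is needed to pick out the fields.
while read -r min max colour; do
    # -v hands the shell values to awk as awk variables (no quote splicing).
    awk -v lo="$min" -v hi="$max" -v c="$colour" \
        '$2 == c && $1 >= lo && $1 <= hi' "$workdir/girls.txt"
done < "$workdir/random.txt" > "$workdir/filter.txt"

cat "$workdir/filter.txt"
```

On this toy data only the last line of random.txt (89 91 Red) has a matching girl.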
 
Old 01-19-2012, 02:28 AM   #2
rodrifra
Member
 
Registered: Mar 2007
Location: Spain
Distribution: Debian
Posts: 201

Rep: Reputation: 36
I don't understand why you cycle from 1 to 84; you can get those lines within awk without using a for loop. In the second piece of code you can remove the for loop and do:
Code:
awk '{if(NR<=84) print $1-500000,$2+500000,$3}' tmp/random.txt >> info.txt
I also don't understand why you check columns with awk that don't exist in the file (girls.txt only has 2 columns and you are checking columns 3 and 4).

So... what exactly do you want to get?

Last edited by rodrifra; 01-19-2012 at 03:03 AM.
 
Old 01-19-2012, 06:04 AM   #3
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,550

Rep: Reputation: 2898
I too would query why not just do the whole exercise in awk?
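As a rough illustration of what doing the whole exercise in awk could look like on the toy files from post #1, here is a sketch. The array names lo, hi and col, the counter n, and the inclusive age test are assumptions made for this example, and header lines are left out. Each file is read exactly once and no extra process is started per line.

```shell
#!/bin/sh
# Toy copies of the two files from post #1 (header lines omitted).
workdir=$(mktemp -d)
cat > "$workdir/random.txt" <<'EOF'
14 15 Brown
9 11 Red
5 7 Blonde
2 3 Brown
89 91 Red
EOF
cat > "$workdir/girls.txt" <<'EOF'
10 Brown
11 Brown
3 Red
2 Blonde
90 Red
EOF

out=$(awk '
    # First file (random.txt): remember every criteria line by its row number.
    FNR == NR { lo[NR] = $1 + 0; hi[NR] = $2 + 0; col[NR] = $3; n = NR; next }
    # Second file (girls.txt): print a girl once if any criteria line matches.
    {
        for (i = 1; i <= n; i++)
            if ($2 == col[i] && $1 + 0 >= lo[i] && $1 + 0 <= hi[i]) { print; break }
    }
' "$workdir/random.txt" "$workdir/girls.txt")

echo "$out"
```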
 
Old 01-19-2012, 03:38 PM   #4
uni_lel7
LQ Newbie
 
Registered: Jan 2012
Location: Melbourne
Posts: 12

Original Poster
Rep: Reputation: Disabled
Thanks for your help, rodrifra. I actually don't have any computational background; I'm a biologist using next-gen sequencing data, so this is all fairly new to me. This is also why the data example doesn't match my script. There is actually more to the file.

I am trying to subset the 'girls.txt' file for data that matches the conditions described in each line of 'random.txt'. In my actual data, I have a massive file with SNP position and chromosome (girls.txt) and a file with gene start and stop positions and chromosome number (random.txt). I want the SNPs that fall within the gene positions, so I need to use all 3 variables in info.txt. I simply need this script to run faster.

Code:
for i in `seq 1 84`
do
    min=`head -$i tmp/random.txt | tail -1 | awk '{print $1-500000}'`
    max=`head -$i tmp/random.txt | tail -1 | awk '{print $2+500000}'`
    colour=`head -$i tmp/random.txt | tail -1 | awk '{print $3}'`
    awk '{ if($3=='$colour' && $4>'$min' && $4<'$max') print $0}' girls.txt >> tmp/filter.txt
done

Last edited by uni_lel7; 01-19-2012 at 03:40 PM.
 
Old 01-19-2012, 04:58 PM   #5
devUnix
Member
 
Registered: Oct 2010
Location: Bengaluru, India
Distribution: RHEL 5.1 on My PC, & SunOS / Sun Solaris, RHEL, SuSe, Debian, FreeBSD and other Linux flavors @ Work
Posts: 584

Rep: Reputation: 59
Hi,


It is almost 3:30 AM here and I have to leave for the office at 8 AM, so I will try to help you in the morning. You mentioned above "There is actually more to the file.", so if your data look different from the example you gave, could you post some sample records from your actual data files? That would make it easier to get you the exact result you want.

Good Night... to me at least!
 
Old 01-19-2012, 05:39 PM   #6
uni_lel7
LQ Newbie
 
Registered: Jan 2012
Location: Melbourne
Posts: 12

Original Poster
Rep: Reputation: Disabled
Thanks devUnix, here is an example of my true data...


800k_map.txt - animal ID, SNP number, chromosome, SNP position BP
015 1 1 36337
026 2 1 78655
027 3 1 83412
873 4 1 135098
038 5 1 137548
042 6 1 149772
043 7 1 151060
044 8 1 152374
066 9 1 155938
048 10 1 158820

info.txt - gene start and stop positions and chromosome number
693633 1700624 14
68133477 69195567 7
116916542 118154254 2
84662663 85761167 8
134769869 135771305 1
54753845 55866806 19
7143401 8153963 11
73052335 74087964 6
42189093 43196591 15
62817581 63861259 4

Code:
cat 800k_map.txt | while read a;
do
    chr=`echo $a | gawk '{print $3}'`
    pos=`echo $a | gawk '{print $4}'`
    cat info.txt | gawk -v c="$chr" -v p="$pos" '$3==c && $1<=p && $2>=p {print c,p}'
done
This script uses the 'a' method suggested above but still does not work...
 
Old 01-19-2012, 11:39 PM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,550

Rep: Reputation: 2898
So if I am following the new information, you want to output the chromosome and SNP position BP wherever the third field in info.txt matches the chromosome and the SNP position BP lies between the gene start and stop positions.

Assuming the above is correct, may I ask which file is the smaller? The reason for this question is that the solution I would present in awk requires reading the first file into variables to then be checked against the second (and of course storing the smaller file would be quicker and less memory intensive).

To give you an idea based on what you have shown:
Code:
awk 'FNR==NR{low[$3]=$1;high[$3]=$2;next}$4 >= low[$3] && $4 <= high[$3]{print $3,$4}' info.txt 800k_map.txt
A few things to note:

1. If my supposition at the top is correct, the sample data presented will yield no results with this script

2. I presumed both files have no headers in them (ie just the data)

3. info.txt (the second file you showed) is read first, because in 800k_map.txt the third field is currently the same for every value in the fourth field

Please let me know if any of this is unclear or if I went off on the wrong track of what you needed?
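One thing worth checking against the real data: the one-liner above keeps a single low/high pair per chromosome, because low[$3] and high[$3] are overwritten whenever a later line of info.txt has the same chromosome. If info.txt can contain several genes on one chromosome (an assumption; the sample shown has one per chromosome), a variant along these lines may be needed. The rows below are invented purely for illustration, and count is a hypothetical helper array.

```shell
#!/bin/sh
# Invented sample rows, NOT the poster's data.
workdir=$(mktemp -d)
# info.txt layout: gene start, gene stop, chromosome
cat > "$workdir/info.txt" <<'EOF'
100 200 1
500 600 1
1000 2000 2
EOF
# 800k_map.txt layout: animal ID, SNP number, chromosome, SNP position
cat > "$workdir/800k_map.txt" <<'EOF'
a1 1 1 150
a2 2 1 300
a3 3 1 550
a4 4 2 1500
a5 5 3 150
EOF

out=$(awk '
    # First file: store every gene interval, numbered per chromosome.
    FNR == NR {
        k = ++count[$3]
        low[$3, k] = $1 + 0
        high[$3, k] = $2 + 0
        next
    }
    # Second file: test the SNP against each interval on its chromosome.
    {
        for (i = 1; i <= count[$3]; i++)
            if ($4 + 0 >= low[$3, i] && $4 + 0 <= high[$3, i]) { print $3, $4; break }
    }
' "$workdir/info.txt" "$workdir/800k_map.txt")

echo "$out"
```

With these invented rows the SNP at position 150 on chromosome 1 is kept, which the single-pair version would drop because the second chromosome 1 gene overwrites the first.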
 
2 members found this post helpful.
Old 01-22-2012, 05:49 PM   #8
uni_lel7
LQ Newbie
 
Registered: Jan 2012
Location: Melbourne
Posts: 12

Original Poster
Rep: Reputation: Disabled
WOW! It works!! You've cut the running time down from about 600 hours to 16 hours!
I've been trying to figure this out for days. Would you mind explaining why this FNR==NR method works so much quicker and what [$3] specifies?

You've been so helpful. Thank you so much!

Last edited by uni_lel7; 01-22-2012 at 10:06 PM.
 
Old 01-23-2012, 12:20 AM   #9
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,550

Rep: Reputation: 2898
FNR = The current record number in the current file. FNR is incremented each time a new record is read (see Records). It is reinitialized to zero each time a new input file is started.
NR = The number of input records awk has processed since the beginning of the program's execution (see Records). NR is incremented each time a new record is read.
$X = Where 'X' corresponds to the field created by the field separator, hence in our example $3 would refer to the third field
[] = Square brackets are used to denote an array, hence low[] is an array and $3 is being used as the array index
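
To make the FNR==NR test concrete, here is a small sketch with two invented files (first.txt and second.txt are made-up names). NR keeps counting across files while FNR restarts, so the two are equal only while the first file is being read. The speed-up most likely comes from the fact that each file is read once and the check is an in-memory array lookup, instead of starting several new processes for every input line.

```shell
#!/bin/sh
# Two tiny invented files, created in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'a\nb\n' > first.txt
printf 'c\nd\ne\n' > second.txt

# Print the file name, NR, FNR, and whether FNR == NR holds on each line.
out=$(awk '{ print FILENAME, NR, FNR, (FNR == NR ? "first-file" : "later-file") }' first.txt second.txt)

echo "$out"
# first.txt 1 1 first-file
# first.txt 2 2 first-file
# second.txt 3 1 later-file
# second.txt 4 2 later-file
# second.txt 5 3 later-file
```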

See here for a full awk manual
 
1 members found this post helpful.
Old 01-23-2012, 04:18 PM   #10
uni_lel7
LQ Newbie
 
Registered: Jan 2012
Location: Melbourne
Posts: 12

Original Poster
Rep: Reputation: Disabled
Fantastic, Thanks again!
 
  

