[SOLVED] looping through a variable with awk

uni_lel7 · 01-18-2012, 10:28 PM

Hi,

For each line in random.txt I want to subset the girls from girls.txt which fit the criteria (age between min and max and same hair colour). Here is an example of my file structure:

random.txt

Min_age Max_age Hair_colour
14 15 Brown
9 11 Red
5 7 Blonde
2 3 Brown
89 91 Red

And girls.txt

Age Haircolour
10 Brown
11 Brown
3 Red
2 Blonde
90 Red

I have a script that does what I want, but it's not quick enough. This loop takes 30 seconds to run, but I have to run it 10,000 times.

Code:

for i in `seq 1 84` 
    do 
  min=`head -$i tmp/random.txt | tail -1 | awk '{print $1-500000}'` 
  max=`head -$i tmp/random.txt | tail -1 | awk '{print $2+500000}'` 
  colour=`head -$i tmp/random.txt | tail -1 | awk '{print $3}'` 
  awk '{ if($3=='$colour' && $4>'$min' && $4<'$max') print $0}' girls.txt >> tmp/filter.txt 
done

I have tried to only call the random.txt file once...

Code:

for i in `seq 1 84` 
     do   awk '{if(NR=="'$i'") print $1-500000,$2+500000,$3}' tmp/random.txt >> info.txt 
done 
min=`awk '{print $1}' info.txt` 
max=`awk '{print $2}' info.txt` 
colour=`awk '{print $3}' info.txt`

But I am unsure how to then pick out each value from "min, max, colour" one line at a time.

I hope this makes sense and I would appreciate any help. Cheers.

rodrifra · 01-19-2012, 01:28 AM

I don't understand why you cycle from 1 to 84; you can get those lines within awk without using a for loop. In the second piece of code you can remove the for loop and do:

Code:

awk '{if(NR<=84) print $1-500000,$2+500000,$3}' tmp/random.txt >> info.txt

I also don't understand why you use awk to check columns that don't exist in the file (girls.txt only has 2 columns, and you are checking columns 3 and 4).

So... what exactly do you want to get?
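
A minimal sketch of one way to pick the three values per line without `head | tail`: let the shell's `read` split each line of random.txt into variables and hand them to awk with `-v`. It assumes the toy files from the first post with the header lines removed, and it tests fields 1 and 2 of girls.txt (the real data evidently has more columns). `filter.txt` is just an illustrative output name.

```shell
# Toy copies of the example files (header lines left out; an assumption).
cat > random.txt <<'EOF'
14 15 Brown
9 11 Red
5 7 Blonde
2 3 Brown
89 91 Red
EOF
cat > girls.txt <<'EOF'
10 Brown
11 Brown
3 Red
2 Blonde
90 Red
EOF

# Read each criteria line once; read splits it into three shell variables,
# which are passed to awk as awk variables instead of being spliced into
# the program text.
while read -r min max colour; do
    awk -v lo="$min" -v hi="$max" -v c="$colour" \
        '$2 == c && $1 > lo && $1 < hi' girls.txt
done < random.txt > filter.txt

cat filter.txt
```

On the toy data only `90 Red` matches. This still starts one awk per line of random.txt, so it removes the repeated head/tail calls but not the repeated scans of girls.txt.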

grail · 01-19-2012, 05:04 AM

I too would query why not just do the whole exercise in awk?

uni_lel7 · 01-19-2012, 02:38 PM

Thanks for your help rodrifra. I actually don't have any computational background; I'm a biologist using next-gen sequencing data, so this is all fairly new to me. This is also why the data example doesn't match my script. There is actually more to the file.

I am trying to subset the 'girls.txt' file for data that matches the conditions described in each line of 'random.txt'. In my actual data, I have a massive file with SNP position and chromosome (girls.txt) and a file with gene start and stop positions and chromosome number (random.txt). I want the SNPs that fall within the gene positions, so I need to use all 3 variables in info.txt. I simply need this script to run faster.

Code:

for i in `seq 1 84`  
    do  
  min=`head -$i tmp/random.txt | tail -1 | awk '{print $1-500000}'`  
  max=`head -$i tmp/random.txt | tail -1 | awk '{print $2+500000}'`  
  colour=`head -$i tmp/random.txt | tail -1 | awk '{print $3}'`  
  awk '{ if($3=='$colour' && $4>'$min' && $4<'$max') print $0}' girls.txt >> tmp/filter.txt  
done

devUnix · 01-19-2012, 03:58 PM

Hi,

It is almost 3:30 AM here and I have to leave for the office at 8 AM, so I will try to help you in the morning. You mentioned above that "There is actually more to the file." If your data look different from the example you gave, could you post some sample records from your actual data files? That would help in getting you the exact result you want.

Good Night... to me at least!

uni_lel7 · 01-19-2012, 04:39 PM

Thanks devUnix, here is an example of my true data...

800k_map.txt - animal ID, SNP number, chromosome, SNP position BP
015 1 1 36337
026 2 1 78655
027 3 1 83412
873 4 1 135098
038 5 1 137548
042 6 1 149772
043 7 1 151060
044 8 1 152374
066 9 1 155938
048 10 1 158820

info.txt - gene start and stop positions and chromosome number
693633 1700624 14
68133477 69195567 7
116916542 118154254 2
84662663 85761167 8
134769869 135771305 1
54753845 55866806 19
7143401 8153963 11
73052335 74087964 6
42189093 43196591 15
62817581 63861259 4

Code:

cat 800k_map.txt | while read a; 
do 
chr=`echo $a | gawk '{print $3}'` 
pos=`echo $a | gawk '{print $4}'` 
cat info.txt | gawk -v c="$chr" -v p="$pos"  '$3==c && $1<=p && $2>=p {print c,p}' 
done

This script uses the 'a' method suggested above but still does not work...

grail · 01-19-2012, 10:39 PM

So if I am following the new information, you wish to output the chromosome and SNP position BP from 800k_map.txt wherever a line in info.txt has the same chromosome in its third field and the SNP position BP lies between that line's gene start and stop positions.

Assuming the above is correct, may I ask which file is the smaller? The reason for this question is that the solution I would present in awk requires reading the first file into variables, which are then checked against the second (and of course storing the smaller file would be quicker and less memory intensive).

To give you an idea based on what you have shown:

Code:

awk 'FNR==NR{low[$3]=$1;high[$3]=$2;next}$4 >= low[$3] && $4 <= high[$3]{print $3,$4}' info.txt 800k_map.txt
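
Spread over several lines, the one-liner can be read as below. This is only a sketch on made-up sample files (`info_demo.txt`, `map_demo.txt` and `hits_demo.txt` are hypothetical names, and the numbers are invented so that some lines match). The awk program itself is unchanged.

```shell
# Invented gene table: start, stop, chromosome.
cat > info_demo.txt <<'EOF'
100 200 1
500 900 2
EOF
# Invented SNP map: animal ID, SNP number, chromosome, position.
cat > map_demo.txt <<'EOF'
a1 1 1 150
a2 2 1 250
a3 3 2 500
a4 4 2 950
a5 5 3 150
EOF

awk '
    # While the first file is being read, FNR equals NR:
    # remember start and stop, indexed by chromosome, then skip to the next line.
    FNR == NR {
        low[$3]  = $1
        high[$3] = $2
        next
    }
    # Every later line comes from the second file:
    # print chromosome and position when the position lies inside the stored range.
    $4 >= low[$3] && $4 <= high[$3] { print $3, $4 }
' info_demo.txt map_demo.txt > hits_demo.txt

cat hits_demo.txt
# 1 150
# 2 500
```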

A few things to note:

1. Given my supposition at the top, the data presented so far will yield no results with this script

2. I presumed both files have no headers in them (i.e. just the data)

3. The second file you listed (info.txt) is read first, because in 800k_map.txt the third field currently has the same value for every value of the fourth field, so it would not work as the array index

Please let me know if any of this is unclear or if I have gone off on the wrong track about what you need.
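
One caveat worth checking against the real data: `low[$3]` and `high[$3]` hold a single value per chromosome, so if info.txt lists several genes on the same chromosome, each new line overwrites the previous one and only the last gene is tested. A possible variant, sketched on invented data with hypothetical file names, stores every range under a (chromosome, counter) index and loops over the ranges for that chromosome:

```shell
# Invented gene table with two genes on chromosome 1: start, stop, chromosome.
cat > genes_demo.txt <<'EOF'
100 200 1
400 500 1
50 60 2
EOF
# Invented SNP map: animal ID, SNP number, chromosome, position.
cat > snps_demo.txt <<'EOF'
x1 1 1 150
x2 2 1 450
x3 3 1 300
x4 4 2 55
x5 5 3 55
EOF

awk '
    # First file: keep every gene, numbered per chromosome.
    FNR == NR {
        n = ++count[$3]
        low[$3, n]  = $1
        high[$3, n] = $2
        next
    }
    # Second file: test the SNP against each gene on its chromosome
    # and print it once, at the first gene that contains it.
    {
        for (i = 1; i <= count[$3]; i++)
            if ($4 >= low[$3, i] && $4 <= high[$3, i]) {
                print $3, $4
                break
            }
    }
' genes_demo.txt snps_demo.txt > multi_hits_demo.txt

cat multi_hits_demo.txt
# 1 150
# 1 450
# 2 55
```

With the single-value arrays, the SNP at position 150 would be missed here, because the second gene on chromosome 1 replaces the first.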

uni_lel7 · 01-22-2012, 04:49 PM

WOW! It works!! You've cut the running time down from about 600 hours to 16 hours!
I've been trying to figure this out for days. Would you mind explaining why this FNR==NR method works so much quicker and what [$3] specifies?

You've been so helpful

Thank you so much!

grail · 01-22-2012, 11:20 PM

FNR = The current record number in the current file. FNR is incremented each time a new record is read. It is reinitialized to zero each time a new input file is started.
NR = The number of input records awk has processed since the beginning of the program's execution. NR is incremented each time a new record is read.
$X = Field number X of the current record, as split by the field separator; hence in our example $3 refers to the third field
[] = Square brackets are used to denote an array; hence low[] is an array and $3 is being used as the array index
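
To see why `FNR==NR` singles out the first file, it can help to print both counters while awk reads two small files. The sketch below uses throwaway files (`first.txt`, `second.txt` and `nr_demo.txt` are hypothetical names). The speed-up itself most likely comes from awk reading each file once, instead of the shell starting several new processes for every line of the large file.

```shell
printf 'a\nb\n' > first.txt
printf 'c\nd\ne\n' > second.txt

awk '{
    # NR keeps counting across files; FNR restarts with each file,
    # so the two are equal only while the first file is being read.
    where = (FNR == NR) ? "first file" : "later file"
    print FILENAME, "NR=" NR, "FNR=" FNR, where
}' first.txt second.txt > nr_demo.txt

cat nr_demo.txt
# first.txt NR=1 FNR=1 first file
# first.txt NR=2 FNR=2 first file
# second.txt NR=3 FNR=1 later file
# second.txt NR=4 FNR=2 later file
# second.txt NR=5 FNR=3 later file
```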

See here for a full awk manual

uni_lel7 · 01-23-2012, 03:18 PM

Fantastic, Thanks again!