loading a variable with awk in a for loop

Dr_Noob · 03-15-2012, 06:18 PM

So I am writing a script that needs to take a file as input that looks like:

GENE CHR START END
GALNT10 5 153570295 153800543
KLHL32 6 97372496 97588630
FTO 16 53737875 54148379
MC4R 18 58038564 58040001
SEC16B 1 177897489 177939050
ADCY3 2 25042039 25142055
GNPDA2 4 44704168 44728612

and use that info to process another file and generate output that falls within the boundaries of START and END.

The script takes several arguments, where REGION is the file above, and FREQAA and FREQEA are additional files:

FREQAA=$1
FREQEA=$2
REGION=$3
BUFFER=$4

Everything was working fine until I put in this for loop:

for i in $(awk '{print $2}' $GENE.bim | head -$NR);do

EA1=$(grep -w -$i $FREQEA | awk '{print $3}')
EA2=$(grep -w -$i $FREQEA | awk '{print $4}')
AA1=$(grep -w -$i $FREQAA | awk '{print $3}')
AA2=$(grep -w -$i $FREQAA | awk '{print $4}')

In other similar lines earlier in the script, this type of thing works fine. Inside the for loop, though, instead of loading $EA1 with the third column of the line containing $i, it writes the value of the 3rd argument.

grail · 03-16-2012, 04:48 AM

I think you may have to explain further what you are attempting to do, as the present code snippet makes no sense to me. Also, please use [code][/code] tags when displaying code.

Maybe you could start by explaining where the NR variable comes from in the line below:

Code:

for i in $(awk '{print $2}' $GENE.bim | head -$NR);do

David the H. · 03-17-2012, 11:50 AM

To start with, don't read lines with for. Use a while+read loop, with the awk command supplied by a process substitution (assuming bash, of course).

You shouldn't need to use head either. You can import the variable directly into awk and use it to output the lines you want.

Code:

while read i; do

	commands

done < <( awk '( NR <= ln ) { print $2 }' "ln=$NR" "$GENE.bim" )
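
For reference, here is a self-contained sketch of that loop, assuming bash. The sample file, `max_lines`, and the `ids` array are made up for illustration; they stand in for "$GENE.bim" and $NR from the original script, and the sample column layout is only a guess at what the real file looks like.

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for "$GENE.bim" (column 2 = marker ID).
bim_file=$(mktemp)
printf '%s\n' '5 rs111 0 153570295 A G' \
              '5 rs222 0 153570400 C T' \
              '5 rs333 0 153570500 G A' > "$bim_file"

max_lines=2   # stands in for $NR in the original script
ids=()

# awk prints column 2 of the first max_lines lines; read takes one ID per pass.
# Note the space in "< <(": a redirection followed by a process substitution.
while IFS= read -r id; do
    ids+=("$id")
done < <( awk -v ln="$max_lines" 'NR <= ln { print $2 }' "$bim_file" )

printf 'read %d IDs: %s\n' "${#ids[@]}" "${ids[*]}"
rm -f "$bim_file"
```

Because the loop is fed by a redirection rather than a pipe, it runs in the current shell, so anything set inside it (here the `ids` array) is still available after `done`.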

Edit: Similarly, you shouldn't need to use grep and awk (or grep and sed) together in the sub-commands.

Code:

EA1=$( awk '( $0 ~ pat ) { print $3 }' 'pat=\\<'"$i"'\\>' "$FREQEA" )

The "pat" variable in awk is treated as a regex to match, so that only lines containing the pattern will be printed. \< and \> are regex word boundary anchors (a GNU awk extension, so they may not work in other awk implementations), used to replicate the behavior of grep's "-w" option. Unfortunately though, awkward quoting and backslashing are needed to properly pass them to awk. If you don't need the whole-word matching condition, you can simply use "pat=$i".

Edit2: a slightly cleaner way to handle the regex is to add the word boundaries to the variable inside awk instead.

Code:

EA1=$( awk 'NR == 1 { pat = "\\<" pat "\\>" } ; ( $0 ~ pat ) { print $3 }' "pat=$i" "$FREQEA" )
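
Where gawk's word-boundary operators are not available, comparing fields directly is one portable way to get a whole-word match. The following is only a sketch: the sample data and the `lookup` function name are invented for illustration, and it assumes the ID always appears as a complete whitespace-separated field, which is slightly stricter than what grep -w considers a word. It also fetches columns 3 and 4 in a single pass instead of scanning the file once per column.

```shell
# Hypothetical sample standing in for "$FREQEA".
freq_file=$(mktemp)
printf '%s\n' '5 rs11 0.25 0.75' \
              '5 rs111 0.10 0.90' \
              '5 rs1 0.40 0.60' > "$freq_file"

# lookup ID FILE: print columns 3 and 4 of each line where some field equals ID.
lookup() {
    awk -v id="$1" '{
        for (f = 1; f <= NF; f++)
            if ($f == id) { print $3, $4; next }
    }' "$2"
}

# One awk call fills both variables; "rs11" and "rs111" do not match "rs1".
read -r ea1 ea2 <<EOF
$(lookup rs1 "$freq_file")
EOF

echo "ea1=$ea1 ea2=$ea2"
rm -f "$freq_file"
```

If the ID is known to live in one particular column, the for loop can be replaced by a single comparison against that column.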

Speaking of which, QUOTE ALL OF YOUR VARIABLE SUBSTITUTIONS. You should never leave the quotes off a variable expansion unless you explicitly want the resulting string to be word-split by the shell. This is a vitally important concept in scripting, so train yourself to do it correctly now. You can learn about the exceptions later.
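
A small demonstration of what goes wrong without the quotes, assuming the default IFS. The file name and the `count_args` helper are made up for the example.

```shell
# A value containing whitespace, as a file name or argument easily might.
file='my data.txt'

# count_args prints how many arguments it received.
count_args() { echo "$#"; }

unquoted=$(count_args $file)     # word-split into "my" and "data.txt"
quoted=$(count_args "$file")     # passed as a single argument

echo "unquoted: $unquoted, quoted: $quoted"
```

The unquoted expansion hands the command two arguments where one was intended, which is the kind of silent breakage that quoting prevents.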

Also, environment variables are generally all upper-case. So while not absolutely necessary, it's good practice to keep your own user variables in lower-case or mixed-case, to help differentiate them.