[SOLVED] awk command to direct the output to multiple files

Aemm · 11-21-2018, 03:55 AM

Hi all,
I have a list file containing the names of the files whose columns need to be processed. Each of those files looks like this:

Code:

FID       IID  PHENO    CNT   CNT2    SCORE
  00010   0001002      2     28      9 -0.00843036
  00017   0001702      1     28      9 0.00710286
  00028   0002801      2     28      9 -0.00125893

I want to split each file on the basis of the 3rd column: if it contains "1", I need to write only the score (6th column) to a separate file with the extension .control; if it contains "2", I need to write the 6th column to a file with the extension .case. Afterwards I need to run an R function on the case and control files. My code is

Code:

IFS=$'\n'
for FileName in $(cat Artery_Aorta-ListOfScoreFilesForScript.txt);
do awk '{ if ($3 == 1) {print $6 >> $FileName.control} else {print $6 >> $FileName.case}}' $FileName; 
done;

The above command gives this error:

Quote:

awk: cmd. line:1: { if ($3 == 1) {print $6 >> $FileName.control} else {print $6 >> $FileName.case}}
awk: cmd. line:1: ^ syntax error

However, if I only produce the .control files, i.e. take the rows with "1" in the 3rd column and write their score to a separate file, it works.

Code:

IFS=$'\n'
for FileName in $(cat Artery_Aorta-ListOfScoreFilesForScript.txt);  
do awk '{ if ($3 == 1) { print $6 } }' $FileName > $FileName.control;  done;

But I am unable to embed the else condition within the code. Can anyone let me know what's going wrong? Thanks.

pan64 · 11-21-2018, 04:15 AM

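What you mixed up/missed is that FileName is evaluated by the shell, not by awk. The awk script itself is: '{ if ($3 == 1) { print $6 } }', nothing more.

You cannot mix the two languages: the awk script cannot use $FileName as a variable, because it was defined in bash. Inside the single quotes the shell does not expand $FileName, and awk treats FileName as its own, uninitialized variable, so $FileName.control is not a valid output file name. If you want to use the shell value, you need to export it in the shell and read that environment variable from awk, or pass the variable to awk.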
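
A minimal sketch of those two options, for illustration only. The file name demo_scores.txt and its rows are made up here and are not the OP's data; the first command passes the shell value in with -v, the second exports it and reads it through awk's ENVIRON array.

```shell
# Hypothetical sample input; the name demo_scores.txt is only for illustration.
cat > demo_scores.txt <<'EOF'
FID IID PHENO CNT CNT2 SCORE
00010 0001002 2 28 9 -0.00843036
00017 0001702 1 28 9 0.00710286
EOF

FileName=demo_scores.txt

# Option 1: pass the shell value to awk as an awk variable with -v.
awk -v fn="$FileName" '$3 == 1 { print $6 > (fn ".control") }' "$FileName"

# Option 2: export the shell variable and read it via awk's ENVIRON array.
export FileName
awk '$3 == 2 { print $6 > (ENVIRON["FileName"] ".case") }' "$FileName"
```

In both cases the output file name is built inside awk by string concatenation, so no shell expansion is needed within the single-quoted script.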

l0f4r0 · 11-21-2018, 04:39 AM

Indeed, replace your code with the following:

Code:

for FileName in $(cat Artery_Aorta-ListOfScoreFilesForScript.txt); do cat "$FileName" | awk -v fn="$FileName" '{if ($3 == 1) {print $6 >>fn".control"} else {print $6 >>fn".case"}}'; done;

allend · 11-21-2018, 07:10 AM

This can also be done without invoking 'awk'.

Code:

#!/bin/bash

myfile="Artery_Aorta-ListOfScoreFilesForScript"

while read -a aline; do	
  if ((	${aline[2]} == 1 )); then 
    echo "${aline[5]}" >> "$myfile".control
  elif (( ${aline[2]} == 2 )); then 
    echo "${aline[5]}" >> "$myfile".case
  fi 
done < "$myfile".txt

If you are using R, why not just read the entire file, make PHENO a factor and then subset as necessary?

pan64 · 11-21-2018, 07:54 AM

Quote:

Originally Posted by l0f4r0

Indeed, replace your code with the following:

Code:

for FileName in $(cat Artery_Aorta-ListOfScoreFilesForScript.txt); do cat "$FileName" | awk -v fn="$FileName" '{if ($3 == 1) {print $6 >>fn".control"} else {print $6 >>fn".case"}}'; done;

Useless use of cat:
cat file | awk 'script' can be replaced by awk 'script' file
Also, combining for and cat is not recommended; a while loop should be used instead. awk already knows the name of the file it is processing (FILENAME), so there is no need to set a variable.

Code:

while read -r line; do
    awk '{if ($3 == 1) {print $6 >>FILENAME".control"} else {print $6 >>FILENAME".case"}}' $line
done < Artery_Aorta-ListOfScoreFilesForScript.txt

(not tested)

allend · 11-21-2018, 08:02 AM

Quote:

awk '{if ($3 == 1) {print $6 >>FILENAME".control"} else {print $6 >>FILENAME".case"}}' $line

That fails on the header line in the input file.

l0f4r0 · 11-21-2018, 08:21 AM

Quote:

Originally Posted by allend

That fails on the header line in the input file.

You mean "it works technically but the else part grabs the header line in files"?
If so, OP should prepend a sed or grep to his/her awk or add a condition in his/her else part...

allend · 11-21-2018, 08:53 AM

Quote:

If so, OP should prepend a sed or grep to his/her awk ...

Perhaps just 'tail -n+2'?

pan64 · 11-21-2018, 09:09 AM

awk can handle it; there is no need for any external tool (for example, post #4 has a solution). The variable NR can also be used:

Code:

NR == 1 { next }
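
Put together with the FILENAME approach, this could look roughly like the sketch below. The names demo_a.txt and demo_list.txt and the sample rows are hypothetical stand-ins for the OP's score files and list file.

```shell
# Hypothetical sample input and list file; names are for illustration only.
cat > demo_a.txt <<'EOF'
FID IID PHENO CNT CNT2 SCORE
00010 0001002 2 28 9 -0.00843036
00017 0001702 1 28 9 0.00710286
00028 0002801 2 28 9 -0.00125893
EOF
printf '%s\n' demo_a.txt > demo_list.txt

# One awk run per listed file; FILENAME is the file awk is currently reading.
while IFS= read -r f; do
  awk '
    NR == 1 { next }                             # skip the header line
    $3 == 1 { print $6 > (FILENAME ".control") } # controls
    $3 == 2 { print $6 > (FILENAME ".case") }    # cases
  ' "$f"
done < demo_list.txt
```

Because each awk run reads a single file, NR == 1 is always that file's header. If several files were passed to one awk invocation, FNR would be the per-file counter to use instead.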

Aemm · 11-21-2018, 08:59 PM

Thank you all for the replies. The code works. I have added an else if condition to the code, which handles the header line as well. But I am encountering another problem: I want to run a statistical test in R on the case and control files, and the variable $FileName is not being read by the R command. I think I am again mixing two languages, i.e. bash and R. For the R t-test, should I loop again to pick up the control and case files?

Aemm · 11-21-2018, 10:26 PM

The code which I am trying to run (on HPC) is

Code:

cd $PBS_O_WORKDIR
for FileName in $(cat Artery_Aorta-ListOfScoreFilesForScript.txt); 
do 
cat "$FileName" | awk -v fn="$FileName" '{if ($3 == 1) {print $6 >>fn".control"} else if ($3 == 2) {print $6 >>fn".case"}}'; 
Rscript t.test($FileName.control, $FileName.case, alternative = "two.sided", paired = FALSE, var.equal = FALSE)
done;

MadeInGermany · 11-22-2018, 12:27 AM

Quote:

Originally Posted by allend

This can also be done without invoking 'awk'.

Code:

#!/bin/bash

myfile="Artery_Aorta-ListOfScoreFilesForScript"

while read -a aline; do	
  if ((	${aline[2]} == 1 )); then 
    echo "${aline[5]}" >> "$myfile".control
  elif (( ${aline[2]} == 2 )); then 
    echo "${aline[5]}" >> "$myfile".case
  fi 
done < "$myfile".txt

If you are using R, why not just read the entire file, make PHENO a factor and then subset as necessary?

Each >> is an open/append/close.
This is very I/O intensive; if the files are on NFS, it would stress the NFS server.
The following opens/closes each file once, and even gives you the choice between > and >> (overwrite or append to an existing file).

Code:

#!/bin/bash

myfile="Artery_Aorta-ListOfScoreFilesForScript"

while read -a aline; do	
  if ((	${aline[2]} == 1 )); then 
    echo "${aline[5]}"
  elif (( ${aline[2]} == 2 )); then 
    echo "${aline[5]}" >&3
  fi
done < "$myfile".txt > "$myfile".control 3> "$myfile".case

Note that print in awk works like this, too:
the >> or > decides how the file is opened at the first write. Subsequent writes go to the already open stream, i.e. they append.
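
A small sketch to see this behaviour for yourself; demo_out.txt is a made-up file name used only for the demonstration.

```shell
# Hypothetical demo file, pre-filled with one line that ">" should wipe.
printf 'stale line\n' > demo_out.txt

# ">" truncates demo_out.txt once, at the first print; later prints append.
awk 'BEGIN { for (i = 1; i <= 3; i++) print "run1-" i > "demo_out.txt" }'
first_run=$(wc -l < demo_out.txt | tr -d ' ')

# ">>" never truncates, so a second run adds to what is already there.
awk 'BEGIN { for (i = 1; i <= 3; i++) print "run2-" i >> "demo_out.txt" }'
second_run=$(wc -l < demo_out.txt | tr -d ' ')

echo "lines after the > run: $first_run, after the >> run: $second_run"
```

The practical consequence for a script that may be re-run is that > gives fresh .control and .case files each time, while >> keeps growing the old ones.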

MadeInGermany · 11-22-2018, 03:23 AM

Quote:

Originally Posted by Aemm

The code which I am trying to run (on HPC) is

Code:

cd $PBS_O_WORKDIR
for FileName in $(cat Artery_Aorta-ListOfScoreFilesForScript.txt); 
do 
cat "$FileName" | awk -v fn="$FileName" '{if ($3 == 1) {print $6 >>fn".control"} else if ($3 == 2) {print $6 >>fn".case"}}'; 
Rscript t.test($FileName.control, $FileName.case, alternative = "two.sided", paired = FALSE, var.equal = FALSE)
done;

It looks okay. I would avoid the redundant construction of file names. And no UUOC of course!

Code:

cd $PBS_O_WORKDIR || exit
for FileName in $(cat Artery_Aorta-ListOfScoreFilesForScript.txt)
do
  fn1=$FileName.control fn2=$FileName.case
  awk -v fn1="$fn1" -v fn2="$fn2" '{if ($3 == 1) {print $6 >fn1} else if ($3 == 2) {print $6 >fn2}}' $FileName
  Rscript t.test("$fn1", "$fn2", alternative = "two.sided", paired = FALSE, var.equal = FALSE)
done

As I said before, in awk's print you can use > or >>.
The difference is how an existing file is opened at the first write.

The ( ) are interpreted by the shell, so this might still cause an error. But I don't know yet what Rscript is.
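
One possible way around the parentheses problem, offered as a sketch rather than a tested solution for the OP's cluster: assuming Rscript is the command-line front end that ships with R, the R code can be single-quoted and passed with -e, with the two file names given as trailing arguments and picked up through commandArgs(). The files demo.control and demo.case and their values are invented for the example, and reading each file with scan() is an assumption about the one-score-per-line layout produced above.

```shell
# Hypothetical sample data standing in for one file's split output.
printf '%s\n' 0.0071 0.0052 0.0063 > demo.control
printf '%s\n' -0.0084 -0.0012 -0.0031 > demo.case

# Single quotes keep the shell away from the R parentheses; the file names
# are ordinary trailing arguments, so no shell variable appears inside R code.
if command -v Rscript >/dev/null 2>&1; then
  Rscript \
    -e 'a <- commandArgs(trailingOnly = TRUE)' \
    -e 'controls <- scan(a[1], quiet = TRUE); cases <- scan(a[2], quiet = TRUE)' \
    -e 'print(t.test(controls, cases, alternative = "two.sided", paired = FALSE, var.equal = FALSE))' \
    demo.control demo.case
else
  echo "Rscript not found; skipping the t-test" >&2
fi
```

Inside the loop, "$fn1" and "$fn2" would take the place of demo.control and demo.case.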