LinuxQuestions.org


Aemm 11-21-2018 03:55 AM

awk command to direct the output to multiple files
 
Hi all,
I have a list file containing the names of the data files whose columns need to be processed. Each of those data files looks like this:

Code:

FID      IID  PHENO    CNT  CNT2    SCORE
  00010  0001002      2    28      9 -0.00843036
  00017  0001702      1    28      9 0.00710286
  00028  0002801      2    28      9 -0.00125893

I want to split each file on the basis of the 3rd column: if it contains "1", I need to write only the score (6th column) to a separate file with the extension .control; if it contains "2", I need to write the 6th column to a file with the extension .case. Afterwards I need to run an R function on the case and control files. My code is

Code:

IFS=$'\n'
for FileName in $(cat Artery_Aorta-ListOfScoreFilesForScript.txt);
do awk '{ if ($3 == 1) {print $6 >> $FileName.control} else {print $6 >> $FileName.case}}' $FileName;
done;

The above command gives this error:

Quote:

awk: cmd. line:1: { if ($3 == 1) {print $6 >> $FileName.control} else {print $6 >> $FileName.case}}
awk: cmd. line:1: ^ syntax error

However, if I only write the .control files, i.e. select the rows with "1" in the 3rd column and redirect their scores to a separate file, it works.

Code:

IFS=$'\n'
for FileName in $(cat Artery_Aorta-ListOfScoreFilesForScript.txt); 
do awk '{ if ($3 == 1) { print $6 } }' $FileName > $FileName.control;  done;

But I am unable to embed the else condition in the code. Can anyone let me know what's going wrong? Thanks.

pan64 11-21-2018 04:15 AM

What you mixed up is this: FileName is evaluated by the shell, not by awk. In your working example the awk script itself is just '{ if ($3 == 1) { print $6 } }' and nothing more; the redirection to $FileName.control is outside the quotes and is done by the shell.

You cannot mix the two languages: the awk script cannot use $FileName as a variable, because it was defined in bash and the single quotes keep the shell from expanding it. If you want to use it, you need to export it in the shell and read that environment variable from awk, or pass the variable to awk directly.
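
A minimal sketch of the two options described above, assuming a POSIX awk. The sample file name scores.txt and the awk variable name fn are made up for illustration. The first call passes the shell value with -v, the second exports the variable and reads it through awk's ENVIRON array.

```shell
#!/bin/bash
# Hypothetical sample data in a temporary directory (names are illustrative only).
workdir=$(mktemp -d)
FileName="$workdir/scores.txt"
printf '%s\n' 'FID IID PHENO CNT CNT2 SCORE' \
              '00010 0001002 2 28 9 -0.00843036' \
              '00017 0001702 1 28 9 0.00710286' > "$FileName"

# Option 1: hand the shell value to awk as the awk variable fn.
via_v=$(awk -v fn="$FileName" 'NR == 1 { print fn ".control" }' "$FileName")

# Option 2: export the shell variable and read it from awk's environment array.
export FileName
via_env=$(awk 'NR == 1 { print ENVIRON["FileName"] ".control" }' "$FileName")

# Both lines should show the same derived file name.
echo "$via_v"
echo "$via_env"
```

One difference worth knowing: awk processes backslash escape sequences in a value assigned with -v, while values read from ENVIRON are taken literally.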

l0f4r0 11-21-2018 04:39 AM

Indeed, replace your code with the following:
Code:

for FileName in $(cat Artery_Aorta-ListOfScoreFilesForScript.txt); do cat "$FileName" | awk -v fn="$FileName" '{if ($3 == 1) {print $6 >>fn".control"} else {print $6 >>fn".case"}}'; done;

allend 11-21-2018 07:10 AM

This can also be done without invoking 'awk'.
Code:

#!/bin/bash

myfile="Artery_Aorta-ListOfScoreFilesForScript"

while read -a aline; do
  if (( ${aline[2]} == 1 )); then
    echo "${aline[5]}" >> "$myfile".control
  elif (( ${aline[2]} == 2 )); then
    echo "${aline[5]}" >> "$myfile".case
  fi
done < "$myfile".txt

If you are using R, why not just read the entire file, make PHENO a factor and then subset as necessary?

pan64 11-21-2018 07:54 AM

Quote:

Originally Posted by l0f4r0 (Post 5928280)
Indeed, replace your code with the following:
Code:

for FileName in $(cat Artery_Aorta-ListOfScoreFilesForScript.txt); do cat "$FileName" | awk -v fn="$FileName" '{if ($3 == 1) {print $6 >>fn".control"} else {print $6 >>fn".case"}}'; done;

Useless use of cat:
cat file | awk 'script' can be replaced by awk 'script' file
Also, combining for and cat is not recommended; use a while read loop instead. And awk already knows the name of the file it is processing (the built-in variable FILENAME), so there is no need to set a variable.
Code:

while IFS= read -r line; do
    awk '{if ($3 == 1) {print $6 >>FILENAME".control"} else {print $6 >>FILENAME".case"}}' "$line"
done < Artery_Aorta-ListOfScoreFilesForScript.txt

(not tested)
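
To illustrate why a while read loop is preferred over for with cat, here is a small sketch. The list file and the two file names in it are invented for the demonstration; the second name contains a space on purpose.

```shell
#!/bin/bash
# Hypothetical list file with one name per line; the second name contains a space.
list=$(mktemp)
printf '%s\n' 'tissue_a.profile' 'tissue b.profile' > "$list"

# for + cat: the unquoted command substitution is split on spaces, tabs and newlines.
for_count=0
for name in $(cat "$list"); do
  for_count=$((for_count + 1))
done

# while + read: one iteration per line; -r keeps backslashes literal,
# IFS= keeps leading and trailing blanks.
while_count=0
while IFS= read -r name; do
  while_count=$((while_count + 1))
done < "$list"

echo "for/cat iterations:    $for_count"
echo "while/read iterations: $while_count"
rm -f "$list"
```

With the default IFS the for loop runs three times for two lines, while the while loop runs twice. Setting IFS=$'\n', as in the first post, avoids the splitting on spaces, but the unquoted expansion is still subject to glob expansion.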

allend 11-21-2018 08:02 AM

Quote:

awk '{if ($3 == 1) {print $6 >>FILENAME".control"} else {print $6 >>FILENAME".case"}}' $line
That fails on the header line in the input file.

l0f4r0 11-21-2018 08:21 AM

Quote:

Originally Posted by allend (Post 5928344)
That fails on the header line in the input file.

You mean "it works technically, but the else branch also catches the header line and writes it to the .case files"?
If so, the OP should put a sed or grep in front of the awk call, or add a condition to the else branch...

allend 11-21-2018 08:53 AM

Quote:

If so, OP should prepend a sed or grep to his/her awk ...
Perhaps just 'tail -n+2'?

pan64 11-21-2018 09:09 AM

awk can handle it; there is no need for any external tool (post #4, for example, has a solution). The variable NR can also be used:
Code:

NR == 1 { next }
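
Put together with the split on the 3rd column, the header rule could look like the sketch below. The sample file is modelled on the data in the first post, and its name sample.profile is hypothetical. The redirection targets are parenthesized, which avoids ambiguity in some awk implementations.

```shell
#!/bin/bash
# Hypothetical sample file modelled on the data shown in the first post.
workdir=$(mktemp -d)
data="$workdir/sample.profile"
printf '%s\n' '  FID      IID  PHENO    CNT  CNT2    SCORE' \
              '  00010  0001002      2    28      9 -0.00843036' \
              '  00017  0001702      1    28      9 0.00710286' \
              '  00028  0002801      2    28      9 -0.00125893' > "$data"

awk '
  NR == 1 { next }                               # skip the header line
  $3 == 1 { print $6 > (FILENAME ".control") }   # PHENO 1: score goes to .control
  $3 == 2 { print $6 > (FILENAME ".case") }      # PHENO 2: score goes to .case
' "$data"

control=$(cat "$data.control")
cases=$(cat "$data.case")
echo "control: $control"
echo "case:    $cases"
```

Because both conditions are explicit, the header would not match either rule even without the NR test; the next rule just makes the intent visible. Note that this sketch uses > rather than >>, so existing output files are overwritten on each run.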

Aemm 11-21-2018 08:59 PM

Thank you all for the replies. The code works. I have added an else if condition to the code, which takes care of the header line as well. But I am encountering another problem. I want to run some statistical tests using R on the case and control files. Now the variable $FileName is not being read by the R command. I think I am again mixing two languages, i.e. bash and R. For the R t-test, should I loop again to pick up the control and case files?

Aemm 11-21-2018 10:26 PM

The code which I am trying to run (on HPC) is

Code:

cd $PBS_O_WORKDIR
for FileName in $(cat Artery_Aorta-ListOfScoreFilesForScript.txt);
do
cat "$FileName" | awk -v fn="$FileName" '{if ($3 == 1) {print $6 >>fn".control"} else if ($3 == 2) {print $6 >>fn".case"}}';
Rscript t.test($FileName.control, $FileName.case, alternative = "two.sided", paired = FALSE, var.equal = FALSE)
done;


MadeInGermany 11-22-2018 12:27 AM

Quote:

Originally Posted by allend (Post 5928320)
This can also be done without invoking 'awk'.
Code:

#!/bin/bash

myfile="Artery_Aorta-ListOfScoreFilesForScript"

while read -a aline; do
  if (( ${aline[2]} == 1 )); then
    echo "${aline[5]}" >> "$myfile".control
  elif (( ${aline[2]} == 2 )); then
    echo "${aline[5]}" >> "$myfile".case
  fi
done < "$myfile".txt

If you are using R, why not just read the entire file, make PHENO a factor and then subset as necessary?

In this loop, each >> opens the file, appends to it, and closes it again.
This is very I/O intensive; on NFS it would stress the NFS server.
The following opens and closes each file only once, and even gives you the choice between > and >> (overwrite or append to an existing file).
Code:

#!/bin/bash

myfile="Artery_Aorta-ListOfScoreFilesForScript"

while read -a aline; do
  if (( ${aline[2]} == 1 )); then
    echo "${aline[5]}"
  elif (( ${aline[2]} == 2 )); then
    echo "${aline[5]}" >&3
  fi
done < "$myfile".txt > "$myfile".control 3> "$myfile".case

Note that print in awk works like this, too:
the >> or > only decides how the file is opened at the first write. Subsequent writes go to the already open stream, i.e. they append.
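
A small sketch of this behaviour, using a throwaway file whose name out.txt is chosen only for the demonstration:

```shell
#!/bin/bash
workdir=$(mktemp -d)
out="$workdir/out.txt"

# Pre-existing content, to show what each operator does at the first write.
echo "old" > "$out"

# ">" truncates the file once, when awk first opens it; the later prints
# in the same run are added after the first one.
printf '%s\n' a b c | awk -v f="$out" '{ print $1 > f }'
after_truncate=$(cat "$out")

# ">>" opens the existing file for appending, so earlier content is kept.
printf '%s\n' d | awk -v f="$out" '{ print $1 >> f }'
after_append=$(cat "$out")

printf '%s\n' "$after_truncate" "---" "$after_append"
```

After the first awk run the file holds a, b and c (the line "old" is gone); after the second run it holds a, b, c and d.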

MadeInGermany 11-22-2018 03:23 AM

Quote:

Originally Posted by Aemm (Post 5928659)
The code which I am trying to run (on HPC) is

Code:

cd $PBS_O_WORKDIR
for FileName in $(cat Artery_Aorta-ListOfScoreFilesForScript.txt);
do
cat "$FileName" | awk -v fn="$FileName" '{if ($3 == 1) {print $6 >>fn".control"} else if ($3 == 2) {print $6 >>fn".case"}}';
Rscript t.test($FileName.control, $FileName.case, alternative = "two.sided", paired = FALSE, var.equal = FALSE)
done;


It looks okay. I would avoid constructing the file names redundantly. And no UUOC (useless use of cat), of course!
Code:

cd $PBS_O_WORKDIR || exit
for FileName in $(cat Artery_Aorta-ListOfScoreFilesForScript.txt)
do
  fn1=$FileName.control fn2=$FileName.case
  awk -v fn1="$fn1" -v fn2="$fn2" '{if ($3 == 1) {print $6 >fn1} else if ($3 == 2) {print $6 >fn2}}' $FileName
  Rscript t.test("$fn1", "$fn2", alternative = "two.sided", paired = FALSE, var.equal = FALSE)
done

As I said before, in awk's print you can use > or >>.
The only difference is how an existing file is opened at the first write.

The ( ) are interpreted by the shell. This might still cause an error. But I don't know yet what Rscript is.
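
Unquoted parentheses are indeed a syntax error for the shell, and Rscript expects either a script file or R code given with -e, so t.test(...) cannot be written directly on the command line. A hedged sketch of one possible approach follows, assuming Rscript is on the PATH and each file holds one numeric score per line. The helper name run_ttest and the sample values are invented for illustration, and the R step is skipped if Rscript is not installed.

```shell
#!/bin/bash
# Hypothetical helper: runs a two-sample t-test (Welch, since var.equal = FALSE)
# on two files that each contain one numeric value per line.
run_ttest() {
  # The file names travel as trailing arguments, so the R code stays inside
  # single quotes and the shell never interprets its parentheses.
  Rscript \
    -e 'args <- commandArgs(trailingOnly = TRUE)' \
    -e 'control <- scan(args[1], quiet = TRUE)' \
    -e 'cases <- scan(args[2], quiet = TRUE)' \
    -e 'print(t.test(control, cases, alternative = "two.sided", paired = FALSE, var.equal = FALSE))' \
    "$1" "$2"
}

# Made-up sample values standing in for the .control and .case files.
workdir=$(mktemp -d)
fn1="$workdir/sample.control" fn2="$workdir/sample.case"
printf '%s\n' 0.0071 0.0012 -0.0034 > "$fn1"
printf '%s\n' -0.0084 -0.0013 0.0029 > "$fn2"

if command -v Rscript >/dev/null 2>&1; then
  run_ttest "$fn1" "$fn2"
  status="ran"
else
  echo "Rscript not found; skipping the t-test" >&2
  status="skipped"
fi
```

Inside the loop from the post above, the call would be run_ttest "$fn1" "$fn2" after the awk step. Reading the files with scan is an assumption about their layout, not something confirmed in the thread.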


All times are GMT -5.