LinuxQuestions.org - Help with directing awk output to variable

Hi everyone,

New guy here with a problem that will hopefully have an easy solution, but I just can't seem to manage.

So, I have a large list of files that I need to process using the same command line program, and I'm trying to write a small shell script to automate this. I wrote something that will read the input file name from a text file, and repeat the command for each of those files. So far so good. My problem though is with naming the output. Each file is named in the general format "lane_number_bla_bla_bla", and they are processed in pairs. So, there will be a "lane_1_bla_bla_bla_001" and "lane_1_bla_bla_bla_002" that need to combine into a single output file. For this, I'm trying to use awk to read the sample number from the .txt list of input files and parse it into the output file number. Here's the code I came up with (note that the echo statement before the command is there just for testing; it's removed when it comes to run the actual program; also this is not the actual command which is rather more complicated, but the principle still applies):

echo "Which input1 should I use?"
read text
input1=$text # Defines text file that includes filenames of mate 1

echo "Which input2 should I use?"
read text
input2=$text # Defines text file that includes filenames of mate 2

echo "How many lines?"
read text
n=$text # Defines how many lines should be read from filename text files

for i in $(seq 1 $n)
do
awkinput1=$(awk NR==$i $input1) # Defines text at line "i" in filename text file 1 as variable to replace in command line for mate 1
awkinput2=$(awk NR==$i $input2) # Defines text at line "i" in filename text file 2 as variable to replace in command line for mate 2
num=$(awk 'NR==$i{print $2}' FS="_" $input1) # Defines sample number from "i" in filename text file 1 as variable to replace in concatenated file names
lane=$(awk 'NR==$i{print $1}' FS="_" $input1) # Defines lane number from "i" in filename text file 1 as variable to replace in concatenated file names

echo "command $awkinput1.in > $awkinput1.out && command $awkinput2.in > $awkinput2.out && command cat $awkinput1.out $awkinput2.in > $num-$lane-CAT.out &" # Command line of interest

if (( $i % 10 == 0 )); then wait; fi # Limit to 10 concurrent subshells.
done

When I run this, both $awkinput fields get replaced properly in the command line by the appropriate filename, but the $num and $lane fields do not: they come out empty.

So, what am I doing wrong? I'm sure it's pretty simple, but I tried quite a lot of different ways to format the relevant awk command, and nothing seems to work. I'm doing this on a remote linux server using SSH protocol, if it makes a difference.

Thanks a lot!

Code:

awkinput1=$(awk NR==$i $input1)
awkinput2=$(awk NR==$i $input2)
num=$(awk 'NR==$i{print $2}' FS="_" $input1)
lane=$(awk 'NR==$i{print $1}' FS="_" $input1)

I may be off the mark here, but doesn't the == represent equality and the = represent value assignment? If I am correct, then replacing == with = to assign value to a variable should fix the problem.

That's the way it seems to work in examples I've looked up.

Code:

num=$(awk 'NR==$i{print $2}' FS="_" $input1)

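You are trying to reference the shell variable i from within awk, but awk doesn't see shell variables. Inside single quotes the shell leaves $i alone, so awk receives the literal text $i and treats i as one of its own (unset) variables. Note that the previous awk commands didn't have quotes around the awk program, so the value of i was substituted by the shell into the awk program first.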
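
To make that concrete, here is a minimal sketch of two ways to get the shell's value into the awk program: hand it over with awk's -v option, or close the single quotes around the shell variable. The sample list contents and the two helper function names are made up for illustration and are not from the original script.

```shell
#!/bin/bash
# Hypothetical sample list; the real file names will differ.
list=$(mktemp)
trap 'rm -f "$list"' EXIT
printf '%s\n' lane_1_bla_001 lane_2_bla_001 lane_3_bla_001 > "$list"

# Option 1: pass the shell values to awk as awk variables with -v.
field_with_v() {    # usage: field_with_v LINE FIELD FILE
    awk -F_ -v line="$1" -v field="$2" 'NR == line { print $field }' "$3"
}

# Option 2: end the single-quoted string so the shell expands the variable.
field_with_quotes() {    # usage: field_with_quotes LINE FILE
    awk -F_ 'NR == '"$1"' { print $2 }' "$2"
}

i=2
num=$(field_with_v "$i" 2 "$list")
lane=$(field_with_v "$i" 1 "$list")
echo "$num-$lane-CAT.out"
```

The -v form is usually easier to read and keeps the whole awk program inside one pair of single quotes.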

Instead of having awk reading the whole input file multiple times to get a single line each time, I would suggest reading the files sequentially from the shell:

Code:

line=0
while true ; do
    # read next line from $input1 and $input2
    read in1 <&3 || break
    read in2 <&4 || break
    ((line++))

    # get lane and sample number from input1
    IFS=_ read lane num restofline <<<"$in1"

    echo "command $in1.in > $in1.out && command $in2.in > $in2.out && command cat $in1.out $in2.out > $num-$lane-CAT.out &" # Command line of interest
    # or maybe just?
    echo "command $in1.in > $num-$lane-CAT.out && command $in2.in >> $num-$lane-CAT.out &"

    # Limit to 10 concurrent subshells.
    if (( $line % 10 == 0 )) ; then wait ; fi

    # finished $n lines
    if (( $line >= $n )) ; then break ; fi
done 3< "$input1" 4< "$input2"
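
The IFS=_ read line is what takes over from the two awk calls. A small sketch of how it splits a single name, using a made-up file name (only the underscore-separated layout matters):

```shell
#!/bin/bash
# Hypothetical name for illustration.
in1="3_12_sampleA_R1_001"

# IFS=_ applies to this one read only: the first two fields go into
# lane and num, and everything left over lands in restofline.
IFS=_ read -r lane num restofline <<<"$in1"

echo "lane=$lane num=$num rest=$restofline"
```

Because the assignment is a prefix to the read command, the shell's normal IFS is untouched for the rest of the script.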

Thank you both for the quick replies. ntubski got what the problem was: the first single quote was placed in such a way that it included the NR defining expression. When I took that out of the quotes, it worked just fine. It had to be something silly like that, but the devil is always in the details. I was quite interested in the script you suggested as well. My reckoning was that awk was more flexible in taking specific fields from composite names like the ones in my files, but maybe reading them from the shell will be faster and more efficient. I'll definitely give it a go.

Once again, thanks a lot for your help, you guys just saved me an awful lot of time and helped me understand shell scripting a bit better! :)

Please use [code][/code] tags around your code and data, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques.

You can avoid the external file if you used an array instead.

Code:

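files=( '' lane_1_bla_bla_bla_* )

# Stop before the array length: the blank entry at index 0 makes
# ${#files[@]} one larger than the number of real files.
for (( i=1 ; i<${#files[@]} ; i+=2 )); do

    printf -v outfile 'lane_1_bla_bla_bla_output_%03d.txt' "$i"  #zero-pads the output number
    cat "${files[i]}" "${files[i+1]}" > "$outfile"

done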

Note the use of a blank array entry at the beginning, so that the initial index 0 is ignored, allowing the rest to match up. You can also add sanity checks to ensure that you're matching up the correct files.
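
One possible sanity check, as a rough sketch: only accept a pair if both names are identical up to the last underscore. The function name same_pair is made up here, and the check assumes the pairs really do differ only in the trailing _001/_002 part, as in the examples in this thread.

```shell
#!/bin/bash
# Succeeds only when both names are non-empty and identical up to the
# last underscore, i.e. they differ only in the trailing suffix.
same_pair() {
    [[ -n $1 && -n $2 && "${1%_*}" == "${2%_*}" ]]
}

# Hypothetical names for illustration.
if same_pair lane_1_bla_bla_bla_001 lane_1_bla_bla_bla_002; then
    echo "pair ok"
fi
if ! same_pair lane_1_bla_bla_bla_001 lane_2_bla_bla_bla_002; then
    echo "mismatch, skip this pair" >&2
fi
```

Inside the loop this could be used as same_pair "${files[i]}" "${files[i+1]}" || continue, placed before the cat line.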

For reference, a textfile solution could be easier if the file contained two names per line.

Assuming that none of the filenames contains whitespace:

Code:

i=1
while read -r fname1 fname2; do

    printf -v newfile 'outputfile%03d.txt' $(( i++ ))
    cat "$fname1" "$fname2" > "$newfile"

done <inputfile.txt

If the names can contain spaces, then you'd have to use a different delimiter. To read a colon-delimited list, for example, just change the first line to this:

Code:

while IFS=':' read -r fname1 fname2 ; do

How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
http://mywiki.wooledge.org/BashFAQ/001