LinuxQuestions.org
Old 04-01-2016, 11:18 AM   #1
slik
LQ Newbie
 
Registered: Jun 2014
Posts: 4

Rep: Reputation: Disabled
Disastrous performance of 'while read' script


Hi,

I have input looking like:
XP035649954
20160322
20160322
20160324
XP035649953
20160322
20160322
20160324

I want to output every 4 lines in one single line, like:
XP035649954 20160322 20160322 20160324
XP035649953 20160322 20160322 20160324

I wrote a script which does the job, but it has terrible performance. I use the "while read" construct quite often, but it's always slow. I would like to understand why it is performing so badly. Additionally, how can I speed up what I want? Can this be done in awk?


Script I use:
-------------
while read line
do
    out=$(echo $line)
    for i in {1..3}
    do
        read line
        out=$(echo $out $line)
    done
    echo $out >>outfile
done <infile



Thanks in advance
 
Old 04-01-2016, 01:27 PM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,804

Rep: Reputation: 3069
Firstly, please use [code][/code] tags around code / data

Yes this can be done in awk ... look into the NR variable

As for your script, you have a loop inside a loop, so it will naturally take longer than a single loop.
On top of that, you use command substitution around echo to reassign a variable when you could simply assign or append directly:
Code:
out=$(echo $line)
# is just
out=$line

out=$(echo $out $line)
# is just
out="$out $line"
As with the awk solution, you could keep a simple counter and append a newline each time it reaches the chosen number of lines read.

Your initial question talks about outputting the data, but your example writes it to a new file; in that case, build the line as above and simply echo it to the new file each time the counter reaches its target value.
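For what it's worth, the NR hint above can be sketched as a one-liner (a sketch, assuming the input is always a multiple of four lines; the sample input mimics the thread's data):

```shell
# Sample input in the thread's format (ID followed by three dates).
printf '%s\n' XP035649954 20160322 20160322 20160324 > infile

# Print each line followed by a space, except every 4th line,
# which is followed by a newline (NR is awk's current record number).
awk '{ printf "%s%s", $0, (NR % 4 ? " " : "\n") }' infile
```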
 
1 members found this post helpful.
Old 04-01-2016, 01:44 PM   #3
michaelk
Moderator
 
Registered: Aug 2002
Posts: 20,847

Rep: Reputation: 3756
You're using the echo command to remove the newline character and append, which is inefficient. You can use bash's parameter expansion instead:

Code:
while read line
do
    out=${line%$'\n'}
    for i in {1..3}
    do
        read line
        line=${line%$'\n'}
        out=$out" "$line
    done
    echo $out >> outfile
done < infile
On a 4000 line file the time it takes is:
Quote:
real 0m5.730s
user 0m0.680s
sys 0m1.280s

vs
real 0m0.320s
user 0m0.272s
sys 0m0.036s
Note: The first time I ran your script the times were 2x slower. Not sure what changed.

Last edited by michaelk; 04-01-2016 at 01:53 PM.
 
1 members found this post helpful.
Old 04-01-2016, 02:01 PM   #4
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: CentOS
Posts: 4,504

Rep: Reputation: 2062
Quote:
Originally Posted by michaelk View Post
You're using the echo command to remove the newline character, which is inefficient. You can use bash's parameter expansion instead

Code:
while read line
do
    out=${line%$'\n'}
    for i in {1..3}
    do
        read line
        line=${line%$'\n'}
        out=$out" "$line
    done
    echo $out >> outfile
done < infile
That script is inefficiently re-opening the output file for every line written. That's a fairly expensive operation. Try this:
Code:
while read out
do
    for i in {1..3}
    do
        read line
        out="$out $line"
    done
    echo $out
done < infile >outfile
I've also eliminated unnecessary variable manipulation. Time for a 4000 line input file is 32ms.

Last edited by rknichols; 04-01-2016 at 02:02 PM. Reason: typo
 
1 members found this post helpful.
Old 04-01-2016, 02:32 PM   #5
slik
LQ Newbie
 
Registered: Jun 2014
Posts: 4

Original Poster
Rep: Reputation: Disabled
Thanks all for your feedback.
Very helpful comments indeed.
 
Old 04-01-2016, 03:27 PM   #6
michaelk
Moderator
 
Registered: Aug 2002
Posts: 20,847

Rep: Reputation: 3756
Quote:
while read out
do
    for i in {1..3}
    do
        read line
        out="$out $line"
    done
    echo $out
done < infile >outfile
Much better than my script... Although the relative time difference from mine (I'm using a VM) is only 0.11 sec.
 
Old 04-01-2016, 04:06 PM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,804

Rep: Reputation: 3069
Couple of alternatives:
Code:
#!/usr/bin/env bash

cnt=1

while read line
do
	[[ "$out" ]] && out="$out $line" || out=$line

	if (( cnt++ % 4 == 0 ))
	then
		echo $out
		unset out
	fi
done< infile > outfile

real	0m0.081s
user	0m0.073s
sys	0m0.007s

awk 'ORS = (NR % 4) ? " " : RS' infile > outfile

real	0m0.008s
user	0m0.007s
sys	0m0.000s
Both of the above times are with 4000 lines
 
Old 04-03-2016, 11:05 AM   #8
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 1,562

Rep: Reputation: 708
Another bash/ksh approach:
Code:
while read a0; read a1; read a2; read a3
do
  echo "$a0 $a1 $a2 $a3"
done < infile
Interesting: with an array it becomes awfully slow!?
Code:
while read A[0]; read A[1]; read A[2]; read A[3]
do
  echo "${A[@]}"
done < infile
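A possible way around the slow per-element reads is bash 4's mapfile (a.k.a. readarray) builtin, which fills a whole array in one call. A minimal sketch (the sample input mimics the thread's data; mapfile itself succeeds even at EOF, so the loop tests the array length):

```shell
#!/usr/bin/env bash
# Sample input: two records of 4 lines each.
printf '%s\n' XP035649954 20160322 20160322 20160324 \
              XP035649953 20160322 20160322 20160324 > infile

# Read up to 4 lines at a time into array A (-t strips newlines);
# stop when a read returns an empty array.
while mapfile -t -n 4 A && (( ${#A[@]} )); do
    echo "${A[*]}"          # join the 4 elements with spaces
done < infile > outfile

cat outfile
```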

Last edited by MadeInGermany; 04-03-2016 at 11:08 AM.
 
Old 04-03-2016, 12:12 PM   #9
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: CentOS
Posts: 4,504

Rep: Reputation: 2062
There are a bazillion ways to do it.
Code:
sed '{N;N;N;s/\n/ /g}' infile >outfile
Just 2 milliseconds for that one with 4000 lines.
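Two more standard-tool variants in the same "bazillion ways" spirit (a sketch; paste is POSIX, and xargs behaves here only because the data contains no quotes or backslashes, which xargs would interpret):

```shell
# Sample input: two records of 4 lines each.
printf '%s\n' XP035649954 20160322 20160322 20160324 \
              XP035649953 20160322 20160322 20160324 > infile

# paste consumes one input line per '-' operand, so four '-' join 4 lines:
paste -d' ' - - - - < infile

# xargs -n4 re-emits its input 4 whitespace-separated tokens per line:
xargs -n4 < infile
```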

Last edited by rknichols; 04-03-2016 at 12:16 PM.
 