[SOLVED] Bash script to read csv file with multiple length columns

japena · 07-27-2011, 12:24 AM

I've searched everywhere and I can't come up with a good solution. Unfortunately I'm kind of stuck using a shell script to achieve the following.

I have the following type of data

0.46,0.45,0.43,0.42,0.43,0.52,0.57,0.65,0.69,0.70
0.71,0.95,0.95,1.00,1.02,1.03,1.02,1.16
1.21,1.41,1.42,1.40,1.40,1.39,1.39,1.35,1.45
1.67,1.66,1.65,1.65,1.63,1.65,1.68,1.66,1.64,1.60,1.58
1.56,1.52,1.47,1.42

For each line I need to find the average, min, and max. I've seen plenty of solutions where the number of columns is fixed, but unfortunately for me these lines can get pretty large.

My thought was to read each line individually into an array, then loop through the array and find the avg, min, and max that way, but I haven't had much luck.

I can read each line using a while loop but I'm having trouble with the array part, or perhaps that's not the best solution? Any suggestions, help is appreciated.

catkin · 07-27-2011, 12:34 AM

Does it have to be bash? Bash is not naturally suited to this type of problem, neither in ease of programming nor in running speed.

If it has to be bash then it is possible so please ask.

japena · 07-27-2011, 12:36 AM

Yes, unfortunately it has to be bash or shell. I would personally much rather use ruby or almost anything else.

catkin · 07-27-2011, 12:58 AM

Quote:

Originally Posted by japena

Yes, unfortunately it has to be bash or shell. I would personally much rather use ruby or almost anything else.

Before launching into a pure bash solution ...

bash does not have fractional arithmetic capability. The normal solution is for bash to call an external command such as bc (expr can do arithmetic too, but only on integers). bash is a command shell: it is a way of running commands that also has some language constructs. Are you allowed to call awk from your bash script?

japena · 07-27-2011, 01:08 AM

Yes, I was planning on using expr to do the sum while keeping a counter to be able to divide afterwards for the average. Yes, I'm able to use awk. I saw several examples that use awk, but all of them had a fixed number of columns, most of the time only 2 or 3, which doesn't work for me, so I didn't see how I could use awk.

catkin · 07-27-2011, 01:15 AM

So it would be OK to use an awk script directly ... ?

japena · 07-27-2011, 01:16 AM

catkin · 07-27-2011, 01:19 AM

OK

I've got to go now but will help with an awk solution later if nobody else has by then.

japena · 07-27-2011, 01:23 AM

Great, would really appreciate it. I should mention that I can't use expr after all because the file doesn't only contain integers.

crts · 07-27-2011, 01:53 AM

Hi,

I had been thinking about a pure bash solution before you stated that awk is OK to use. Well, awk is definitely the way to go. However, since I spent some time thinking about pure bash I'd still like to present a clumsy pure bash solution:

Code:

IFS=',';while read line; do set -- $line; echo "10 k 0 ${line//,/+}+${#}/ p" | dc ; done < file

You will notice that values like '0.123' are just printed as '.123'. I am not sure if there is any way to tell 'bc' to format the output like a normal person would expect it. So I tried to compute the result with 'dc'. But it has the same problem regarding the formatting.

crts · 07-27-2011, 02:17 AM

Ok,

an awk solution that calculates average, min and max values:

Code:

awk -F ',' '{min=$1;max=$1;a=0;for (i=1;i<=NF;i++) {a+=$i;if ($i < min){min=$i};if ($i > max){max=$i}};print "average: " a/NF " min: " min " max: " max}' file

Not sure if the results are needed for further processing. If so, then you might need an alternative output format.
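
If the results do feed into further processing, one option is to emit plain comma-separated values with a fixed number of decimals. The following is only a sketch: the function name row_stats, the avg,min,max column order and the two-decimal format are all assumptions, not something specified in this thread.

```shell
# Hypothetical helper: prints "avg,min,max" for each input line, two decimals each.
# Reads the files given as arguments, or standard input if there are none.
row_stats() {
    awk -F ',' '
        NF == 0 { next }                  # skip empty lines (avoids division by zero)
        {
            min = max = $1 + 0            # "+ 0" forces a numeric value
            sum = 0
            for (i = 1; i <= NF; i++) {
                v = $i + 0
                sum += v
                if (v < min) min = v
                if (v > max) max = v
            }
            printf "%.2f,%.2f,%.2f\n", sum / NF, min, max
        }' "$@"
}

# Usage: row_stats file > stats.csv
```

For the last sample line (1.56,1.52,1.47,1.42) this prints 1.49,1.42,1.56. A side effect of printf here is that values below 1 keep their leading zero.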

catkin · 07-27-2011, 05:44 AM

Quote:

Originally Posted by crts

values like '0.123' are just printed as '.123'. I am not sure if there is any way to tell 'bc' to format the output like a normal person would expect it.

AFAIK there is no way to tell bc to do that. You could capture the bc or dc output and format it with bash's printf:

Code:

IFS=','
while read -r line
do
    set -- $line    # Split on commas; $# is now the number of fields
    avg=$( echo "10 k 0 ${line//,/+}+${#}/ p" | dc )
    printf '%1.2f\n' "$avg"    # Adds the leading zero, e.g. .53 becomes 0.53
done < file
unset IFS # Effectively restores the default value

japena · 07-27-2011, 09:18 AM

Hi crts, both solutions work great. I really have to start learning awk.

Not sure I understand this line in the bash solution

echo "10 k 0 ${line//,/+}+${#}/ p" | dc

Could you tell me what the "10 k 0" and "p" are?

crts · 07-27-2011, 09:51 AM

Quote:

Originally Posted by japena

Hi crts, both solutions work great. I really have to start learning awk.

Not sure I understand this line in the bash solution

echo "10 k 0 ${line//,/+}+${#}/ p" | dc

Could you tell me what the "10 k 0" and "p" are?

Ok,

there are some things you need to know about dc:
1. it is a reverse Polish notation (RPN) calculator
2. division is by default an integer division, i.e. 3/2 will return 1 as result. You have to explicitly set the precision to get the fractional part.

Let's break the above statement down:
${line//,/+}
This is bash's string substitution mechanism. Suppose we have the following input
a,b
Afterwards the input will be:
a+b

As I mentioned, dc is an RPN calculator, so it would expect input in the form of
a b +
This is not yet the case, so we need to manipulate the input a bit more. Instead of a complicated reordering I simply prepend a zero and append a plus:
0 a + b +
This is indeed a valid RPN expression and equivalent to a+b (infix notation).

${#} is the number of arguments that have been "created" by 'set -- $line'. This is what we need to divide by to get the average - in the example that would be 2. In RPN this looks like
0 a + b + 2 /
This is our expression that is equivalent to (a+b)/2. After it is calculated we need to tell dc to print the result. This is what 'p' does.
The '10 k' part sets the precision. As I said, division is by default an integer division. To get the fractional part we set '10 k', which tells dc to keep 10 digits after the decimal point. E.g.:
'3 2 / p' will by default print 1
'2 k 3 2 / p' will print 1.50
'10 k 3 2 / p' will print 1.5000000000

The calculation is a stack based operation process. If you are not familiar with RPN then this will probably look a bit confusing at first. Read the link I provided and consult the manpage of dc for more information.
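
To see exactly what dc receives, the expression can be built and printed before it is piped anywhere. This is a small bash sketch using a two-value input like the a,b example above. The function name build_dc_expr is made up for illustration, and the dc call is guarded because dc is not installed on every system.

```shell
# Hypothetical helper (bash): turns "a,b,..." into the RPN text that is fed to dc.
build_dc_expr() {
    local line=$1 IFS=','
    set -- $line                      # split on commas so $# counts the fields
    printf '%s\n' "10 k 0 ${line//,/+}+${#}/ p"
}

build_dc_expr "1.5,2.5"               # prints: 10 k 0 1.5+2.5+2/ p

# Evaluate the expression only if dc is available.
if command -v dc >/dev/null 2>&1; then
    build_dc_expr "1.5,2.5" | dc      # prints: 2.0000000000
fi
```

Keeping IFS local to the function means the caller's word splitting is left untouched, so no 'unset IFS' is needed afterwards.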

PS: I had a solution with bc first, which is an infix calculator. As I mentioned in a previous post, there was the problem with the formatting, so I experimented with dc to see if it has the same problem. It does.
I only posted the bash solution because I had been thinking about it before I knew that awk is OK to use. I do not really recommend it.
I posted the dc solution instead of bc because, well, I thought if I am going to post an ugly solution then it might as well be the ugliest one I came up with.

japena · 07-27-2011, 12:58 PM

Thanks for the explanation, it makes a lot more sense now. I'm not familiar with RPN, which made it that much more confusing. I'm definitely going to go with the awk solution as it's much more elegant and easier to understand. I have to do some more formatting but I think I can take it from here now. Just one more thing: what's the proper format of the awk command in multiple lines?
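
On the multi-line question: awk does not mind line breaks between statements, so one common layout is to put the program in its own file and run it with awk -f. The sketch below spreads the one-liner from earlier in the thread over several lines. The script path, and the extra rule that skips empty lines, are assumptions added for illustration.

```shell
# Write the awk program to its own file; the path is arbitrary.
script="${TMPDIR:-/tmp}/stats.awk"
cat > "$script" <<'EOF'
BEGIN { FS = "," }          # same effect as -F ',' on the command line
NF == 0 { next }            # skip empty lines to avoid dividing by zero
{
    min = $1
    max = $1
    a = 0
    for (i = 1; i <= NF; i++) {
        a += $i
        if ($i < min) { min = $i }
        if ($i > max) { max = $i }
    }
    print "average: " a / NF " min: " min " max: " max
}
EOF

# Run it against the data file:
#   awk -f "$script" file
```

The same multi-line text also works inline: everything between the single quotes of awk -F ',' '...' file may span several lines, since the shell keeps newlines inside single quotes.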