[SOLVED] Bash script to read csv file with multiple length columns
For each line I need to find the average, min, and max. I've seen plenty of solutions where the number of columns is fixed. Unfortunately for me, the number of columns varies from line to line and the lines can get pretty long.
My thought was to read each line individually into an array, then loop through the array and find the avg, min, and max that way, but I haven't had much luck.
I can read each line using a while loop but I'm having trouble with the array part, or perhaps that's not the best solution? Any suggestions, help is appreciated.
Yes, unfortunately it has to be bash or shell. I would personally much rather use ruby or almost anything else.
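The array idea described in the question can be sketched roughly as follows. This is only an illustration under a loud assumption: every field is an integer, because bash arithmetic cannot handle fractions. The function name line_stats and the sample rows are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: per-line min, max and average using a bash array.
# Assumption: every field is an integer (bash has no fractional arithmetic).

line_stats() {
    local line=$1 fields v min max sum=0
    # Split on commas into an array; works for any number of columns.
    IFS=',' read -r -a fields <<< "$line"
    min=${fields[0]}
    max=${fields[0]}
    for v in "${fields[@]}"; do
        (( v < min )) && min=$v
        (( v > max )) && max=$v
        (( sum += v ))
    done
    # Integer division: the average is truncated, not rounded.
    echo "min=$min max=$max avg=$(( sum / ${#fields[@]} ))"
}

# Hypothetical sample rows of different lengths, standing in for the CSV file.
printf '%s\n' '3,9,6' '10,20,30,40' |
while IFS= read -r line; do
    line_stats "$line"
done
```

For real data with fractional values the arithmetic would have to be handed to an external tool, which is where the replies below pick up.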
Before launching into a pure bash solution ...
bash does not have fractional arithmetic capability. The normal solution is for bash to call an external command such as bc or dc (expr, like bash itself, only does integer arithmetic). bash is a command shell: it is a way of running commands that also has some language constructs. Are you allowed to call awk from your bash script?
Yes, I was planning on using expr to do the sum while keeping a counter to be able to divide afterwards for the average. Yes, I'm able to use awk. I saw several examples that use awk, but all of them had a fixed number of columns, most of the time only 2 or 3, which doesn't work for me, so I didn't see how I could use awk.
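A quick way to see the integer-only limitation, as a minimal sketch. The variable names are made up for illustration, and awk is used for the fractional result only because it is one of the external tools mentioned here.

```shell
#!/usr/bin/env bash
# bash arithmetic is integer-only: the fractional part is dropped.
int_result=$(( 3 / 2 ))
echo "bash: $int_result"      # bash: 1

# An external tool does the floating-point division instead.
frac_result=$(awk 'BEGIN { printf "%.2f", 3 / 2 }')
echo "awk:  $frac_result"     # awk:  1.50
```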
I had been thinking about a pure bash solution before you stated that awk is OK to use. Well, awk is definitely the way to go. However, since I spent some time thinking about pure bash I'd still like to present a clumsy pure bash solution:
Code:
IFS=',';while read line; do set -- $line; echo "10 k 0 ${line//,/+}+${#}/ p" | dc ; done < file
You will notice that values like '0.123' are just printed as '.123'. I am not sure if there is any way to tell 'bc' to format the output like a normal person would expect it. So I tried to compute the result with 'dc'. But it has the same problem regarding the formatting.
values like '0.123' are just printed as '.123'. I am not sure if there is any way to tell 'bc' to format the output like a normal person would expect it.
AFAIK there is no way to tell bc to do that. You could capture the bc or dc output and format it with bash's printf:
Code:
IFS=','
while read line
do
set -- $line
avg=$( echo "10 k 0 ${line//,/+}+${#}/ p" | dc )
printf '%1.2f\n' "$avg"
done < file
unset IFS # Effectively restores the default value
Last edited by catkin; 07-27-2011 at 05:46 AM.
Reason: brevity and clarity
Hi crts, both solutions work great I really have to start learning awk.
Not sure I understand this line in the bash solution
echo "10 k 0 ${line//,/+}+${#}/ p" | dc
Could you tell me what the "10 k 0" and "p" are?
Ok,
there are some things you need to know about dc:
1. it is a reverse Polish notation (RPN) calculator
2. division is by default an integer division, i.e. 3/2 will return 1 as result. You have to explicitly set the precision to get the fractional part.
Let's break the above statement down:
${line//,/+}
This is bash's string substitution mechanism. Suppose we have the following input:
a,b
Afterwards the input will be:
a+b
As I mentioned, dc is an RPN calculator, so it would expect input in the form of
a b +
This is not yet the case, so we need to manipulate the input a bit more. Instead of a complicated reordering I simply prepend a zero and append a plus:
0 a + b +
This is indeed a valid RPN expression and equivalent to a+b (infix notation).
${#} is the number of arguments that have been "created" by 'set -- $line'. This is what we need to divide by to get the average - in the example that would be 2. In RPN this looks like
0 a + b + 2 /
This is our expression that is equivalent to (a+b)/2. After it is calculated we need to tell dc to print the result. This is what 'p' does.
The '10 k' part sets the precision. As I said, division is by default an integer division. To get the fractional part we set '10 k', which tells dc to keep 10 digits after the decimal point (anything beyond that is truncated). E.g.:
'3 2 / p' will by default print 1
'2 k 3 2 / p' will print 1.50
'10 k 3 2 / p' will print 1.5000000000
The calculation is a stack based operation process. If you are not familiar with RPN then this will probably look a bit confusing at first. Read the link I provided and consult the manpage of dc for more information.
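Putting the pieces above together for one concrete row may help. This is only a sketch: the sample row and the names old_ifs and rpn are made up for illustration, and dc is run only if it is installed.

```shell
#!/usr/bin/env bash
line='1,2,4'                 # hypothetical sample row

old_ifs=$IFS
IFS=','
set -- $line                 # positional parameters become 1 2 4, so $# is 3
IFS=$old_ifs

# Precision, leading zero, commas turned into '+', trailing '+', count, '/', print.
rpn="10 k 0 ${line//,/+}+${#}/ p"
echo "$rpn"                  # 10 k 0 1+2+4+3/ p

# Evaluate the expression: (1+2+4)/3 with 10 digits after the decimal point.
command -v dc >/dev/null && echo "$rpn" | dc    # 2.3333333333
```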
PS: I had a solution with bc first, which is an infix calculator. As I mentioned in a previous post, there was the problem with the formatting, so I experimented with dc to see if it has the same problem. It does.
I only posted the bash solution because I had been thinking about it before I knew that awk is OK to use. I do not really recommend it.
I posted the dc solution instead of bc because, well, I thought if I am going to post an ugly solution then it might as well be the ugliest one I came up with.
Last edited by crts; 07-27-2011 at 10:05 AM.
Reason: typos
Thanks for the explanation, it makes a lot more sense now. I'm not familiar with RPN, which made it that much more confusing. I'm definitely going to go with the awk solution as it's much more elegant and easier to understand. I have to do some more formatting but I think I can take it from here now. Just one more thing: what's the proper format of the awk command in multiple lines?
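On the multi-line question, one common layout is to put the whole awk program inside a single pair of single quotes and break lines freely inside it. The sketch below is not the awk solution posted earlier in the thread (which is not reproduced here); it is an independent illustration, and the function name row_stats, the output format, and the sample rows are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: a multi-line awk program that prints avg, min and max per CSV line.
row_stats() {
    awk -F',' '
    NF == 0 { next }                    # skip empty lines (avoids dividing by zero)
    {
        min = max = sum = $1 + 0        # "+ 0" forces a numeric value
        for (i = 2; i <= NF; i++) {     # NF is the number of fields on this line
            v = $i + 0
            if (v < min) min = v
            if (v > max) max = v
            sum += v
        }
        printf "avg=%.2f min=%s max=%s\n", sum / NF, min, max
    }'
}

# Hypothetical sample rows with different numbers of columns.
printf '%s\n' '3,9,6' '0.5,0.123,2' | row_stats
```

Because the loop runs up to NF, the number of columns does not need to be known in advance. To run it against a file instead of the sample rows, redirect the file into the function, e.g. row_stats < file.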