Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game. |
| Notices |
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
Are you new to LinuxQuestions.org? Visit the following links:
Site Howto |
Site FAQ |
Sitemap |
Register Now
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
 |
GNU/Linux Basic Guide
This 255-page guide will provide you with the keys to understand the philosophy of free software, teach you how to use and handle it, and give you the tools required to move easily in the world of GNU/Linux. Many users and administrators will be taking their first steps with this GNU/Linux Basic guide and it will show you how to approach and solve the problems you encounter.
Click Here to receive this Complete Guide absolutely free. |
|
 |
|
04-11-2012, 04:29 PM
|
#1
|
|
LQ Newbie
Registered: Mar 2012
Posts: 13
Rep: 
|
Using BASH to automate data processing and table generation.
Hello,
I have a BASH scripting question because I am almost certain it could have helped me recently. I have twenty individual text files. Each text file contains the x,y coordinates of 20,000 datapoints. Each text file has two columns of 20,000 rows. The first column is the x-coordinate and the second column is the y-coordinate for a dataset. In may be not be obvious, but the x and y coordinates for each point have to "stay together". In brief, I performed the following sequence of actions manually using excel.
1) Open each file and sort both columns of the data based upon column 2; the sort was ascending on column 2.
2) Delete all data points whose y-coordinate (column 2) was less than a specified value, in this case 100.
3) Divide each value in column 2 by it's corresponding value in column 1 and place the result in column 3. (For excel users, column C was =B2/A2)
4) Calculate the mean for column 3
5) Generate a simple two column table containing the file names in column 1 and its corresponding column 3 mean in column 2.
I am almost certain that if I knew how to do it, a BASH script would have completed this whole sequence of tasks automatically and in a very short period of time. I am writing to ask for the forum's help. Can this be done with BASH scripting, and if so, what commands should I learn about? Sample scripts are most welcome.
Thanks!
Mark
|
|
|
|
04-11-2012, 04:51 PM
|
#2
|
|
Senior Member
Registered: Mar 2012
Distribution: Red Hat
Posts: 1,380
|
Thats quite a confusing set of steps you are taking. To answer your questions, yes basically anything can be done with bash nowadays.
You will obviously want to learn about bash'es built in math functions. As far as getting the data from the columns there a few different ways to accomplish it such as a simple cut command or an in-depth awk statement. You can use the sort function to sort on the second column and then do some bash if statements to test values and such. These are some general pointers and I'm sure if I could see the data files in front of me(or an example) it would give me a better idea of what your final goal is that you are trying to accomplish but I am having a hard time following what the actual output of your steps was.
|
|
|
|
04-11-2012, 06:59 PM
|
#3
|
|
Senior Member
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 2,962
Rep: 
|
Quote:
Originally Posted by marcusrex
I have a BASH scripting question because I am almost certain it could have helped me recently. I have twenty individual text files. Each text file contains the x,y coordinates of 20,000 datapoints. Each text file has two columns of 20,000 rows. The first column is the x-coordinate and the second column is the y-coordinate for a dataset. In may be not be obvious, but the x and y coordinates for each point have to "stay together". In brief, I performed the following sequence of actions manually using excel.
|
bash doesn't seem like the appropriate tool. I have no doubt it can be done and that there will be some creative awk+ bc atrocity to handle it, but it sounds like you need a real interpreted language. Do you have R installed? If so, here is a command-line script that might work for you. This assumes that the data is in CSV format. If not, it can be changed to use a different separator.
Code:
#!/usr/bin/Rscript --vanilla
#store the arguments passed to the script in a list
try(args <- commandArgs(TRUE),silent=TRUE)
#variable to store the table of means
all.means <- NULL
for (filename in args)
{
try({
#load the data from the file
values <- data.frame(read.csv(filename,header=FALSE))
#the stuff you described
values <- values[values[,2]>=100,]
values <- values[order(values[,2]),]
values <- cbind(values,values[,2]/values[,1])
#append the filename and mean for the current file to the table
all.means <- rbind(all.means,data.frame(filename=filename,mean=mean(values[,3])))
#save the revised data to a file (appending "~" to the end of the NEW filename)
write.table(values,file=paste(filename,'~',sep=''),row.names=FALSE,col.names=FALSE,sep=',')
})
}
#print the table of means to standard output (or replace 'stdout()' with a filename)
write.table(all.means,file=stdout(),row.names=FALSE,col.names=FALSE,sep=',')
Kevin Barry
PS If you have (or decide to install) R, Rscript might be in a different location, in which case you'd have to change /usr/bin/Rscript on the first line.
Last edited by ta0kira; 04-11-2012 at 07:03 PM.
|
|
|
|
04-11-2012, 07:02 PM
|
#4
|
|
Senior Member
Registered: Mar 2012
Distribution: Red Hat
Posts: 1,380
|
Ooo R script.. Thats a great language if you know it, python or perl would also do well. I would say pick one and learn it well don't try to learn a little about all of them. Python is my choice.
|
|
|
|
04-11-2012, 07:11 PM
|
#5
|
|
Senior Member
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 2,962
Rep: 
|
Quote:
Originally Posted by Kustom42
Ooo R script.. Thats a great language if you know it, python or perl would also do well. I would say pick one and learn it well don't try to learn a little about all of them. Python is my choice.
|
I'm learning python, but I've done a lot of math programming in R so I knew I'd be able to throw that together quickly. I'd also prefer C++ to bash for this one, but I'm not one to offer a compiled solution in response to a request for a script (as of today anyway; the past is a little hazy.)
Kevin Barry
|
|
|
|
04-11-2012, 08:51 PM
|
#6
|
|
Senior Member
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 4,554
|
The point is, I think ... Linux gives you dozens of most-excellent real programming languages to choose from. All of them are free. All of them are designed to do what bash is, quite honestly, not designed to do.
The first step in solving any problem is: choosing the right tool for the job. You simply don't earn any Brownie points for "wedging" an inappropriate round-peg into a square-hole, even if you happen to succeed.
|
|
|
|
04-11-2012, 10:29 PM
|
#7
|
|
Senior Member
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 2,962
Rep: 
|
Quote:
Originally Posted by sundialsvcs
The first step in solving any problem is: choosing the right tool for the job. You simply don't earn any Brownie points for "wedging" an inappropriate round-peg into a square-hole, even if you happen to succeed.
|
Easy for a programmer to say! For someone who isn't an experienced programmer, any programming task is like forcing a round peg into a square hole. That being said, everyone should become an experienced programmer.
Kevin Barry
|
|
|
|
04-12-2012, 02:07 AM
|
#8
|
|
LQ Newbie
Registered: Mar 2012
Posts: 13
Original Poster
Rep: 
|
Haha! I'm really enjoying LQ so far. Lot's of great responses and more importantly, good advice. Let's see here...a few people made reference to square pegs and round holes. I'm a pro scientist at the moment, which I mention to make it clear that I am completely used to pounding on square pegs. Anyhow, my specialty doesn't generally require me to write code. Usually there's a commercial or excellent open source software package that meets my needs. However, there are these little things, like the task I mentioned in my post, where automation would really help and there I have NO experience. Thank you for all of your advice so far.
In the back of my mind I was wondering if a real programming language like C++, python, etc, might be the better tool. One thing that really impresses me about UNIX and Linux is their standard toolchests. I thought maybe there were enough tools in BASH scripting that a wise computer literate person might choose that route over a programming language solution. Anyhow, I have learned that I have more to learn.
Thanks to all so far!
|
|
|
|
04-12-2012, 04:21 AM
|
#9
|
|
Member
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 621
|
Quote:
Originally Posted by marcusrex
Hello,
I have a BASH scripting question because I am almost certain it could have helped me recently. I have twenty individual text files. Each text file contains the x,y coordinates of 20,000 datapoints. Each text file has two columns of 20,000 rows. The first column is the x-coordinate and the second column is the y-coordinate for a dataset. In may be not be obvious, but the x and y coordinates for each point have to "stay together". In brief, I performed the following sequence of actions manually using excel.
1) Open each file and sort both columns of the data based upon column 2; the sort was ascending on column 2.
2) Delete all data points whose y-coordinate (column 2) was less than a specified value, in this case 100.
3) Divide each value in column 2 by it's corresponding value in column 1 and place the result in column 3. (For excel users, column C was =B2/A2)
4) Calculate the mean for column 3
5) Generate a simple two column table containing the file names in column 1 and its corresponding column 3 mean in column 2.
I am almost certain that if I knew how to do it, a BASH script would have completed this whole sequence of tasks automatically and in a very short period of time. I am writing to ask for the forum's help. Can this be done with BASH scripting, and if so, what commands should I learn about? Sample scripts are most welcome.
Thanks!
Mark
|
Just curious...
a) What's the point of sorting the values in step 1?
b) wouldn't it be more efficient to perform step 1 after step 2? (step 2 will be O(N), while step 1 will be O(N^2) or O(N log N) depending on your sorting algorithm)
c) to let the beast out of the cage, how about something like this? (not tested):
Code:
...
[ -e "$outfile" ] && rm "$outfile"
for file in $files; do
awk -v ifname="$file" -v ofname="$outfile" '
BEGIN { sum = 0; num = 0; }
$2 > 100 { x=$2/$1, sum+=x; num++; print $1, $2, x | sort -nk 2 }
END { print ifname, sum/num >> $outfile }
' > "${file}.sorted"
done
Last edited by millgates; 04-12-2012 at 04:47 AM.
|
|
|
1 members found this post helpful.
|
04-12-2012, 07:26 AM
|
#10
|
|
Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 6,310
|
Maybe something like:
Code:
awk '$2 < 100{tot[FILENAME]+= $2 / $1;count[FILENAME]++}END{for(f in tot)print f,tot[f]/count[f]}' file1 file2 ...
|
|
|
1 members found this post helpful.
|
04-12-2012, 07:28 AM
|
#11
|
|
Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 756
Rep: 
|
Quote:
Originally Posted by millgates
c) to let the beast out of the cage, how about something like this? (not tested)[code]
|
As a learning exercise I tried to test this code. I built five small sample input files named /home/daniel/Desktop/LQfiles/dbm330i01.txt through /home/daniel/Desktop/LQfiles/dbm330i05.txt. I embedded your code in a script named /home/daniel/Desktop/LQfiles/dbm330.bin.
As a first order of business I attempted to create a list of file names for the input files with this ...
Code:
files=ls /home/daniel/Desktop/LQfiles/dbm330i*.txt
... and a test execution resulted in this ...
Code:
daniel@daniel-desktop:~$ bash /home/daniel/Desktop/LQfiles/dbm330.bin
/home/daniel/Desktop/LQfiles/dbm330.bin:
line 29: /home/daniel/Desktop/LQfiles/dbm330i01.txt: Permission denied
I am a raw newbie; this is the first time I ever used the ls command and know nothing about permissions. Where do I start?
Daniel B. Martin
|
|
|
|
04-12-2012, 08:11 AM
|
#12
|
|
Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 6,310
|
Well the first rule to learn is generally you do not want to use ls, although in this case it may not have hurt. See here for reasons why.
Assuming we do not have to be concerned with word splitting, ie no whitespace is involved in file or path names, the simplest way to create you array would be:
Code:
files=( /home/daniel/Desktop/LQfiles/dbm330i*.txt )
The downside here is if there are no files of that pattern as the only item in the array will now be exactly what you have typed in, asterisk and all, which of course probably does not exist.
A safer alternative, which not only helps with the last scenario but also word splitting is to have the for loop use the same globbing construct:
Code:
for f in /home/daniel/Desktop/LQfiles/dbm330i*.txt
Hope that helps.
|
|
|
1 members found this post helpful.
|
04-12-2012, 08:26 AM
|
#13
|
|
Member
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 621
|
Ok, there were quite a few bugs in my code, this fixed version should work, though:
Code:
#!/bin/bash
outfile='table.txt'
[ -e "$outfile" ] && rm "$outfile"
for file in [whatever matches your files]; do
awk -v ifname="$file" -v ofname="$outfile" '
BEGIN { sum = 0; num = 0; OFS="\t" }
$2 > 100 { x=$2/$1; sum+=x; num++; print $1"\t"$2"\t", x | "sort -nk 2" }
END { print ifname, sum/num >> ofname }
' "$file" > "${file}.sorted"
done
or, you can use the much shorter grail's solution.
|
|
|
1 members found this post helpful.
|
04-12-2012, 10:51 AM
|
#14
|
|
Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 6,310
|
Or we could go bash with a side of bc
Code:
#!/bin/bash
for f in /path/to/files/*
do
tot=0
count=0
while read -r x y
do
if (( y < 100 ))
then
tot=$( echo "$tot + $y / $x" | bc -l )
(( count++ ))
fi
done<"$f"
mean=$( echo "$tot / $count" | bc -l )
echo -e "$f\t$mean"
done
|
|
|
1 members found this post helpful.
|
04-12-2012, 01:44 PM
|
#15
|
|
Member
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 621
|
or sed + bc  )
Code:
for f in file*; do
echo -e "$f\t$((sed -r 's_([0-9]+)\s([0-9]+)_if(\2>100){sum+=\2/\1;cnt+=1}_' "$f";echo "sum/cnt")|bc -l)"
done
|
|
|
1 members found this post helpful.
|
| Thread Tools |
Search this Thread |
|
|
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
All times are GMT -5. The time now is 12:23 AM.
|
|
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.
|
Latest Threads
LQ News
|
|