LinuxQuestions.org
Old 04-11-2012, 04:29 PM   #1
marcusrex
LQ Newbie
 
Registered: Mar 2012
Posts: 15

Rep: Reputation: Disabled
Using BASH to automate data processing and table generation.


Hello,

I have a BASH scripting question because I am almost certain it could have helped me recently. I have twenty individual text files. Each text file contains the x,y coordinates of 20,000 datapoints: two columns of 20,000 rows, where the first column is the x-coordinate and the second column is the y-coordinate for a dataset. It may not be obvious, but the x and y coordinates for each point have to "stay together". In brief, I performed the following sequence of actions manually using Excel.

1) Open each file and sort both columns of the data based upon column 2; the sort was ascending on column 2.

2) Delete all data points whose y-coordinate (column 2) was less than a specified value, in this case 100.

3) Divide each value in column 2 by its corresponding value in column 1 and place the result in column 3. (For Excel users, column C was =B2/A2.)

4) Calculate the mean for column 3

5) Generate a simple two-column table containing the file names in column 1 and each file's column 3 mean in column 2.
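The five steps above map onto standard shell tools roughly as follows. This is an illustrative, untested sketch only: it assumes whitespace-separated numeric columns, the cutoff of 100, and a tiny made-up sample file standing in for one of the twenty data files.

```shell
cd "$(mktemp -d)"                            # work in a scratch directory
printf '2 50\n2 200\n4 400\n' > sample.txt   # stand-in for one data file

for f in sample.txt; do                # in practice: for f in *.txt
    # Steps 1-2: sort ascending on column 2, drop rows with y < 100.
    # Steps 3-4: accumulate y/x for each surviving row and average it.
    sort -n -k2,2 "$f" | awk -v name="$f" '
        $2 >= 100 { sum += $2 / $1; n++ }
        END { if (n) print name, sum / n }'
done > means.txt                       # Step 5: one "filename mean" line per file

cat means.txt
```

Because sort and awk both operate on whole lines, each x stays with its y throughout.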


I am almost certain that if I knew how to do it, a BASH script would have completed this whole sequence of tasks automatically and in a very short period of time. I am writing to ask for the forum's help. Can this be done with BASH scripting, and if so, what commands should I learn about? Sample scripts are most welcome.

Thanks!

Mark
 
Old 04-11-2012, 04:51 PM   #2
Kustom42
Senior Member
 
Registered: Mar 2012
Distribution: Red Hat
Posts: 1,565

Rep: Reputation: 410
That's quite a confusing set of steps you are taking. To answer your question: yes, basically anything can be done with bash nowadays.

You will obviously want to learn about bash's built-in arithmetic. As far as getting the data from the columns, there are a few different ways to accomplish it, such as a simple cut command or an in-depth awk statement. You can use the sort command to sort on the second column and then do some bash if statements to test values and such. These are general pointers; if I could see the data files (or an example), it would give me a better idea of what your final goal is, but I am having a hard time following what the actual output of your steps was.
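Concretely, the commands mentioned above behave like this on a toy two-column file (illustrative only, assuming single-space-separated numeric columns):

```shell
cd "$(mktemp -d)"                    # scratch directory
printf '3 90\n1 150\n2 120\n' > demo.txt   # toy x y data

cut -d' ' -f2 demo.txt     # extract column 2 only
sort -n -k2,2 demo.txt     # sort rows numerically on column 2
awk '$2 >= 100' demo.txt   # keep rows whose column 2 is at least 100
```

Note that sort and awk act on whole lines, so each point's x and y stay together; cut alone would separate them.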
 
Old 04-11-2012, 06:59 PM   #3
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by marcusrex View Post
I have a BASH scripting question because I am almost certain it could have helped me recently. I have twenty individual text files. Each text file contains the x,y coordinates of 20,000 datapoints: two columns of 20,000 rows, where the first column is the x-coordinate and the second column is the y-coordinate for a dataset. It may not be obvious, but the x and y coordinates for each point have to "stay together". In brief, I performed the following sequence of actions manually using Excel.
bash doesn't seem like the appropriate tool. I have no doubt it can be done and that there will be some creative awk+bc atrocity to handle it, but it sounds like you need a real interpreted language. Do you have R installed? If so, here is a command-line script that might work for you. This assumes that the data is in CSV format. If not, it can be changed to use a different separator.
Code:
#!/usr/bin/Rscript --vanilla

#store the arguments passed to the script in a list
try(args <- commandArgs(TRUE),silent=TRUE)

#variable to store the table of means
all.means <- NULL

for (filename in args)
{
  try({
    #load the data from the file
    values <- data.frame(read.csv(filename,header=FALSE))

    #the stuff you described
    values <- values[values[,2]>=100,]
    values <- values[order(values[,2]),]
    values <- cbind(values,values[,2]/values[,1])

    #append the filename and mean for the current file to the table
    all.means <- rbind(all.means,data.frame(filename=filename,mean=mean(values[,3])))

    #save the revised data to a file (appending "~" to the end of the NEW filename)
    write.table(values,file=paste(filename,'~',sep=''),row.names=FALSE,col.names=FALSE,sep=',')
  })
}

#print the table of means to standard output (or replace 'stdout()' with a filename)
write.table(all.means,file=stdout(),row.names=FALSE,col.names=FALSE,sep=',')
Kevin Barry

PS If you have (or decide to install) R, Rscript might be in a different location, in which case you'd have to change /usr/bin/Rscript on the first line.

Last edited by ta0kira; 04-11-2012 at 07:03 PM.
 
Old 04-11-2012, 07:02 PM   #4
Kustom42
Senior Member
 
Registered: Mar 2012
Distribution: Red Hat
Posts: 1,565

Rep: Reputation: 410Reputation: 410Reputation: 410Reputation: 410Reputation: 410
Ooo, R script. That's a great language if you know it; python or perl would also do well. I would say pick one and learn it well; don't try to learn a little about all of them. Python is my choice.
 
Old 04-11-2012, 07:11 PM   #5
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by Kustom42 View Post
Ooo, R script. That's a great language if you know it; python or perl would also do well. I would say pick one and learn it well; don't try to learn a little about all of them. Python is my choice.
I'm learning python, but I've done a lot of math programming in R so I knew I'd be able to throw that together quickly. I'd also prefer C++ to bash for this one, but I'm not one to offer a compiled solution in response to a request for a script (as of today anyway; the past is a little hazy.)
Kevin Barry
 
Old 04-11-2012, 08:51 PM   #6
sundialsvcs
Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 5,330

Rep: Reputation: 1100
The point is, I think ... Linux gives you dozens of most-excellent real programming languages to choose from. All of them are free. All of them are designed to do what bash is, quite honestly, not designed to do.

The first step in solving any problem is choosing the right tool for the job. You simply don't earn any Brownie points for "wedging" an inappropriate round peg into a square hole, even if you happen to succeed.
 
Old 04-11-2012, 10:29 PM   #7
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by sundialsvcs View Post
The first step in solving any problem is choosing the right tool for the job. You simply don't earn any Brownie points for "wedging" an inappropriate round peg into a square hole, even if you happen to succeed.
Easy for a programmer to say! For someone who isn't an experienced programmer, any programming task is like forcing a round peg into a square hole. That being said, everyone should become an experienced programmer.
Kevin Barry
 
Old 04-12-2012, 02:07 AM   #8
marcusrex
LQ Newbie
 
Registered: Mar 2012
Posts: 15

Original Poster
Rep: Reputation: Disabled
Haha! I'm really enjoying LQ so far. Lots of great responses and, more importantly, good advice. Let's see here... a few people made reference to square pegs and round holes. I'm a professional scientist at the moment, which I mention to make it clear that I am completely used to pounding on square pegs. Anyhow, my specialty doesn't generally require me to write code. Usually there's a commercial or excellent open source software package that meets my needs. However, there are these little things, like the task I mentioned in my post, where automation would really help, and there I have NO experience. Thank you for all of your advice so far.

In the back of my mind I was wondering if a real programming language like C++, python, etc, might be the better tool. One thing that really impresses me about UNIX and Linux is their standard toolchests. I thought maybe there were enough tools in BASH scripting that a wise computer literate person might choose that route over a programming language solution. Anyhow, I have learned that I have more to learn.

Thanks to all so far!
 
Old 04-12-2012, 04:21 AM   #9
millgates
Member
 
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 651

Rep: Reputation: 269
Quote:
Originally Posted by marcusrex View Post
Hello,

I have a BASH scripting question because I am almost certain it could have helped me recently. I have twenty individual text files. Each text file contains the x,y coordinates of 20,000 datapoints: two columns of 20,000 rows, where the first column is the x-coordinate and the second column is the y-coordinate for a dataset. It may not be obvious, but the x and y coordinates for each point have to "stay together". In brief, I performed the following sequence of actions manually using Excel.

1) Open each file and sort both columns of the data based upon column 2; the sort was ascending on column 2.

2) Delete all data points whose y-coordinate (column 2) was less than a specified value, in this case 100.

3) Divide each value in column 2 by its corresponding value in column 1 and place the result in column 3. (For Excel users, column C was =B2/A2.)

4) Calculate the mean for column 3

5) Generate a simple two-column table containing the file names in column 1 and each file's column 3 mean in column 2.


I am almost certain that if I knew how to do it, a BASH script would have completed this whole sequence of tasks automatically and in a very short period of time. I am writing to ask for the forum's help. Can this be done with BASH scripting, and if so, what commands should I learn about? Sample scripts are most welcome.

Thanks!

Mark
Just curious...
a) What's the point of sorting the values in step 1?
b) Wouldn't it be more efficient to perform step 1 after step 2? (Step 2 is O(N), while step 1 is O(N log N) or O(N^2) depending on the sorting algorithm, so filtering first means sorting fewer rows.)
c) to let the beast out of the cage, how about something like this? (not tested):

Code:
...
[ -e "$outfile" ] && rm "$outfile"

for file in $files; do
    awk -v ifname="$file" -v ofname="$outfile" '
        BEGIN { sum = 0; num = 0; }
        $2 > 100 { x=$2/$1, sum+=x; num++;  print $1, $2, x | sort -nk 2 }
        END { print ifname, sum/num >> $outfile }
    ' > "${file}.sorted"
done

Last edited by millgates; 04-12-2012 at 04:47 AM.
 
1 member found this post helpful.
Old 04-12-2012, 07:26 AM   #10
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,478

Rep: Reputation: 1888
Maybe something like:
Code:
awk '$2 >= 100{tot[FILENAME]+= $2 / $1;count[FILENAME]++}END{for(f in tot)print f,tot[f]/count[f]}' file1 file2 ...
 
1 member found this post helpful.
Old 04-12-2012, 07:28 AM   #11
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,083

Rep: Reputation: 287
Quote:
Originally Posted by millgates View Post
c) to let the beast out of the cage, how about something like this? (not tested)
As a learning exercise I tried to test this code. I built five small sample input files named /home/daniel/Desktop/LQfiles/dbm330i01.txt through /home/daniel/Desktop/LQfiles/dbm330i05.txt. I embedded your code in a script named /home/daniel/Desktop/LQfiles/dbm330.bin.

As a first order of business I attempted to create a list of file names for the input files with this ...
Code:
files=ls /home/daniel/Desktop/LQfiles/dbm330i*.txt
... and a test execution resulted in this ...
Code:
daniel@daniel-desktop:~$ bash /home/daniel/Desktop/LQfiles/dbm330.bin
/home/daniel/Desktop/LQfiles/dbm330.bin:
 line 29: /home/daniel/Desktop/LQfiles/dbm330i01.txt: Permission denied
I am a raw newbie; this is the first time I ever used the ls command and know nothing about permissions. Where do I start?

Daniel B. Martin
 
Old 04-12-2012, 08:11 AM   #12
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,478

Rep: Reputation: 1888
Well the first rule to learn is generally you do not want to use ls, although in this case it may not have hurt. See here for reasons why.

Assuming we do not have to be concerned with word splitting, i.e. no whitespace is involved in file or path names, the simplest way to create your array would be:
Code:
files=( /home/daniel/Desktop/LQfiles/dbm330i*.txt )
The downside here is that if no files match the pattern, the only item in the array will be exactly what you typed, asterisk and all, which of course probably does not exist.
A safer alternative, which not only avoids that scenario but also handles word splitting, is to have the for loop use the same globbing construct:
Code:
for f in /home/daniel/Desktop/LQfiles/dbm330i*.txt
Hope that helps.
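A quick way to see the unmatched-glob behavior described above (a sketch; `nullglob` is the usual bash fix for that case):

```shell
cd "$(mktemp -d)"            # empty directory, so the glob matches nothing

files=( no_such_*.txt )
echo "${#files[@]}"          # prints 1: the array holds the literal pattern

shopt -s nullglob            # unmatched globs now expand to nothing
files=( no_such_*.txt )
echo "${#files[@]}"          # prints 0: the array is empty
```

With nullglob set, a `for f in pattern*` loop over a non-matching pattern simply runs zero times instead of once with the literal pattern.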
 
1 member found this post helpful.
Old 04-12-2012, 08:26 AM   #13
millgates
Member
 
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 651

Rep: Reputation: 269
Ok, there were quite a few bugs in my code; this fixed version should work, though:
Code:
#!/bin/bash

outfile='table.txt'
[ -e "$outfile" ] && rm "$outfile"

for file in [whatever matches your files]; do
    awk -v ifname="$file" -v ofname="$outfile" '
        BEGIN { sum = 0; num = 0; OFS="\t" }
        $2 > 100 { x=$2/$1; sum+=x; num++; print $1, $2, x | "sort -nk 2" }
        END { print ifname, sum/num >> ofname }
    ' "$file" > "${file}.sorted"
done
or you can use grail's much shorter solution.
 
1 member found this post helpful.
Old 04-12-2012, 10:51 AM   #14
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,478

Rep: Reputation: 1888
Or we could go bash with a side of bc
Code:
#!/bin/bash

for f in /path/to/files/*
do
    tot=0
    count=0

    while read -r x y
    do
        if (( y >= 100 ))
        then
            tot=$( echo "$tot + $y / $x" | bc -l )
            (( count++ ))
        fi
    done<"$f"

    mean=$( echo "$tot / $count" | bc -l )

    echo -e "$f\t$mean"
done
 
1 member found this post helpful.
Old 04-12-2012, 01:44 PM   #15
millgates
Member
 
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 651

Rep: Reputation: 269
or sed + bc :)
Code:
for f in file*; do
    echo -e "$f\t$( (sed -r 's_([0-9]+)\s([0-9]+)_if(\2>100){sum+=\2/\1;cnt+=1}_' "$f"; echo "sum/cnt") | bc -l)"
done
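What that sed stage actually hands to bc can be seen by dropping the bc pipe (a sketch; `-r` and `\s` are GNU sed features, so BSD sed would need `-E` and `[[:space:]]` instead):

```shell
cd "$(mktemp -d)"                # scratch directory
printf '2 200\n4 50\n' > file1   # toy x y data

# Each "x y" line becomes one bc statement; bc -l then evaluates
# those statements followed by the final "sum/cnt" expression.
sed -r 's_([0-9]+)\s([0-9]+)_if(\2>100){sum+=\2/\1;cnt+=1}_' file1
```

So the shell never does the arithmetic here; sed generates a small bc program from the data, and bc computes the mean.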
 
1 member found this post helpful.
  

