LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   -   Using BASH to automate data processing and table generation. (http://www.linuxquestions.org/questions/programming-9/using-bash-to-automate-data-processing-and-table-generation-939346/)

marcusrex 04-11-2012 04:29 PM

Using BASH to automate data processing and table generation.
 
Hello,

I have a BASH scripting question because I am almost certain it could have helped me recently. I have twenty individual text files. Each text file contains the x,y coordinates of 20,000 datapoints. Each text file has two columns of 20,000 rows. The first column is the x-coordinate and the second column is the y-coordinate for a dataset. It may not be obvious, but the x and y coordinates for each point have to "stay together". In brief, I performed the following sequence of actions manually using Excel.

1) Open each file and sort both columns of the data based upon column 2; the sort was ascending on column 2.

2) Delete all data points whose y-coordinate (column 2) was less than a specified value, in this case 100.

3) Divide each value in column 2 by its corresponding value in column 1 and place the result in column 3. (For Excel users, column C was =B2/A2)

4) Calculate the mean for column 3

5) Generate a simple two-column table containing the file names in column 1 and each file's corresponding column 3 mean in column 2.


I am almost certain that if I knew how to do it, a BASH script would have completed this whole sequence of tasks automatically and in a very short period of time. I am writing to ask for the forum's help. Can this be done with BASH scripting, and if so, what commands should I learn about? Sample scripts are most welcome.

Thanks!

Mark

Kustom42 04-11-2012 04:51 PM

That's quite a confusing set of steps you are taking. To answer your question: yes, basically anything can be done with bash nowadays.

You will obviously want to learn about bash's built-in math functions. As far as getting the data from the columns, there are a few different ways to accomplish it, such as a simple cut command or an in-depth awk statement. You can use the sort command to sort on the second column and then use some bash if statements to test values and such. These are just general pointers. If I could see the data files in front of me (or an example), it would give me a better idea of what your final goal is, but as it stands I'm having a hard time following what the actual output of your steps was.
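As a rough sketch of those pointers (the filename data.txt and the cutoff of 100 here are assumptions taken from the original post, not tested against real data), filtering on the second column and then sorting might look like:

```shell
# data.txt is a hypothetical two-column, whitespace-separated file.
# Keep only rows whose second column is at least 100, then sort
# numerically on that column.
awk '$2 >= 100' data.txt | sort -n -k2,2 > data.sorted
```

Filtering first also means sort has fewer rows to work with.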

ta0kira 04-11-2012 06:59 PM

Quote:

Originally Posted by marcusrex (Post 4650507)
I have a BASH scripting question because I am almost certain it could have helped me recently. I have twenty individual text files. Each text file contains the x,y coordinates of 20,000 datapoints. Each text file has two columns of 20,000 rows. The first column is the x-coordinate and the second column is the y-coordinate for a dataset. It may not be obvious, but the x and y coordinates for each point have to "stay together". In brief, I performed the following sequence of actions manually using Excel.

bash doesn't seem like the appropriate tool. I have no doubt it can be done and that there will be some creative awk+bc atrocity to handle it, but it sounds like you need a real interpreted language. Do you have R installed? If so, here is a command-line script that might work for you. This assumes that the data is in CSV format. If not, it can be changed to use a different separator.
Code:

#!/usr/bin/Rscript --vanilla

#store the arguments passed to the script in a list
try(args <- commandArgs(TRUE),silent=TRUE)

#variable to store the table of means
all.means <- NULL

for (filename in args)
{
  try({
    #load the data from the file
    values <- data.frame(read.csv(filename,header=FALSE))

    #the stuff you described
    values <- values[values[,2]>=100,]
    values <- values[order(values[,2]),]
    values <- cbind(values,values[,2]/values[,1])

    #append the filename and mean for the current file to the table
    all.means <- rbind(all.means,data.frame(filename=filename,mean=mean(values[,3])))

    #save the revised data to a file (appending "~" to the end of the NEW filename)
    write.table(values,file=paste(filename,'~',sep=''),row.names=FALSE,col.names=FALSE,sep=',')
  })
}

#print the table of means to standard output (or replace 'stdout()' with a filename)
write.table(all.means,file=stdout(),row.names=FALSE,col.names=FALSE,sep=',')

Kevin Barry

PS If you have (or decide to install) R, Rscript might be in a different location, in which case you'd have to change /usr/bin/Rscript on the first line.

Kustom42 04-11-2012 07:02 PM

Ooo, an R script... That's a great language if you know it; python or perl would also do well. I would say pick one and learn it well rather than trying to learn a little about all of them. Python is my choice.

ta0kira 04-11-2012 07:11 PM

Quote:

Originally Posted by Kustom42 (Post 4650593)
Ooo, an R script... That's a great language if you know it; python or perl would also do well. I would say pick one and learn it well rather than trying to learn a little about all of them. Python is my choice.

I'm learning python, but I've done a lot of math programming in R so I knew I'd be able to throw that together quickly. I'd also prefer C++ to bash for this one, but I'm not one to offer a compiled solution in response to a request for a script (as of today anyway; the past is a little hazy.)
Kevin Barry

sundialsvcs 04-11-2012 08:51 PM

The point is, I think ... Linux gives you dozens of most-excellent real programming languages to choose from. All of them are free. All of them are designed to do what bash is, quite honestly, not designed to do.

The first step in solving any problem is: choosing the right tool for the job. You simply don't earn any Brownie points for "wedging" an inappropriate round peg into a square hole, even if you happen to succeed.

ta0kira 04-11-2012 10:29 PM

Quote:

Originally Posted by sundialsvcs (Post 4650670)
The first step in solving any problem is: choosing the right tool for the job. You simply don't earn any Brownie points for "wedging" an inappropriate round peg into a square hole, even if you happen to succeed.

Easy for a programmer to say! For someone who isn't an experienced programmer, any programming task is like forcing a round peg into a square hole. That being said, everyone should become an experienced programmer.
Kevin Barry

marcusrex 04-12-2012 02:07 AM

Haha! I'm really enjoying LQ so far. Lots of great responses and, more importantly, good advice. Let's see here... a few people made reference to square pegs and round holes. I'm a pro scientist at the moment, which I mention to make it clear that I am completely used to pounding on square pegs. Anyhow, my specialty doesn't generally require me to write code. Usually there's a commercial or excellent open source software package that meets my needs. However, there are these little things, like the task I mentioned in my post, where automation would really help, and there I have NO experience. Thank you for all of your advice so far.

In the back of my mind I was wondering if a real programming language like C++, python, etc, might be the better tool. One thing that really impresses me about UNIX and Linux is their standard toolchests. I thought maybe there were enough tools in BASH scripting that a wise computer literate person might choose that route over a programming language solution. Anyhow, I have learned that I have more to learn.

Thanks to all so far!

millgates 04-12-2012 04:21 AM

Quote:

Originally Posted by marcusrex (Post 4650507)
Hello,

I have a BASH scripting question because I am almost certain it could have helped me recently. I have twenty individual text files. Each text file contains the x,y coordinates of 20,000 datapoints. Each text file has two columns of 20,000 rows. The first column is the x-coordinate and the second column is the y-coordinate for a dataset. It may not be obvious, but the x and y coordinates for each point have to "stay together". In brief, I performed the following sequence of actions manually using Excel.

1) Open each file and sort both columns of the data based upon column 2; the sort was ascending on column 2.

2) Delete all data points whose y-coordinate (column 2) was less than a specified value, in this case 100.

3) Divide each value in column 2 by its corresponding value in column 1 and place the result in column 3. (For Excel users, column C was =B2/A2)

4) Calculate the mean for column 3

5) Generate a simple two-column table containing the file names in column 1 and each file's corresponding column 3 mean in column 2.


I am almost certain that if I knew how to do it, a BASH script would have completed this whole sequence of tasks automatically and in a very short period of time. I am writing to ask for the forum's help. Can this be done with BASH scripting, and if so, what commands should I learn about? Sample scripts are most welcome.

Thanks!

Mark

Just curious...
a) What's the point of sorting the values in step 1?
b) Wouldn't it be more efficient to perform step 1 after step 2? (Step 2 will be O(N), while step 1 will be O(N^2) or O(N log N), depending on your sorting algorithm.)
c) To let the beast out of the cage, how about something like this? (not tested):

Code:

...
[ -e "$outfile" ] && rm "$outfile"

for file in $files; do
    awk -v ifname="$file" -v ofname="$outfile" '
        BEGIN { sum = 0; num = 0; }
        $2 > 100 { x=$2/$1, sum+=x; num++;  print $1, $2, x | sort -nk 2 }
        END { print ifname, sum/num >> $outfile }
    ' > "${file}.sorted"
done


grail 04-12-2012 07:26 AM

Maybe something like:
Code:

awk '$2 >= 100{tot[FILENAME]+= $2 / $1;count[FILENAME]++}END{for(f in tot)print f,tot[f]/count[f]}' file1 file2 ...

danielbmartin 04-12-2012 07:28 AM

Quote:

Originally Posted by millgates (Post 4650867)
c) to let the beast out of the cage, how about something like this? (not tested) ...

As a learning exercise I tried to test this code. I built five small sample input files named /home/daniel/Desktop/LQfiles/dbm330i01.txt through /home/daniel/Desktop/LQfiles/dbm330i05.txt. I embedded your code in a script named /home/daniel/Desktop/LQfiles/dbm330.bin.

As a first order of business I attempted to create a list of file names for the input files with this ...
Code:

files=ls /home/daniel/Desktop/LQfiles/dbm330i*.txt
... and a test execution resulted in this ...
Code:

daniel@daniel-desktop:~$ bash /home/daniel/Desktop/LQfiles/dbm330.bin
/home/daniel/Desktop/LQfiles/dbm330.bin:
 line 29: /home/daniel/Desktop/LQfiles/dbm330i01.txt: Permission denied

I am a raw newbie; this is the first time I ever used the ls command and know nothing about permissions. Where do I start?

Daniel B. Martin

grail 04-12-2012 08:11 AM

Well, the first rule to learn is that you generally do not want to parse the output of ls, although in this case it may not have hurt. See here for reasons why.

Assuming we do not have to be concerned with word splitting, i.e. no whitespace is involved in file or path names, the simplest way to create your array would be:
Code:

files=( /home/daniel/Desktop/LQfiles/dbm330i*.txt )
The downside here is that if there are no files matching that pattern, the only item in the array will be exactly what you typed, asterisk and all, which of course probably does not exist.
A safer alternative, which not only helps with that scenario but also with word splitting, is to have the for loop use the same globbing construct:
Code:

for f in /home/daniel/Desktop/LQfiles/dbm330i*.txt
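If the no-match case is still a worry, another option (my addition, not something the loop above requires) is bash's nullglob shell option:

```shell
# With nullglob set, a glob that matches nothing expands to nothing,
# so the loop body simply never runs (instead of running once with
# the literal pattern, asterisk and all).
shopt -s nullglob
for f in /home/daniel/Desktop/LQfiles/dbm330i*.txt; do
    echo "processing $f"
done
shopt -u nullglob
```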
Hope that helps.

millgates 04-12-2012 08:26 AM

Ok, there were quite a few bugs in my code; this fixed version should work, though:
Code:

#!/bin/bash

outfile='table.txt'
[ -e "$outfile" ] && rm "$outfile"

for file in [whatever matches your files]; do
    awk -v ifname="$file" -v ofname="$outfile" '
        BEGIN { sum = 0; num = 0; OFS="\t" }
        $2 >= 100 { x=$2/$1; sum+=x; num++;  print $1, $2, x | "sort -n -k2,2" }
        END { print ifname, sum/num >> ofname }
    ' "$file" > "${file}.sorted"
done

or, you can use grail's much shorter solution.

grail 04-12-2012 10:51 AM

Or we could go bash with a side of bc :)
Code:

#!/bin/bash

for f in /path/to/files/*
do
    tot=0
    count=0

    while read -r x y
    do
        if (( y >= 100 ))
        then
            tot=$( echo "$tot + $y / $x" | bc -l )
            (( count++ ))
        fi
    done<"$f"

    mean=$( echo "$tot / $count" | bc -l )

    echo -e "$f\t$mean"
done


millgates 04-12-2012 01:44 PM

or sed + bc :))
Code:

for f in file*; do
    echo -e "$f\t$( (sed -r 's_([0-9]+)\s([0-9]+)_if(\2>=100){sum+=\2/\1;cnt+=1}_' "$f"; echo "sum/cnt") | bc -l)"
done



All times are GMT -5. The time now is 07:40 PM.