[SOLVED] Using BASH to automate data processing and table generation.

AnanthaP · 04-12-2012, 07:23 PM

Incidentally, why Excel?

This could just as well have been done within Linux itself, using OpenOffice, LibreOffice, Gnumeric, etc.

OK

danielbmartin · 04-12-2012, 08:43 PM

A round-up of test results.

Here are the five input files. The numeric values were chosen arbitrarily.

Code:

/home/daniel/Desktop/LQfiles/dbm330i01.txt
100 99
300 900
500 2500
200 200
200 8000

/home/daniel/Desktop/LQfiles/dbm330i02.txt
55 133
121 139
117 44
88 117
109 108

/home/daniel/Desktop/LQfiles/dbm330i03.txt
66 133
133 139
118 144
100 167
101 122

/home/daniel/Desktop/LQfiles/dbm330i04.txt
67 141
104 132
112 135
131 148
104 129

/home/daniel/Desktop/LQfiles/dbm330i05.txt
111 128
115 142
127 89
138 141
114 139
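
For anyone who wants to rerun the tests, here is a minimal sketch that recreates the five input files. The scratch directory from mktemp and the variable name dir are my own assumptions, not part of the original setup; point the loops at that directory (or at your own path) instead of /home/daniel/Desktop/LQfiles.

```shell
#!/bin/bash
# Hypothetical scratch directory; the original files lived under
# /home/daniel/Desktop/LQfiles/.
dir=$(mktemp -d)

# One "x y" pair per line, values copied from the listing above.
printf '%s\n' '100 99' '300 900' '500 2500' '200 200' '200 8000' > "$dir/dbm330i01.txt"
printf '%s\n' '55 133' '121 139' '117 44'   '88 117'  '109 108'  > "$dir/dbm330i02.txt"
printf '%s\n' '66 133' '133 139' '118 144'  '100 167' '101 122'  > "$dir/dbm330i03.txt"
printf '%s\n' '67 141' '104 132' '112 135'  '131 148' '104 129'  > "$dir/dbm330i04.txt"
printf '%s\n' '111 128' '115 142' '127 89'  '138 141' '114 139'  > "$dir/dbm330i05.txt"

echo "Input files written to $dir"
```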

Here are the code examples and the results they produced.

Code:

echo
echo "Method of LQ guru grail (using awk)"
[ -e "$OutFile01" ] && rm "$OutFile01"
for file in /home/daniel/Desktop/LQfiles/dbm330i*.txt; do
    awk '$2 < 100{tot[FILENAME]+= $2 / $1;count[FILENAME]++}END{for(f in tot)print f,tot[f]/count[f]}' "$file" >> "$OutFile01"
done

/home/daniel/Desktop/LQfiles/dbm330i01.txt 0.99
/home/daniel/Desktop/LQfiles/dbm330i02.txt 0.376068
/home/daniel/Desktop/LQfiles/dbm330i05.txt 0.700787
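
Files 03 and 04 are missing from this output because none of their rows has a second column below 100, so tot never gets an entry for them and the END loop has nothing to print. A minimal sketch of one way to make that case visible, using the rows of dbm330i03.txt; the temporary file and the "no rows matched" message are my own assumptions, not part of grail's code:

```shell
#!/bin/bash
# Rows copied from dbm330i03.txt above; no y value is below 100.
tmp=$(mktemp)
printf '%s\n' '66 133' '133 139' '118 144' '100 167' '101 122' > "$tmp"

# Same filter as above, but the END block reports an empty selection
# instead of silently printing nothing.
result=$(awk '
    $2 < 100 { tot += $2 / $1; count++ }
    END {
        if (count > 0) print tot / count
        else           print "no rows matched"
    }' "$tmp")

echo "$result"
rm -f "$tmp"
```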

Code:

echo
echo "Method of LQ member millgates (using awk),"
echo " with suggested improvements of LQ guru grail"
[ -e "$OutFile02" ] && rm "$OutFile02"
for file in /home/daniel/Desktop/LQfiles/dbm330i*.txt; do
  awk -v ifname="$file" -v ofname="$OutFile02" '
    BEGIN { sum = 0; num = 0; OFS="\t" }
    $2 > 100 { x=$2/$1; sum+=x; num++;  print $1, $2, x | "sort -nk 2" }
    END { print ifname, sum/num >> ofname }
    ' "$file" > "${file}.sorted"
done

/home/daniel/Desktop/LQfiles/dbm330i01.txt      12.25
/home/daniel/Desktop/LQfiles/dbm330i02.txt      1.47183
/home/daniel/Desktop/LQfiles/dbm330i03.txt      1.4317
/home/daniel/Desktop/LQfiles/dbm330i04.txt      1.38984
/home/daniel/Desktop/LQfiles/dbm330i05.txt      1.15724

Code:

echo
echo "Method of LQ guru grail (using bash)"
[ -e "$OutFile03" ] && rm "$OutFile03"
for f in /home/daniel/Desktop/LQfiles/dbm330i*.txt
do
    tot=0
    count=0
    while read -r x y
    do
	if (( y < 100 ))
	then
	    tot=$( echo "$tot + $y / $x" | bc -l )
	    (( count++ ))
	fi
    done<"$f"
    mean=$( echo "$tot / $count" | bc -l )
    echo -e "$f\t$mean" >> "$OutFile03"
done

/home/daniel/Desktop/LQfiles/dbm330i01.txt      .99000000000000000000
/home/daniel/Desktop/LQfiles/dbm330i02.txt      .37606837606837606837
/home/daniel/Desktop/LQfiles/dbm330i03.txt      
/home/daniel/Desktop/LQfiles/dbm330i04.txt      
/home/daniel/Desktop/LQfiles/dbm330i05.txt      .70078740157480314960
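
The blank means for files 03 and 04 appear to come from count still being 0 when "$tot / $count" is handed to bc: the division by zero fails and nothing reaches standard output. A minimal sketch of a guard, assuming a hypothetical helper named mean_ratio (my name, not grail's) that reads the rows on stdin:

```shell
#!/bin/bash
# mean_ratio is a hypothetical helper, not from the original post.
# Reads "x y" rows on stdin and prints the mean of y/x for rows with y < 100,
# or "n/a" when no row qualifies, so bc is never asked to divide by zero.
mean_ratio() {
    local tot=0 count=0 x y
    while read -r x y
    do
        if (( y < 100 ))
        then
            tot=$( echo "$tot + $y / $x" | bc -l )
            count=$(( count + 1 ))
        fi
    done
    if (( count > 0 ))
    then
        echo "$tot / $count" | bc -l
    else
        echo "n/a"
    fi
}

# Rows from dbm330i03.txt: no y value is below 100, so the guard kicks in.
printf '%s\n' '66 133' '133 139' '118 144' '100 167' '101 122' | mean_ratio
```

Called as mean_ratio < "$f" inside the loop, this should leave the results for files 01, 02 and 05 unchanged and put n/a in place of the blanks.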

Code:

echo
echo "Method of LQ member millgates (using sed+bc)"
files=( /home/daniel/Desktop/LQfiles/dbm330i*.txt )
[ -e "$OutFile04" ] && rm "$OutFile04"
for f in /home/daniel/Desktop/LQfiles/dbm330i*.txt; do
    echo -e "$f\t$( (sed -r 's_([0-9]+)\s([0-9]+)_if(\2>100){sum+=\2/\1;cnt+=1}_' "$f";echo "sum/cnt")|bc -l)" >> "$OutFile04"
done

/home/daniel/Desktop/LQfiles/dbm330i01.txt      12.25000000000000000000
/home/daniel/Desktop/LQfiles/dbm330i02.txt      1.47182832284479490484
/home/daniel/Desktop/LQfiles/dbm330i03.txt      1.43170481444729154959
/home/daniel/Desktop/LQfiles/dbm330i04.txt      1.38984422635584763874
/home/daniel/Desktop/LQfiles/dbm330i05.txt      1.15724328447440575586
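
The sed+bc one-liner is easier to follow once you see what sed hands to bc: each "x y" row is rewritten into a bc if-statement, and the trailing "sum/cnt" prints the mean. A minimal sketch with two rows from dbm330i02.txt; the variable name generated is my own, and the -r option and \s class assume GNU sed:

```shell
#!/bin/bash
# Two rows from dbm330i02.txt; sed rewrites each into a bc "if" statement.
# Assumes GNU sed (-r and \s are extensions).
generated=$(printf '%s\n' '117 44' '88 117' |
    sed -r 's_([0-9]+)\s([0-9]+)_if(\2>100){sum+=\2/\1;cnt+=1}_')

# Show the program that would be piped into "bc -l".
echo "$generated"
echo "sum/cnt"
```

bc starts sum and cnt at zero, so no initialisation is needed; only rows that pass the if-test contribute to either.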

The results are not all alike. For those who contributed code: please examine my rendition of your post to make sure I didn't botch it.

Daniel B. Martin

ta0kira · 04-12-2012, 10:04 PM

Quote:

Originally Posted by danielbmartin

The results are not all alike. For those who contributed code: please examine my rendition of your post to make sure I didn't botch it.

You really botched mine! Here's its output with your example data (after adapting it to use space instead of comma to parse input):

Code:

"dbm330i01.txt",12.25
"dbm330i02.txt",1.47182832284479
"dbm330i03.txt",1.43170481444729
"dbm330i04.txt",1.38984422635585
"dbm330i05.txt",1.15724328447441

Does that mean I win?
Kevin Barry

grail · 04-13-2012, 12:30 AM

Well, for mine you can see that the two scripts produce the same output apart from the number of digits after the decimal point, which is an easy fix on either side. The reason they differ from the others is that I used y < 100 and they used y > 100. This was driven by the second requirement:

Quote:

2) Delete all data points whose y-coordinate (column 2) was less than a specified value, in this case 100.

I of course read this wrong (or too quickly, as is normally the case) and saw only "100" and "less than".

So making mine agree with the others is again a simple change:

Code:

#Awk
awk '$2 >= 100{tot[FILENAME]+= $2 / $1;count[FILENAME]++}END{for(f in tot)print f,tot[f]/count[f]}' /home/daniel/Desktop/LQfiles/dbm330i*.txt

#Bash
#!/bin/bash

for f in /home/daniel/Desktop/LQfiles/dbm330i*.txt
do
    tot=0
    count=0
    while read -r x y
    do
	if (( y >= 100 ))
	then
	    tot=$( echo "scale=6; $tot + $y / $x" | bc )
	    (( count++ ))
	fi
    done<"$f"
    mean=$( echo "scale=6; $tot / $count" | bc )
    echo -e "$f\t$mean" >> "$OutFile03"
done
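
Since the requirement speaks of "a specified value", the cut-off could also be passed in rather than hard-coded. A minimal sketch, assuming a hypothetical helper mean_above and an awk variable min (both names are mine, not from the thread), fed with the rows of dbm330i01.txt:

```shell
#!/bin/bash
# mean_above is a hypothetical helper; its argument is the cut-off for column 2.
# Rows whose y value is below the cut-off are dropped, the rest are averaged as y/x.
mean_above() {
    awk -v min="$1" '
        $2 >= min { tot += $2 / $1; count++ }
        END { if (count) printf "%.6f\n", tot / count }'
}

# Rows from dbm330i01.txt; only "100 99" falls below the cut-off of 100.
printf '%s\n' '100 99' '300 900' '500 2500' '200 200' '200 8000' | mean_above 100
```

With this sample data no y value is exactly 100, so >= and > select the same rows.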

danielbmartin · 04-13-2012, 09:20 AM

Test results, redux.

Thank you, grail, for minor corrections to your code. With those changes all results are equivalent.

Apologies to you, ta0kira, for omitting your code. No offense intended. As an inexperienced player in the Linux world, I had never even heard of Rscript. Consequently, I was unable to understand your code or to execute it.

Let's hope that OP benefits from the many ideas presented in this thread. For sure, I did.

Thanks to all who contributed in any fashion.

Daniel B. Martin