Help working on a script to search for specific data.
Hello and thanks ahead of time for any help.
I'm working on a project for work: taking a text file generated by a spectrum analyzer and reading every 28th line into a new file. The data looks like this:

05/02/05,17:40:40,1,853.26250,-120,853.83750,-85,854.51250 ...

The lines are actually much longer, but what matters is that it grabs data 28 times per second and I only need the data once per second, so every 28th line is what I need. The format is just date,time,duration,data. The duration in the line above is "1"; that number counts up on each new line:

05/02/05,17:40:40,1,853.26250,-120,853.83750,-85,854.51250, ...
05/02/05,17:40:40,2,853.26250,-120,853.83750,-85,854.51250, ...
05/02/05,17:40:40,3,853.26250,-120,853.83750,-85,854.51250, ...
05/02/05,17:40:40,4,853.26250,-120,853.83750,-85,854.51250, ...

So I thought I would write a script to grab every 28th line and output it to a new file. This is what I came up with:
Code:
clear
line_number_count=0
echo "Enter the Input File Name "
echo "Make sure to include the file's location path "
read file_name_input
echo "Enter the Output File Name and path"
read output_file_name
echo "You entered the following information "
echo "input file: $file_name_input "
echo "output file: $output_file_name "
echo "Is this correct? Enter (yes,no) "
read yes_no
if [ "$yes_no" = "yes" ]
then
    while [ "$line_number_count" != 65000 ]
    do
        echo "processing"
        grep ',[$line_number_count+28],' > "$output_file_name"
    done
else
    echo "You indicated that the path and/or file name entered"
    echo "was incorrect, so the script has now exited. Please"
    echo "rerun this script and enter the correct information."
    echo "Have a nice day"
fi
Suffice it to say it doesn't work, and I'm not a programmer by trade, so any help would be appreciated. |
Replace the loop and grep with:
Code:
sed -n '1~28p' "$inputfile" > "$outputfile"
Your script was a tad too long for the job ;) Cheers, Tink |
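A side note on portability: the `1~28p` (first~step) address is a GNU sed extension. On a system without GNU sed, an equivalent awk filter does the same thing. A minimal sketch (the sample file is made up for illustration):

```shell
# Sample input: 60 numbered lines standing in for analyzer output
seq 60 > input.txt

# Keep line 1 and every 28th line after it (lines 1, 29, 57, ...)
awk 'NR % 28 == 1' input.txt > output.txt
cat output.txt
```

`NR` is awk's built-in line counter, so this works in any POSIX awk.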
Thanks, I'll try it when I get back to work in the morning. I'll let you know how it works.
|
And, did it? :}
Cheers, Tink |
Yes, it worked, but it turned out I needed something that would work at different frequencies which scanned at different intervals, anywhere from 10 times a second all the way up to 1000. Unfortunately I didn't know that until today, when all the actual field data came in; the test data was just too uniform. But I did get it to work by using the uniq command and comparing only the first 17 characters, which in my data are the date and time. With the uniq command the data gets stripped so that only one line for each unique time is left.
Thanks for the help though, you sent me in the right direction. I'll probably have another question at some point, but I have to process 67 GB of data first. |
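A minimal sketch of that uniq trick, assuming GNU uniq (the `-w`/`--check-chars` option) and a sample file made up to match the format described above — the first 17 characters are exactly `date,time` here:

```shell
# Duration counts up within the same second; the first 17 characters
# (8 date chars + comma + 8 time chars) identify each unique second
cat > scans.txt <<'EOF'
05/02/05,17:40:40,1,853.26250,-120
05/02/05,17:40:40,2,853.26250,-120
05/02/05,17:40:41,1,853.26250,-119
EOF

# -w 17 compares only the first 17 characters of each line (GNU uniq),
# keeping the first line of each run of equal prefixes
uniq -w 17 scans.txt
```

This keeps the first scan of each second and drops the rest, regardless of whether the analyzer recorded 10 or 1000 lines in that second.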
Heh - that sounds like the amount of data we gathered from
echo sounders - just at a much higher res ;} Cheers, Tink |
Fortunately we're only scanning 20 specific channels for this project. I wrote a script in MATLAB to do all the work, and it did work, but it took hours to compute a 250 MB file. The uniq command does the same thing in seconds.
My next task (one of many) is to take the data and pull out each frequency and power combination along with their time stamps. For instance:

05/02/05,17:40:40,1,853.26250,-120
05/02/05,17:40:40,2,853.26250,-120

and

05/02/05,17:40:40,1,854.26250,-96
05/02/05,17:40:40,2,854.26250,-96

where the first part is the date, time, and duration (05/02/05,17:40:40,1) and the second part is the frequency and power (853.26250,-120). The problem, of course, is that the program records all the frequencies on one line, like:

date, time, duration, freq1, power1, freq2, power2, freq3, power3, ...

so separating them out will be a trick. I have done it in MATLAB, but it takes a while to process, and shaving time off by doing it in a shell script means that when the real data comes in I'll be able to process it within a day or two instead of a week or two. I'm thinking of doing a grep for the frequencies, since they don't change, but that grabs the whole line and not just the chunk I need. If you've got any suggestions I'd be glad to take them. |
In that scenario of separating those long lines out: do you need to prepend every frequency/power pair with the date, time, duration tuple? Either way, awk seems like a good tool ;}

awk -F, '{for(i=4; i < NF; i+=2){printf "(formatting here)\n", $1,$2,$3,$i,$(i+1)}}' file

You get the idea :}

[edit] actual data: data.txt
Code:
05/02/05,17:40:40,1,853.26250,-120,53.26250,-10,83.250,-20,853.2,-70 Code:
awk -F, '{for(i=4; i < NF; i+=2){printf "%9s %8s %2d %-8e %-8e\n", $1,$2,$3,$i,$(i+1)}}' data.txt Code:
05/02/05 17:40:40 1 8.532625e+02 -1.200000e+02 Cheers, Tink |
It does work, but is there a way to output each of the same frequencies from each line to its own file? So that I start with

date, time, duration, freq1, power1, freq2, power2, freq3, power3, ... freq20, power20

and I end up with 20 files, each like:

File 1: date, time, duration, freq1, power1
File 2: date, time, duration, freq2, power2
File 3: date, time, duration, freq3, power3

and so on?

PS: I fully intend to "donate" (i.e. $) for your time; you've been a big help so far. |
From your description, you may want to use awk to extract the information.
Awk is better at handling and processing files consisting of records. It also has a BEGIN block for preprocessing and an END block for post-processing. But mostly, it's because a particular field can be selected easily, such as

{ print $3 log($4) $5 log($4) }

Using sed, you would need to use the pattern of the line to be able to output selected fields to a file.
Code:
1~28s/\(<pattern matching common part>\)\(<freq1 data1 pattern>\)\(<freq2 data2 pattern>\)\(<freq3 data3 pattern>\)/\1 \2/w freq1_file
Because subsequent sed commands would operate on the changed line, you need to first save the line read, and then before each additional substitution, restore the original line from the hold space.
Code:
1~28{ |
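A fuller sketch of that hold-space approach, assuming just two frequency/power pairs per line; the field patterns and the output file names `freq1_file`/`freq2_file` are illustrative, not from the thread:

```shell
# Two sample lines in the analyzer's format; with the 1~28 address
# only the first line is processed here
printf '%s\n' \
  '05/02/05,17:40:40,1,853.26250,-120,853.83750,-85' \
  '05/02/05,17:40:40,2,853.26250,-120,853.83750,-85' > data.txt

# h saves the original line to the hold space; g restores it before the
# next substitution, so each s///w carves a different freq/power pair
# out of the same unmodified line (GNU sed for the 1~28 address)
sed -n '1~28{
h
s/^\([^,]*,[^,]*,[^,]*\),\([^,]*,[^,]*\).*/\1,\2/w freq1_file
g
s/^\([^,]*,[^,]*,[^,]*\),[^,]*,[^,]*,\([^,]*,[^,]*\).*/\1,\2/w freq2_file
}' data.txt
```

Each `s///w file` writes the substituted line to its file only when the substitution matched, which is what routes each pair to its own output.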
That change is next to trivial :)
Code:
awk -F, '{for(i=4; i < NF; i+=2){name+=1; printf "%9s %8s %2d %-8e %-8e\n", $1,$2,$3,$i,$(i+1) > name}}' data.txt Cheers, Tink |
Tinkster's reply was submitted while I was writing mine. As you can see, my sed example was not trivial; that is why awk may be better.

The difference is due to being able to access fields of each line using $<field_number>, whereas with sed I used the grouping operators \(, \) and backreferences \1, \2, \3 to store and replace parts. Another difference is that with sed, what is written is the result of the replacement, so saving the original line in the hold space is necessary to be able to perform a different substitution for the next frequency/data fields. However, since you are working on a very large dataset, you might try both; one approach may be faster than the other.

A totally different approach would be to read each line into a bash array variable. This might be even faster because you are not executing an external program. However, since the per-line loop is contained inside the sed or awk program anyway, it wouldn't save that much time.

One feature of awk hasn't been mentioned yet: awk handles floating-point variables. Together with the built-in arithmetic functions (including trig and log functions), you could for example read in a configuration file in the BEGIN portion containing scaling information for each sensor, then use those values to convert/normalize the raw data from each sensor. This might speed up the post-processing phase. Imagine writing your awk program as a filter to split and process your data in real time. Sounds neat! |
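To make that BEGIN-block idea concrete, here is a hedged sketch. The file names `scale.conf` and `data.txt` and the "frequency → calibration offset" format are invented for the example; the real scaling scheme would depend on the sensors:

```shell
# Illustrative config: frequency -> calibration offset in dB (made up)
cat > scale.conf <<'EOF'
853.26250 3.0
853.83750 -1.5
EOF

cat > data.txt <<'EOF'
05/02/05,17:40:40,1,853.26250,-120,853.83750,-85
EOF

# Read the offsets once in BEGIN, then apply them to every power
# reading while splitting each line into date,time,freq,power tuples
awk -F, '
BEGIN {
    while ((getline line < "scale.conf") > 0) {
        split(line, f, " ")
        off[f[1]] = f[2]
    }
}
{
    for (i = 4; i < NF; i += 2)
        printf "%s,%s,%s,%.1f\n", $1, $2, $i, $(i+1) + off[$i]
}' data.txt
```

The lookup `off[$i]` works because the frequency field text matches the first column of the config file exactly; the arithmetic happens in awk's floating point, as described above.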
Tink,

OK, that was my fault; I didn't explain it clearly enough. I should have said that I want there to be 20 files in the end, each one containing a list of all the scans of one frequency over a period of time, like:

File 1:
date, time, duration, freq1, power1
date, time2, duration2, freq1, power1
date, time3, duration3, freq1, power1

File 2:
date, time, duration, freq2, power2
date, time2, duration2, freq2, power2
date, time3, duration3, freq2, power2

File 3:
date, time, duration, freq3, power3
date, time2, duration2, freq3, power3
date, time3, duration3, freq3, power3

and so on. The script you posted works except that it creates a new file for every single line item, which isn't going to work for my final calculations.

Here are some real lines of data from the SA:

05/02/05,17:40:41,22,853.26250,-120,853.83750,-84,854.51250,-120,855.21250,-120,855.71250,-120,868.60000,-120,868.91250,-121,869.00000,-121,867.00000,-121,868.00000,-121
05/02/05,17:40:53,345,853.26250,-120,853.83750,-83,854.51250,-120,855.21250,-120,855.71250,-120,868.60000,-121,868.91250,-121,869.00000,-121,867.00000,-121,868.00000,-121
05/02/05,17:40:54,372,853.26250,-120,853.83750,-82,854.51250,-120,855.21250,-120,855.71250,-120,868.60000,-121,868.91250,-121,869.00000,-121,867.00000,-121,868.00000,-121
05/02/05,17:40:55,399,853.26250,-120,853.83750,-83,854.51250,-120,855.21250,-119,855.71250,-120,868.60000,-121,868.91250,-119,869.00000,-121,867.00000,-121,868.00000,-121
05/02/05,17:40:56,427,853.26250,-120,853.83750,-84,854.51250,-120,855.21250,-120,855.71250,-120,868.60000,-121,868.91250,-121,869.00000,-121,867.00000,-121,868.00000,-121

So each line is made up of date, time, duration, freq1, power1, freq2, power2, and so on. This particular example has 10 frequencies (e.g. 853.26250) and 10 matching powers for those frequencies (e.g. -120); every line in the files I'm getting is like this.

What I need to do at this point is take the corresponding frequencies in each line, like 853.26250, and put each into a file along with its time and date. For this example I would end up with:

File 1:
05/02/05,17:40:41,853.26250,-120
05/02/05,17:40:53,853.26250,-120
05/02/05,17:40:54,853.26250,-120
05/02/05,17:40:55,853.26250,-120

File 2:
05/02/05,17:40:41,853.83750,-84
05/02/05,17:40:53,853.83750,-83
05/02/05,17:40:54,853.83750,-82
05/02/05,17:40:55,853.83750,-83

and so on. I know it's a lot to ask, but you seem like a total guru. |
Code:
awk -F, '{for(i=4; i < NF; i+=2){printf "%9s %8s %2d %-8f %-8f\n", $1, $2,$3,$i,$(i+1) >> (i/2-1) }} ' file And I don't know about the guru, but thanks ;} Cheers, Tink |
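If the bare numeric file names (1, 2, ...) are awkward for the later processing, a small variant produces the comma-separated layout asked for above. The `freqN.txt` naming and the sample data file are my own choices, not from the thread:

```shell
# Fresh start, since >> appends across runs
rm -f freq1.txt freq2.txt

# Two real-format lines, trimmed to two freq/power pairs each
cat > data.txt <<'EOF'
05/02/05,17:40:41,22,853.26250,-120,853.83750,-84
05/02/05,17:40:53,345,853.26250,-120,853.83750,-83
EOF

# Pair at fields (4,5) goes to freq1.txt, (6,7) to freq2.txt, etc.;
# awk keeps each output file open, so >> appends line after line
awk -F, '{ for (i = 4; i < NF; i += 2)
               print $1 "," $2 "," $i "," $(i+1) >> ("freq" (i/2 - 1) ".txt")
         }' data.txt
```

Each output file then accumulates one `date,time,freq,power` line per input line, which matches the File 1 / File 2 layout requested earlier in the thread.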
Thanks, I'm going to try it later; I need some sleep, since I have to be back at work in 6.5 hrs. It's been a long day of meetings, then programming and test-set design. I'll let you know in the morning when I'm logged back into everything.
Thanks again for your help. :) - Oracle111122 |