Perl: file read strategy

R00ts · 06-30-2006, 02:37 PM

I have a file that looks similar to the following:

Code:

> K value
  25
> Iterations
  10
> Data properties:
  [0]: 502 194 102
  [1]: 9234  50  1899
  [2]: 95   908   145

What I need to do is use each line that begins with ">" to determine which variable(s) should store the data on the following lines (those that do not begin with ">"). Note that it is perfectly valid for a ">" line to be followed by multiple lines of data, not just a single line.

I've got a really messy (but functional) implementation right now that uses a while loop on the file handle and checks whether the line begins with ">"; if so, it sets a condition code. On the next pass through the loop, if the condition code is set, it matches the line against the data it expects to see for that condition code and writes the data to the appropriate variable. But I think there has to be a better way to do this.

(I'm not a Perl guru by any means.) The current code is very cumbersome, especially when adding a new condition code, and furthermore this is an ongoing project, so the format of the data file may change. So, with what little Perl background I have, I thought of two ideas:

#1: Save any ">" line to a temporary loop variable when one is found, and use that to process subsequent data. (The down-side to this is that these files are rather large, and not all of the data that follows after every ">" line needs to be read...)

#2: When a ">" line is read, continue reading in subsequent lines of the file to get all of the data, without requiring the loop condition to be re-evaluated. (Disadvantage: this could be kind of dangerous since I'd be reading in lines of the file from within segments of the loop in addition to the loop itself...).

I also thought of using the "next", "redo", "continue", and "last" keywords to get around the condition code implementation I have currently, but from what I have read on perldoc.org this doesn't seem possible...

I know that TMTOWTDI, but I would like to know if any more experienced Perl programmers have a suggestion or recommendation on how to accomplish this task. I'm trying to learn more than just the basic operations of Perl as I work on this project, and I would rather have code that is easy to modify, understand, and maintain than keep supporting the hacked-up solution that I initially created. Thanks!

jlinkels · 06-30-2006, 02:47 PM

What does "TMTOWTDI" mean?

I am not a Perl programmer, so maybe you are asking about some very Perl-specific issues. I only know some general programming techniques.

But to me it appears that this is about designing an algorithm, not about the Perl language.

If I have a situation like this (maybe a bit more complicated) I would implement different states, like:

st_reading_kval
st_reading_nritr
st_reading_data

A state would be initialized by reading "> something" and ended when you encounter "> something_else". Once you are in a state, read and process as appropriate until the state ends.

In this way you can handle quite complicated files.

If it is not worth implementing a state machine, maybe you did just fine with your "messy" code.

Or use Lex & Yacc if you can spend 2 years studying how these work.

jlinkels
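For illustration, the state approach described above could look roughly like this in Perl. This is only a sketch: the state names come from the post, while `parse_states`, the `st_skip` state, and the `%state_for` table are hypothetical names introduced here, and the sample data is the one from the first post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map each "> label" line to a state name (state names follow the post above;
# the table itself is an assumption of this sketch).
my %state_for = (
    'K value'          => 'st_reading_kval',
    'Iterations'       => 'st_reading_nritr',
    'Data properties:' => 'st_reading_data',
);

# parse_states is a hypothetical helper: it reads lines from a file handle
# and returns the values collected in each state.
sub parse_states {
    my ($fh) = @_;
    my ( $state, $k_value, $iterations, @data ) = ('st_skip');

    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^>\s*(.*?)\s*$/ ) {
            # A label ends the current state and starts the next one;
            # unknown labels select a state that ignores their data.
            $state = exists $state_for{$1} ? $state_for{$1} : 'st_skip';
            next;
        }
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next if $line eq '';
        if    ( $state eq 'st_reading_kval' )  { $k_value    = $line }
        elsif ( $state eq 'st_reading_nritr' ) { $iterations = $line }
        elsif ( $state eq 'st_reading_data' )  { push @data, $line }
    }
    return ( $k_value, $iterations, \@data );
}

# Usage: parse the sample from an in-memory file handle.
my $sample = "> K value\n  25\n> Iterations\n  10\n"
           . "> Data properties:\n  [0]: 502 194 102\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my ( $k, $n, $rows ) = parse_states($fh);
print "K=$k iterations=$n rows=", scalar @$rows, "\n";
```

Adding a new section means one new table entry and one new `elsif` branch, which is where this layout starts to feel cumbersome as the file format grows.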

R00ts · 06-30-2006, 02:55 PM

Quote:

Originally Posted by jlinkels

What does "TMTOWTDI" mean?

There's More Than One Way To Do It.

It is Perl's mantra.

Quote:

Originally Posted by jlinkels

A state would be initialized by reading "> something" and ended when you encounter "> something_else". Once you are in a state, read and process as appropriate until the state ends.

This is exactly what my condition code does (just substitute "state" with "condition code"). Sure a state machine is a pretty decent design, but it is a very cumbersome implementation in Perl in this particular case.

And yes, this is more of a "how would you do this in Perl" question than a general design question.

Quote:

Originally Posted by jlinkels

Or use Lex & Yacc if you can spend 2 years studying how these work.
jlinkels

Actually I did use lex and yacc about 3 years ago, but I've completely forgotten them since then.

spirit receiver · 06-30-2006, 05:50 PM

How about this? It reads your data from STDIN. There are three subroutines, with references to them stored in a hash where the hash keys correspond to the ">" lines.

Code:

#!/usr/bin/perl -w

use strict;

my %handles;

$handles{'K value'} = sub {
  my $content = shift;
  print "I just got a K value of ",$content,".\n";
};

$handles{'Iterations'} = sub {
  my $content = shift;
  print "There will be $content iterations.\n";
};

$handles{'Data properties:'} = sub {
  my $content = shift;
  # \d+ rather than \d, so that row indices above 9 also match
  if( $content =~ /\s*\[(\d+)\]:\s+(\d+)\s+(\d+)\s+(\d+)/ ){
    print "The three arguments in line $1 are $2, $3, $4.\n";
  }
};

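my $current_handle;

while( <> ){
  chomp;
  # A "> label" line selects the handler for the data lines that follow.
  if( /^>\s*(.*?)\s*$/ ){
    $current_handle = $1;
    next;
  }
  # Skip data that appears before any label or under an unknown label.
  next unless defined $current_handle && exists $handles{$current_handle};
  s/^\s+//;   # drop the leading indentation of data lines
  $handles{$current_handle}->($_);
}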

R00ts · 07-02-2006, 11:00 PM

Thanks spirit receiver, that seems like a nifty solution.

I'm still a little curious though. For the record, is it possible to do something like this?

Code:

while (<FILE>) {
  if (m/^> some label/) {
    # read next line from FILE    
  }
  elsif (m/^> other label/) {
    while(1) {
      # read next line from FILE
      if (m/(\d+)/) {
        print "Read $1\n";
      } else {
        last;  # end the loop
      }
    }
  }
}

And of course, if any "read next line from FILE" statement fails because the EOF is reached, I would need to be able to detect that and abort/return from the subroutine. Is something like that possible/recommended in Perl?

spirit receiver · 07-03-2006, 02:51 AM

You'll run into trouble with that script: The while(1) loop will be finished once it retrieves a line that doesn't contain a digit. This line contains, say, "> some label". The script will continue with the outer loop, i.e. it will read the next line. This line will contain data, not a label, so it won't trigger any of the if clauses, and all subsequent lines will be ignored until the next label is reached.

homey · 07-03-2006, 07:48 AM

hi,

I wonder how to get a variable number of values into the print line, for situations where a line doesn't contain exactly three numbers.

Code:

> K value
  25
> Iterations
  10
> Data properties:
  [0]: 502 194 102
  [1]: 9234  50  1899 789
  [2]: 95   908   145 2567 456

Code:

$HANDLES{'Data properties:'} = sub {
  my $CONTENT = shift;
  # Capture the index, then everything after it, instead of exactly three numbers.
  if( $CONTENT =~ /\s*\[(\d+)\]:\s+(.*)/ ){
    print "The arguments in line $1 are $2.\n";
  }
};

spirit receiver · 07-03-2006, 09:27 AM

I'd suggest splitting the line and iterating over the resulting array:

Code:

foreach( split( ' ', $line )) {
  printf( "The next value is %d.\n", $_ ) if /^\d+$/;
}
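As a slightly fuller sketch of the same idea, the row index and the values can be separated first, so that a line may carry any number of values. `parse_row` is a hypothetical helper name, and the sample line is taken from the data posted above.

```perl
use strict;
use warnings;

# parse_row is a hypothetical helper: it turns "  [1]: 9234  50  1899 789"
# into the row index and a reference to the list of values.
sub parse_row {
    my ($line) = @_;
    # Capture the index in brackets, then everything after the colon.
    # Return an empty list if the line is not a data row.
    my ( $index, $rest ) = $line =~ /^\s*\[(\d+)\]:\s*(.*)$/
        or return;
    # split ' ' splits on runs of whitespace and ignores leading whitespace.
    my @values = grep { /^\d+$/ } split ' ', $rest;
    return ( $index, \@values );
}

my ( $index, $values ) = parse_row('  [2]: 95   908   145 2567 456');
printf "Line %d has %d values: %s\n",
    $index, scalar @$values, join( ', ', @$values );
```

Because the values end up in an array, the count comes from the array itself rather than from a second regular expression.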

homey · 07-03-2006, 10:22 AM

Thanks, I'll try that.
I was working on something like this...

Code:

$HANDLES{'Data properties:'} = sub {
  my $CONTENT = shift;
  # get the number of args in each line
  my $COUNT = () = $CONTENT =~ /\s\w+/g;
  if( $CONTENT =~ /\s*\[(\d+)\]:\s+(.*)/ ){
    print "The $COUNT arguments in line $1 are   $2\n";
  }
};

R00ts · 07-03-2006, 11:33 PM

Quote:

Originally Posted by spirit receiver

You'll run into trouble with that script: The while(1) loop will be finished once it retrieves a line that doesn't contain a digit. This line contains, say, "> some label". The script will continue with the outer loop, i.e. it will read the next line. This line will contain data, not a label, so it won't trigger any of the if clauses, and all subsequent lines will be ignored until the next label is reached.

Good point, I didn't think of that when I wrote that snippet. But I could easily just keep a $last_line variable that retains the last line read, couldn't I?

Even though I'd run into trouble with the script, is it possible to do? In other words, what I really wanted to ask was: is it possible to read (or "peek" at) the next line of a file inside a while(<FILE>) loop? (Even if it's not a good idea most of the time...)
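One way to get both behaviours is a one-line lookahead buffer, which is essentially the $last_line idea wrapped in a helper. The sketch below is an assumption-laden illustration rather than a standard Perl facility: `make_reader` and the two closures it returns are hypothetical names.

```perl
use strict;
use warnings;

# make_reader is a hypothetical helper: it wraps a file handle in two
# closures, one that returns the next line and one that peeks at it.
sub make_reader {
    my ($fh) = @_;
    my $buffered;    # holds a line that was peeked at but not yet consumed
    my $peek = sub {
        $buffered = <$fh> unless defined $buffered;
        return $buffered;    # undef at end of file
    };
    my $next = sub {
        my $line = $peek->();
        undef $buffered;     # consume the line
        return $line;
    };
    return ( $next, $peek );
}

# Usage: read the data lines of each section without swallowing the next label.
my $sample = "> Data properties:\n  [0]: 502 194 102\n"
           . "  [1]: 9234  50  1899\n> K value\n  25\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my ( $next, $peek ) = make_reader($fh);

while ( defined( my $line = $next->() ) ) {
    next unless $line =~ /^>\s*(.*?)\s*$/;
    my $label = $1;
    # Stop as soon as the next label is visible, leaving it for the outer loop.
    while ( defined( my $ahead = $peek->() ) ) {
        last if $ahead =~ /^>/;
        my $data = $next->();
        $data =~ s/^\s+|\s+$//g;
        print "$label: $data\n";
    }
}
```

Since the buffer lives in the program rather than in the file position, this also works when the input cannot be rewound.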

spirit receiver · 07-04-2006, 04:19 AM

I'm not sure if I understood your question. Do you want to use an inner loop to read from the file without affecting the position where the outer loop will continue in the next pass? Then you'll have to restore the current position for the file handle using tell and seek. But this will only work with ordinary files, not with STDIN, for example.
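A minimal sketch of the tell/seek approach, assuming a seekable handle such as an ordinary file (an in-memory file is used here only to keep the example self-contained). `peek_line` is a hypothetical helper name.

```perl
use strict;
use warnings;

# peek_line is a hypothetical helper: it reads one line and then moves the
# file handle back, so the next read returns the same line again.
sub peek_line {
    my ($fh) = @_;
    my $position = tell $fh;     # remember the current position
    return if $position < 0;     # tell failed: the handle is not seekable
    my $line = <$fh>;
    # Whence 0 means the offset is an absolute position in the file.
    seek $fh, $position, 0 or die "Cannot seek: $!";
    return $line;
}

my $sample = "> K value\n  25\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
while ( my $line = <$fh> ) {
    if ( $line =~ /^>/ ) {
        my $ahead = peek_line($fh);
        print "Label line is followed by: $ahead" if defined $ahead;
    }
    print "Outer loop read: $line";
}
```

The outer loop still sees every line, including the one that was peeked at.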

R00ts · 07-04-2006, 04:45 PM

Quote:

Originally Posted by spirit receiver

I'm not sure if I understood your question. Do you want to use an inner loop to read from the file without affecting the position where the outer loop will continue in the next pass? Then you'll have to restore the current position for the file handle using tell and seek. But this will only work with ordinary files, not with STDIN, for example.

No, not necessarily an inner loop. Let's just say I want to do something simple like this:

Code:

while (<FILE>) {
    if (m/^>/) {
      my $string = # read the next line of file here
    }
}

I haven't been able to find anything that tells me whether that is possible or not (I'm not necessarily going to use it, I'm just incredibly curious at this point). If such a "read next line" operation exists without having to do some tedious tell/seek work, and it is called inside the if statement, which line would the while loop get on its next iteration:

1) The same line that was previously read in the if statement?
2) The line that follows after the line that was previously read in the if statement?

Thanks once again.

spirit receiver · 07-04-2006, 06:10 PM

It will read the next line, i.e. 2). Each time you read from a file handle, its current position changes, no matter where that read takes place. Therefore, if you wanted 1) to happen, you'd have to store the current position using tell before reading in the if statement, and restore it later with seek when you leave the if statement.

Edit: Maybe you're also asking how reading from the file in the if statement could be done? Simply by using "my $string = <FILE>;".
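Put together with an end-of-file check, that could look like the sketch below. `read_pairs` is a hypothetical helper name; the only Perl-specific facts it relies on are that a line read returns undef at end of file and that every read advances the same handle.

```perl
use strict;
use warnings;

# read_pairs is a hypothetical helper: for every "> label" line it reads the
# following line with a second <$fh> and returns [label, value] pairs.
sub read_pairs {
    my ($fh) = @_;
    my @pairs;
    while ( defined( my $line = <$fh> ) ) {
        next unless $line =~ /^>\s*(.*?)\s*$/;
        my $label = $1;
        # This read advances the same handle, so the outer loop resumes
        # with the line after this one (case 2 in the question).
        my $string = <$fh>;
        last unless defined $string;    # end of file: nothing follows the label
        $string =~ s/^\s+|\s+$//g;
        push @pairs, [ $label, $string ];
    }
    return @pairs;
}

my $sample = "> K value\n  25\n> Iterations\n  10\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_->[0] = $_->[1]\n" for read_pairs($fh);
```

Note that this simple version takes whatever line follows a label as its value, so a label directly followed by another label would be misread.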

R00ts · 07-05-2006, 04:48 PM

Quote:

Originally Posted by spirit receiver

It will read the next line, i.e. 2). Each time you read from a file handle, its current position changes, no matter where that read takes place. Therefore, if you wanted 1) to happen, you'd have to store the current position using tell before reading in the if statement, and restore it later with seek when you leave the if statement.

Edit: Maybe you're also asking how reading from the file in the if statement could be done? Simply by using "my $string = <FILE>;".

That is exactly the answer that I was looking for. I knew it was something simple! Thank you 5x spirit receiver.

bigearsbilly · 07-06-2006, 05:59 AM

I have split it into a hash of keys and values.
try this:

Code:

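#!/usr/bin/perl -w
use strict;

local $/ = "\n>";   # read one "> label" record at a time

my %slurp;
while( my $record = <> ){
  chomp $record;             # drop the trailing "\n>" separator
  $record =~ s/^>?\s*//;     # drop the ">" that starts the first record
  my( $label, $data ) = split "\n", $record, 2;   # label line, then data lines
  $slurp{$label} = defined $data ? $data : '';
}
print "\n'$_' = \n$slurp{$_}" for sort keys %slurp;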