perl script to parse following format

suomali · 09-22-2008, 02:19 PM

I am a newbie to Linux and Perl. I want to parse a file in the following format:

***A***
a#
b#
c#
a#
c#

***B***
a#
b#
a#

***C***
c#
b#
c#
a#

I want to know how to grep the lines in section ***A*** that start with a#. I am not quite sure whether we can first extract the paragraph that starts with "***A***" and ends with "***B***", and then extract the a# lines using the grep command. Could anyone please help me with this? Thanks.

suomali · 09-22-2008, 03:53 PM

Could anyone help me with this? Thanks.

CRC123 · 09-22-2008, 03:59 PM

If you promise this is not homework you were supposed to do.

The man pages for grep have the answer you're looking for:

Code:

man grep

Look for -A -B and -C options.

chrism01 · 09-22-2008, 06:41 PM

If you want a Perl soln, first show what you've tried so far..

suomali · 09-23-2008, 11:01 AM

Sure. It's not a hw assignment; someone shared his file, and I want to extract some statistical information from it and cross-correlate it with my data. I am not very familiar with Linux, but I did try grep with -A, -B and -C. The problem with these options is that the target line is not at a fixed offset: it may be the 2nd, 3rd or even 100th line before or after the ***A*** header, so I don't think I can use grep. Maybe I am wrong. I also looked into Perl, which I have no idea about:

open(FH, "filename");

$flag = 0;
while(<FH>)
{
if( /^\***A\***/ )
{
$flag = ("***A***");
}
elsif( $flag && /^a#/)
{
print $_;
}
}

close FH;

Poor programming and it doesn't work.....

keefaz · 09-23-2008, 12:05 PM

You have to escape all the *s I think
or use something like: /^\*+A\*+/
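
For comparison, here is a minimal sketch of the same escaping idea using Python's re module, where "*" is a quantifier just as it is in Perl. The names header, loose, exact and is_header_a are made up for illustration. In Perl, quotemeta (or \Q...\E inside the pattern) plays the role that re.escape plays here.

```python
import re

header = "***A***"

# The pattern suggested above: each literal asterisk is escaped,
# and "+" means "one or more of them".
loose = re.compile(r"^\*+A\*+")

# Alternative (an assumption, not from the thread): let re.escape()
# do the escaping so the header text can be used verbatim.
exact = re.compile("^" + re.escape(header) + "$")


def is_header_a(line):
    """Return True when the line is exactly the ***A*** section header."""
    return exact.match(line.rstrip("\n")) is not None
```

The loose pattern accepts any number of asterisks around the A, while the exact one accepts only the header as written.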

chrism01 · 09-23-2008, 07:18 PM

Ok good start. If the file is laid out as you say and you only want the matches between ***A*** and ***B***, I'd use a string comparison for that bit.

Always use the warnings and strict options as exemplified by the first 2 lines here. They'll save you a lot of stress.

Code:

#!/usr/bin/perl -w

use strict;             # Enforce declarations

my (
    $in_rec, $section
    );

$section = 0;
open(IN, "<", "t.t") or die "Unable to open t.t: $!\n";
while ( defined ( $in_rec = <IN> ) )
{
    chomp($in_rec);
    if( $in_rec eq '***A***' )
    {
        $section = 1;
    }
    elsif( $in_rec eq '***B***' )
    {
        last;
    }
    elsif( $section == 1 && $in_rec =~ /^a#/ )    # anchored: only lines that start with a#
    {
        print "$in_rec\n";
    }
}
close(IN) or die "Unable to close t.t: $!\n";

Bookmark/read these
http://perldoc.perl.org/
http://www.perlmonks.org/?node=Tutorials

ghostdog74 · 09-23-2008, 09:10 PM

Quote:

Originally Posted by suomali

I am a newbie to Linux and perl. I want to parse a file in following format

***A***
a#
b#
c#
a#
c#

***B***
a#
b#
a#

***C***
c#
b#
c#
a#

I want to know how to grep lines in section ***A***, and start with a#. I am not quite sure whether we can first extract paragraphs start with "***A***" and end with "***B***". And then extract a# by using grep command. Could anyone please help me with this? Thanks.

If you have Python and can use it, here's an alternative:

Code:

flag = 0    # set to 1 while we are inside section ***A***
for line in open("file"):
    if line.startswith("***B"):
        flag = 0
    if line.startswith("***A"):
        flag = 1
    if flag:
        line = line.strip()
        if line.startswith("a#"):
            print(line)

sundialsvcs · 09-24-2008, 12:40 AM

Another handy tool for cases like this is awk, which was certainly one of the inspirations for Perl.

awk "programs" are very simple: a pattern (usually a regular expression), followed by a block of statements that are executed for each input line that matches it. There are also two special patterns: BEGIN, whose block runs before the first line is read, and END, whose block runs after the last one. A block with no pattern at all runs for every line.

"And that's it."

So how can we use "it?" Well, there are basically three types of lines in your file:

A line like "***A***", which will match the regular-expression /^\*{3}([A-Z])\*{3}$/.
A line like "a#", which will match /^([a-z]+)\#$/.
A blank or empty line.

...

A "funny chicken-scratch" like /^\*{3}([A-Z])\*{3}$/ is actually an incredibly-powerful thing, because it can not only match a particular string, but it can also extract information out of it, which you can then use in your awk-program. Let me break-down this regular-expression...

The expression begins and ends with a forward-slash, "/".
The "^" at the beginning of the expression anchors the match to the start-of-the-line, and the "$" at the end anchors it also to the end-of-line. In other words, we want to match only lines that consist of what matches this pattern, not that merely contain it.
When I want to refer to a literal character, such as "*", that might have other meanings, I prefix the character with an "escape" ... a backslash.
A term such as \*{3} means: "exactly three occurrences of the literal character, '*'."
When I want to extract something, so that I can see exactly what text matched a particular part of the pattern, I enclose that part in "("parentheses")".
The part that I want to extract, in this case, is "a single character in the inclusive range upper-case-'A' thru upper-case-'Z'."

The expression /^([a-z]+)\#$/ uses "+" which means one-or-more but-at-least-one occurrence of... It uses parentheses to capture the text that precedes that '#' character.

awk provides a simple but very-serviceable programming language that you can use within the various blocks that are executed when the various patterns match. You can define variables to hold, for example, the "captured" parts of one string so that you can include them inside another. You can use "if"-statements to "do things only when it makes sense to do so."
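
To make the pattern-then-action idea concrete, here is a minimal sketch written in Python rather than awk, so it is an illustration of the approach and not an awk program. It uses the two regular expressions described above; the names HEADER, ITEM, items_by_section and sample are made up for this example.

```python
import re

HEADER = re.compile(r"^\*{3}([A-Z])\*{3}$")  # e.g. ***A***, captures "A"
ITEM = re.compile(r"^([a-z]+)#$")            # e.g. a#, captures "a"


def items_by_section(lines):
    """Group the captured item names under the section letter they appear in."""
    sections = {}
    current = None
    for raw in lines:
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            # "Action" for a header line: remember which section we are in.
            current = header.group(1)
            sections[current] = []
            continue
        item = ITEM.match(line)
        if item and current is not None:
            # "Action" for an item line: store the captured name.
            sections[current].append(item.group(1))
        # Blank lines match neither pattern and are ignored.
    return sections


sample = """***A***
a#
b#
c#
a#
c#

***B***
a#
b#
a#
""".splitlines()

result = items_by_section(sample)
print(result["A"].count("a"))
```

Each of the three kinds of line gets its own branch, and the parentheses in the patterns are what make the section letter and the item name available to the code.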

archtoad6 · 09-24-2008, 09:03 AM

Nice exposition, sundialsvcs.

Quote:

Originally Posted by suomali

I am a newbie to ... perl.
. . .
I want to know how to grep lines in section ***A***, and start with a#.

Is the point of your Q to learn something about Perl, or do you just want to solve the problem? Both are valid goals. I can't help you w/ Perl, but I can offer some advice about grep/sed/awk solutions.

Since you mention grep, here are some grep-sed ideas.
(Note, F is the name of your file in all of the following):

Code:

cat $F  |\
sed -rn '/^\*{3}A\*{3}$/,/^\*{3}B\*{3}$/p'  | grep '^a#'

Unfortunately, this is visually complicated, unclear, ugly, inelegant; but it is short, & it does work.

The problem is that the asterisk is a special character in both shell globbing (wildcards) & regex's (regular expressions). I see 2 ways to clarify the code: dump the asterisks w/ tr, or put them in short variables:

Code:

## Using tr:
cat $F  | tr '*' '_'  |\
sed -rn '/^___A___$/,/^___B___$/p'  | grep '^a#'

Code:

## Using variables: 
A='^\*{3}A\*{3}$'
B='^\*{3}B\*{3}$'
C='^\*{3}C\*{3}$'
cat $F  | sed -rn "/$A/,/$B/p"  | grep '^a#'
## N.B. the change to double quotes in the sed expression

Making the variables solution still more general -- you can give any letter range:

Code:

for LL in {A..C}    # LL==LabelLetter
   do eval $LL='^\\*{3}'"$LL"'\\*{3}$'
done
cat $F  | sed -rn "/$A/,/$B/p"  | grep '^a#'
## N.B. still double quotes in the sed expression

You cannot further generalize this by changing "C" in the brace expansion to a variable; bash doesn't allow that. You must change "C" to the new end of the range. For a longer explanation, do a case sensitive search for "Brace Expansion" in the bash man page.
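
If the section labels have to come from variables, a general-purpose language avoids the issue. Below is a rough Python sketch of what the sed range /start/,/stop/p followed by grep '^a#' does; the function name lines_between and the sample data are assumptions made for illustration, and it compares whole lines instead of using regular expressions.

```python
def lines_between(lines, start, stop):
    """Yield lines from the one equal to start through the one equal to stop."""
    inside = False
    for raw in lines:
        line = raw.rstrip("\n")
        if inside:
            yield line
            if line == stop:
                inside = False
        elif line == start:
            inside = True
            yield line


sample = [
    "***A***", "a#", "b#", "c#", "a#", "c#", "",
    "***B***", "a#", "b#", "a#", "",
    "***C***", "c#", "b#", "c#", "a#",
]

first, last = "A", "B"  # any labels, taken from variables
hits = [
    line
    for line in lines_between(sample, "***%s***" % first, "***%s***" % last)
    if line.startswith("a#")
]
print(hits)
```

As with sed, both the start and the stop line are part of the range, which is harmless here because neither begins with a#.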

keefaz · 09-24-2008, 10:01 AM

The Perl Haters Community grows beyond the imagination limits

I mean each time there is an OP who wants to parse a file with perl, there is always someone who suggests using another tool like awk or python. That must stop now!

Sorry, just kidding

theNbomr · 09-24-2008, 01:51 PM

Given that the file is divided into sections by clearly distinguishable delimiters (the blank lines), it seems prudent to use these as the record separators, and let Perl do the work of separating records. Setting the input record separator $/ to the empty string puts Perl in paragraph mode:

Code:

$/ = "";

Now, you can iterate over each section. It seems logical that the original poster's file may actually extend beyond paragraph '***C***', way out to paragraph '***Z***', maybe even beyond. We'd like to avoid proliferating 'if - elsif' clauses.

Code:

while(<>){
   if( $_ =~ m/A/ ){

       #  We've found the record of interest, split it into fields, by line
       my @fields = split /\n/, $_;
       foreach $field( @fields ){
          if( $field =~ m/a/ ){
              print "$field\n";
          }
       }
   }
}

It seems reasonable that one might want to change the record of interest to '***G***' or '***W***' some day, and from here it is easy to make that a commandline parameter. If we expect that there is some connection between the '***A***', and the 'a#', we can accommodate that in a generalized way, too.

Code:

#! /usr/bin/perl -w
use strict;

$/ = "";

open(LQSUOMALI, shift ) or die "Cannot open data file: $!\n";
while(<LQSUOMALI>){
   if( $_ =~ m/$ARGV[0]/ ){

       #  We've found the record of interest, so split it into fields, by line
       my @fields = split /\n/, $_;
       foreach my $field( @fields ){
          my $fieldName = lc($ARGV[0] );
          if( $field =~ m/$fieldName/ ){
              print "$field\n";
          }
       }
   }
}
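
For readers who want to see the paragraph-mode idea outside Perl, here is a small Python sketch under some assumptions: records are separated by one or more empty lines, the header is the first line of its record, and the field of interest is the lower-cased section letter followed by '#'. The names paragraphs and fields_in_section are invented for this example. Unlike the Perl above, it compares the header line exactly instead of matching a pattern anywhere in the record.

```python
import re


def paragraphs(text):
    """Split text into records separated by one or more empty lines."""
    return [p for p in re.split(r"\n{2,}", text.strip("\n")) if p]


def fields_in_section(text, letter):
    """Return the lines of section ***letter*** that start with letter.lower() + '#'."""
    header = "***%s***" % letter
    prefix = letter.lower() + "#"
    hits = []
    for record in paragraphs(text):
        lines = record.split("\n")
        if lines[0].strip() == header:
            hits.extend(line for line in lines[1:] if line.startswith(prefix))
    return hits


sample = """***A***
a#
b#
c#
a#
c#

***B***
a#
b#
a#

***C***
c#
b#
c#
a#
"""

print(fields_in_section(sample, "A"))
```

Changing the section of interest is then just a matter of passing a different letter.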

My perl offering.

--- rod.