LinuxQuestions.org
Old 09-22-2008, 02:19 PM   #1
suomali
LQ Newbie
 
Registered: Sep 2008
Posts: 3

Rep: Reputation: 0
perl script to parse following format


I am a newbie to Linux and Perl. I want to parse a file in the following format:

***A***
a#
b#
c#
a#
c#

***B***
a#
b#
a#

***C***
c#
b#
c#
a#

I want to know how to grep the lines in section ***A*** that start with a#. I am not sure whether we can first extract the paragraph that starts with "***A***" and ends with "***B***", and then extract the a# lines using the grep command. Could anyone please help me with this? Thanks.
 
Old 09-22-2008, 03:53 PM   #2
suomali
LQ Newbie
 
Registered: Sep 2008
Posts: 3

Original Poster
Rep: Reputation: 0
Could anyone help me with this? Thanks.
 
Old 09-22-2008, 03:59 PM   #3
CRC123
Member
 
Registered: Aug 2008
Distribution: opensuse, RHEL
Posts: 374
Blog Entries: 1

Rep: Reputation: 32
If you promise this is not homework you were supposed to do.

The man pages for grep have the answer you're looking for:
Code:
man grep
Look for -A -B and -C options.
 
Old 09-22-2008, 06:41 PM   #4
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,352

Rep: Reputation: 2751
If you want a Perl soln, first show what you've tried so far..
 
Old 09-23-2008, 11:01 AM   #5
suomali
LQ Newbie
 
Registered: Sep 2008
Posts: 3

Original Poster
Rep: Reputation: 0
Sure. It's not a homework assignment: someone shared his file, and I want to extract some statistical information from it and cross-correlate it with my data. I am not very familiar with Linux, but I did try grep with -A, -B and -C. The problem with these options is that the target line is not at a fixed offset; it may be the 2nd, 3rd or even 100th line before or after the ***A*** header, so I don't think I can use grep. Maybe I am wrong. I also looked into Perl, which I know very little about, and came up with this:

open(FH, "filename");

$flag = 0;
while(<FH>)
{
if( /^\***A\***/ )
{
$flag = ("***A***");
}
elsif( $flag && /^a#/)
{
print $_;
}
}

close FH;

Poor programming and it doesn't work.....
 
Old 09-23-2008, 12:05 PM   #6
keefaz
LQ Guru
 
Registered: Mar 2004
Distribution: Slackware
Posts: 6,552

Rep: Reputation: 872
You have to escape all the *s I think
or use something like: /^\*+A\*+/
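
The problem with the pattern in post #5 can be reproduced outside Perl. Here is a minimal sketch using Python's re module, which treats this case the same way as far as I know: in \*** only the first asterisk is escaped, so the two that follow are read as quantifiers stacked on top of each other. The helper name compiles() and the three sample lines are just for illustration.

```python
import re

def compiles(pattern):
    """Return True if the pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The pattern from post #5: only the first '*' of each run is escaped,
# so the following '*' characters are parsed as stacked quantifiers.
broken = r"^\***A\***"

# Two fixes: escape every asterisk, or use the shorter form suggested above.
escaped = r"^\*\*\*A\*\*\*"
short = r"^\*+A\*+"

lines = ["***A***", "a#", "***B***"]
headers = [ln for ln in lines if re.search(short, ln)]

print(compiles(broken))    # False: rejected as a "multiple repeat"
print(compiles(escaped))   # True
print(headers)             # ['***A***']
```

Either fixed pattern should work in the original script; the \*+ form is shorter but also accepts headers with a different number of asterisks.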
 
Old 09-23-2008, 07:18 PM   #7
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,352

Rep: Reputation: 2751
Ok good start. If the file is laid out as you say and you only want the matches between ***A*** and ***B***, I'd use a string comparison for that bit.

Always use the warnings and strict options as exemplified by the first 2 lines here. They'll save you a lot of stress.
Code:
#!/usr/bin/perl -w

use strict;             # Enforce declarations

my (
    $in_rec, $section
    );

$section = 0;
open(IN, "<", "t.t") or die "Unable to open t.t: $!\n";
while ( defined ( $in_rec = <IN> ) )
{
    chomp($in_rec);
    if( $in_rec eq '***A***' )
    {
        $section = 1;
    }
    elsif( $in_rec eq '***B***' )
    {
        last;
    }
    elsif( $section == 1 && $in_rec =~ /^a#/ )   # anchored: line must start with a#
    {
        print "$in_rec\n";
    }
}
close(IN) or die "Unable to close t.t: $!\n";
Bookmark/read these
http://perldoc.perl.org/
http://www.perlmonks.org/?node=Tutorials
 
Old 09-23-2008, 09:10 PM   #8
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by suomali
I am a newbie to Linux and Perl. I want to parse a file in the following format:

***A***
a#
b#
c#
a#
c#

***B***
a#
b#
a#

***C***
c#
b#
c#
a#

I want to know how to grep the lines in section ***A*** that start with a#. I am not sure whether we can first extract the paragraph that starts with "***A***" and ends with "***B***", and then extract the a# lines using the grep command. Could anyone please help me with this? Thanks.
if you have Python and can use it, here's an alternative:
Code:
flag = 0                         # becomes 1 while we are inside section ***A***
for lines in open("file"):
    if lines.startswith("***B"):
        flag = 0
    if lines.startswith("***A"):
        flag = 1
    if flag:
        lines = lines.strip()
        if lines.startswith("a#"):
            print(lines)
 
Old 09-24-2008, 12:40 AM   #9
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,636
Blog Entries: 4

Rep: Reputation: 3933
Another handy tool for cases like this is awk, which was certainly one of the inspirations for Perl.

awk "programs" are very simple: a regular expression (or: "string pattern"), followed by a block of statements that are to be executed when a line matches that particular pattern. There's also a special pattern, BEGIN, whose block runs before any input is read, and another, END, whose block runs after the last line. A block with no pattern at all runs for every line.

"And that's it."

So how can we use "it?" Well, there are basically three types of lines in your file:
  1. A line like "***A***", which will match the regular-expression /^\*{3}([A-Z])\*{3}$/.
  2. A line like "a#", which will match /^([a-z]+)\#$/.
  3. A blank or empty line.

...

A "funny chicken-scratch" like /^\*{3}([A-Z])\*{3}$/ is actually an incredibly-powerful thing, because it can not only match a particular string, but it can also extract information out of it, which you can then use in your program (in awk itself, captured groups are only exposed through extensions such as gawk's three-argument match() function). Let me break-down this regular-expression...
  1. The expression begins and ends with a forward-slash, "/".
  2. The "^" at the beginning of the expression anchors the match to the start-of-the-line, and the "$" at the end anchors it also to the end-of-line. In other words, we want to match only lines that consist of what matches this pattern, not that merely contain it.
  3. When I want to refer to a literal character, such as "*", that might have other meanings, I prefix the character with an "escape" ... a backslash.
  4. A term such as \*{3} means: "exactly three occurrences of the literal character, '*'."
  5. When I want to extract something, so that I can see exactly what text matched a particular part of the pattern, I enclose that part in "("parentheses")".
  6. The part that I want to extract, in this case, is "a single character in the inclusive range upper-case-'A' thru upper-case-'Z'."

The expression /^([a-z]+)\#$/ uses "+", which means "one or more occurrences of the preceding item." It uses parentheses to capture the text that precedes the '#' character.

awk provides a simple but very-serviceable programming language that you can use within the various blocks that are executed when the various patterns match. You can define variables to hold, for example, the "captured" parts of one string so that you can include them inside another. You can use "if"-statements to "do things only when it makes sense to do so."
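
To make the pattern-then-action idea concrete, here is a rough sketch of the same dispatch in Python rather than awk (Python's re module exposes the captured groups directly). The function name select(), its parameters and the embedded sample are assumptions made for the example; the two patterns are the ones described above, with '#' taken literally as in the first post.

```python
import re

# The two patterns described above, written for Python's re module.
HEADER = re.compile(r"^\*{3}([A-Z])\*{3}$")   # ***A***  -> captures "A"
ENTRY = re.compile(r"^([a-z]+)#$")            # a#       -> captures "a"

def select(lines, section, name):
    """Return the 'name#' lines that appear inside the given section."""
    current = None                 # letter of the section we are in, if any
    found = []
    for raw in lines:
        line = raw.strip()
        header = HEADER.match(line)
        if header:                 # first pattern fired: remember the section
            current = header.group(1)
            continue
        entry = ENTRY.match(line)
        if entry and current == section and entry.group(1) == name:
            found.append(line)     # second pattern fired inside our section
        # blank lines match neither pattern and are simply skipped
    return found

sample = """***A***
a#
b#
c#
a#
c#

***B***
a#
b#
a#

***C***
c#
b#
c#
a#
""".splitlines()

print(select(sample, "A", "a"))    # ['a#', 'a#']
```

Because the section letter is captured rather than hard-coded, the same loop handles ***C*** or any later section without extra branches.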
 
Old 09-24-2008, 09:03 AM   #10
archtoad6
Senior Member
 
Registered: Oct 2004
Location: Houston, TX (usa)
Distribution: MEPIS, Debian, Knoppix,
Posts: 4,727
Blog Entries: 15

Rep: Reputation: 234
Nice exposition, sundialsvcs.


Quote:
Originally Posted by suomali
I am a newbie to ... perl.
. . .
I want to know how to grep lines in section ***A***, and start with a#.
Is the point of your Q to learn something about Perl, or do you just want to solve the problem? Both are valid goals. I can't help you w/ Perl, but I can offer some advice about grep/sed/awk solutions.

Since you mention grep, here are some grep-sed ideas.
(Note, F is the name of your file in all of the following):
Code:
cat $F  |\
sed -rn '/^\*{3}A\*{3}$/,/^\*{3}B\*{3}$/p'  | grep '^a#'
Unfortunately, this is visually complicated, unclear, ugly, inelegant; but it is short, & it does work.

The problem is that the asterisk is a special character in both shell globbing (wildcards) & regex's (regular expressions). I see 2 ways to clarify the code: dump the asterisks w/ tr, or put them in short variables:
Code:
## Using tr:
cat $F  | tr '*' '_'  |\
sed -rn '/^___A___$/,/^___B___$/p'  | grep '^a#'
Code:
## Using variables: 
A='^\*{3}A\*{3}$'
B='^\*{3}B\*{3}$'
C='^\*{3}C\*{3}$'
cat $F  | sed -rn "/$A/,/$B/p"  | grep '^a#'
## N.B. the change to double quotes in the sed expression
Making the variables solution still more general -- you can give any letter range:
Code:
for LL in {A..C}    # LL==LabelLetter
   do eval $LL='^\\*{3}'"$LL"'\\*{3}$'
done
cat $F  | sed -rn "/$A/,/$B/p"  | grep '^a#'
## N.B. still double quotes in the sed expression
You cannot generalize this further by replacing "C" in the brace expansion with a variable: bash performs brace expansion before variable expansion, so the range has to be written out literally. To cover more sections, change "C" to the new end of the range. For a longer explanation, do a case-sensitive search for "Brace Expansion" in the bash man page.
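
For comparison, the sed range plus grep pipeline can be sketched in Python, where the section letters can be ordinary variables. This is only an illustration: sed_range() and marker() are made-up helper names, and the sample list stands in for the contents of $F.

```python
import re

def sed_range(lines, start, end):
    """Yield lines from a match of `start` through the next match of `end`,
    roughly like sed -n '/start/,/end/p'."""
    inside = False
    for line in lines:
        if not inside:
            if re.search(start, line):
                inside = True
                yield line          # the start line itself is printed
        else:
            yield line
            if re.search(end, line):
                inside = False      # the end line is printed, then the range closes

def marker(letter):
    """Build the anchored header regex for a section letter, e.g. ***A***."""
    return r"^\*{3}" + re.escape(letter) + r"\*{3}$"

sample = ["***A***", "a#", "b#", "c#", "a#", "c#", "",
          "***B***", "a#", "b#", "a#", "",
          "***C***", "c#", "b#", "c#", "a#"]

section = list(sed_range(sample, marker("A"), marker("B")))
hits = [ln for ln in section if ln.startswith("a#")]   # the grep '^a#' step

print(hits)    # ['a#', 'a#']
```

As with the sed version, the closing ***B*** header is part of the range, and the final filter is what drops it.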

Last edited by archtoad6; 09-24-2008 at 09:10 AM. Reason: add anchors, eliminate (accidentally) duplicated text
 
Old 09-24-2008, 10:01 AM   #11
keefaz
LQ Guru
 
Registered: Mar 2004
Distribution: Slackware
Posts: 6,552

Rep: Reputation: 872
The Perl Haters Community grows beyond the imagination limits
I mean, each time an OP wants to parse a file with Perl, there is always someone who suggests another tool like awk or Python. That must stop now!

Sorry, just kidding
 
Old 09-24-2008, 01:51 PM   #12
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
Given that the file is divided into sections by clearly distinguishable delimiters, it seems prudent to use these as the default record separators, and let Perl do the work of separating records:
Code:
$/ = "";    # input record separator: the empty string selects paragraph mode
Now, you can iterate over each section. It seems logical that the original poster's file may actually extend beyond paragraph '***C***', way out to paragraph '***Z***', maybe even beyond. We'd like to avoid proliferating 'if - elsif' clauses.
Code:
while(<>){
   if( $_ =~ m/A/ ){

       #  We've found the record of interest, split it into fields, by line
       my @fields = split /\n/, $_;
       foreach my $field( @fields ){
          if( $field =~ m/a/ ){
              print "$field\n";
          }
       }
   }
}
It seems reasonable that one might want to change the record of interest to '***G***' or '***W***' some day, and from here it is easy to make that a commandline parameter. If we expect that there is some connection between the '***A***', and the 'a#', we can accommodate that in a generalized way, too.
Code:
#! /usr/bin/perl -w
use strict;

$/ = "";

open(LQSUOMALI, shift ) or die "Cannot open data file: $!\n";
while(<LQSUOMALI>){
   if( $_ =~ m/$ARGV[0]/ ){

       #  We've found the record of interest, so split it into fields, by line
       my @fields = split /\n/, $_;
       foreach my $field( @fields ){
          my $fieldName = lc($ARGV[0] );
          if( $field =~ m/$fieldName/ ){
              print "$field\n";
          }
       }
   }
}
My perl offering.

--- rod.
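
For readers following the Python side of the thread, the paragraph-mode idea can be sketched like this. It is an approximation, not a translation: paragraphs() and fields_in_section() are hypothetical names, the split on runs of empty lines imitates what $/ = "" does, and the header is compared exactly instead of with a loose m/A/ match.

```python
import re

def paragraphs(text):
    """Split text into records separated by one or more empty lines."""
    return [p for p in re.split(r"\n{2,}", text.strip()) if p]

def fields_in_section(text, letter):
    """Return the lines of section ***letter*** that start with 'letter#' in lower case."""
    header = "***" + letter + "***"
    wanted = letter.lower() + "#"
    for record in paragraphs(text):
        lines = record.splitlines()
        if lines and lines[0].strip() == header:
            # Found the record of interest; its remaining lines are the fields.
            return [ln for ln in lines[1:] if ln.startswith(wanted)]
    return []       # no such section

sample = """***A***
a#
b#
c#
a#
c#

***B***
a#
b#
a#

***C***
c#
b#
c#
a#
"""

print(fields_in_section(sample, "A"))    # ['a#', 'a#']
print(fields_in_section(sample, "C"))    # ['c#', 'c#']
```

Tying the field name to the section letter follows the guess made above that '***A***' and 'a#' are connected; if they are not, the field name would need to be a separate parameter.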
 
  

