I am a newbie to Linux and Perl. I want to parse a file in the following format:
***A***
a#
b#
c#
a#
c#
***B***
a#
b#
a#
***C***
c#
b#
c#
a#
I want to know how to grep the lines in section ***A*** that start with a#. I am not sure whether we can first extract the paragraph that starts with "***A***" and ends with "***B***", and then extract the a# lines with the grep command. Could anyone please help me with this? Thanks.
Sure. It's not a homework assignment: someone shared his file, and I want to extract some statistical information from it and cross-correlate it with my data. I am not very familiar with Linux, but I did try grep with -A, -B and -C. The problem with these options is that the target line is not at a fixed offset from the ***A*** header (it may be the 2nd, the 3rd or even the 100th line before or after it), so I don't think I can use grep that way. Maybe I am wrong. I also looked into Perl, which I have no idea of,
If you have Python and can use it, here's an alternative:
Code:
flag = 0  # becomes 1 while we are inside section ***A***
for line in open("file"):
    if line.startswith("***B"):
        flag = 0
    if line.startswith("***A"):
        flag = 1
    if flag:
        line = line.strip()
        if line.startswith("a#"):
            print(line)
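The loop above leaves section A only when it meets ***B***. If the section after A might carry a different label, one possible variation is to treat every ***X*** line as the start of a new section. This is only a sketch: the names HEADER, lines_in_section and sample are made up for illustration, and it assumes headers are capital letters wrapped in three asterisks.

```python
import re

# Assumption: a section header is one or more capital letters
# wrapped in three asterisks, e.g. ***A***.
HEADER = re.compile(r"^\*{3}([A-Z]+)\*{3}$")


def lines_in_section(lines, section, prefix):
    """Yield the stripped lines of `section` that start with `prefix`."""
    current = None  # label of the section we are in; None before any header
    for raw in lines:
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = header.group(1)  # any header switches section
        elif current == section and line.startswith(prefix):
            yield line


# The sample data from the first post.
sample = """***A***
a#
b#
c#
a#
c#
***B***
a#
b#
a#
***C***
c#
b#
c#
a#
""".splitlines()

for hit in lines_in_section(sample, "A", "a#"):
    print(hit)
```

With a real file you would pass the open file object instead of sample, e.g. lines_in_section(open("file"), "A", "a#").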
Another handy tool for cases like this is awk, which was certainly one of the inspirations for Perl.
awk "programs" are very simple: a pattern (usually a regular expression), followed by a block of statements that are executed for every line the pattern matches. There are also two special patterns: BEGIN, which is "matched" before the first line is read, and END, which is "matched" after the last one. A block with no pattern at all runs for every line.
"And that's it."
So how can we use "it?" Well, there are basically three types of lines in your file:
A line like "***A***", which will match the regular expression /^\*{3}([A-Z])\*{3}$/.
A line like "a#", which will match /^([a-z]+)\#$/.
A blank or empty line.
...
A "funny chicken-scratch" like /^\*{3}([A-Z])\*{3}$/ is actually an incredibly powerful thing, because it can not only match a particular string, but also extract information out of it, which you can then use in your awk program. Let me break down this regular expression...
The expression begins and ends with a forward-slash, "/".
The "^" at the beginning of the expression anchors the match to the start-of-the-line, and the "$" at the end anchors it also to the end-of-line. In other words, we want to match only lines that consist of what matches this pattern, not that merely contain it.
When I want to refer to a literal character, such as "*", that might have other meanings, I prefix the character with an "escape" ... a backslash.
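A term such as \*{3} means: "exactly three occurrences of the literal character, '*'." (Some awk versions need interval expressions like {3} switched on, for example older gawk with --re-interval.)
When I want to extract something, so that I can see exactly what text matched a particular part of the pattern, I enclose that part in "("parentheses")". (In awk itself the parentheses only group; reading the captured text back takes something like gawk's three-argument match(), or a tool such as Perl or sed.)
The part that I want to extract, in this case, is "a single character in the inclusive range upper-case 'A' thru upper-case 'Z'", written [A-Z].
The expression /^([a-z]+)\#$/ uses "+", which means "one or more occurrences of" the preceding term. It uses parentheses to capture the text that precedes the '#' character.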
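As a quick check of the breakdown above, the two patterns can be tried in Python, whose re module spells these particular constructs (anchors, \*, {3}, character ranges, + and capturing parentheses) the same way. Treat it as a sketch: classify is a made-up helper name, and the header pattern is written with both runs of asterisks escaped, which is what the description calls for.

```python
import re

HEADER = re.compile(r"^\*{3}([A-Z])\*{3}$")  # e.g. ***A***, captures "A"
ITEM = re.compile(r"^([a-z]+)#$")            # e.g. a#, captures "a"


def classify(line):
    """Return (kind, captured_text) for one line of the sample file."""
    m = HEADER.match(line)
    if m:
        return ("header", m.group(1))
    m = ITEM.match(line)
    if m:
        return ("item", m.group(1))
    if line.strip() == "":
        return ("blank", None)
    return ("other", None)


for line in ["***A***", "a#", "", "x***A***"]:
    print(repr(line), classify(line))
```

The last test line shows what the anchors buy you: "x***A***" contains the header text but does not consist of it, so it is not classified as a header.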
awk provides a simple but very-serviceable programming language that you can use within the various blocks that are executed when the various patterns match. You can define variables to hold, for example, the "captured" parts of one string so that you can include them inside another. You can use "if"-statements to "do things only when it makes sense to do so."
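To make the pattern/action idea concrete without depending on any particular awk version, here is a rough Python imitation of it: a table of (pattern, action) rules that is tried against every line, with a dictionary standing in for awk variables. It is not awk and makes no claim to match awk's semantics exactly; run_rules, enter_section, collect_item and the state keys are all invented for this sketch.

```python
import re


def run_rules(lines, rules, state):
    """Try every (pattern, action) rule against every line, awk-style."""
    for line in lines:
        line = line.rstrip("\n")
        for pattern, action in rules:
            m = re.search(pattern, line)
            if m:
                action(m, state)
    return state


def enter_section(m, state):
    # Remember which section header we saw last.
    state["section"] = m.group(1)


def collect_item(m, state):
    # Keep the line only when it is an "a" item inside section A.
    if state["section"] == "A" and m.group(1) == "a":
        state["hits"].append(m.group(0))


rules = [
    (r"^\*{3}([A-Z])\*{3}$", enter_section),
    (r"^([a-z]+)#$", collect_item),
]

sample = ["***A***", "a#", "b#", "c#", "a#", "c#", "***B***", "a#", "b#", "a#"]
result = run_rules(sample, rules, {"section": None, "hits": []})
print(result["hits"])
```

The "if" inside collect_item is the "do things only when it makes sense" part: the item pattern matches in every section, but the action only keeps the line while the remembered section is A.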
I am a newbie to ... perl.
. . .
I want to know how to grep lines in section ***A***, and start with a#.
Is the point of your Q to learn something about Perl, or do you just want to solve the problem? Both are valid goals. I can't help you w/ Perl, but I can offer some advice about grep/sed/awk solutions.
Since you mention grep, here are some grep-sed ideas.
(Note, F is the name of your file in all of the following):
Code:
cat $F |\
sed -rn '/^\*{3}A\*{3}$/,/^\*{3}B\*{3}$/p' | grep '^a#'
Unfortunately, this is visually complicated, unclear, ugly, inelegant; but it is short, & it does work.
The problem is that the asterisk is a special character in both shell globbing (wildcards) & regex's (regular expressions). I see 2 ways to clarify the code: dump the asterisks w/ tr, or put them in short variables:
Code:
## Using tr:
cat $F | tr '*' '_' |\
sed -rn '/^___A___$/,/^___B___$/p' | grep '^a#'
Code:
## Using variables:
A='^\*{3}A\*{3}$'
B='^\*{3}B\*{3}$'
C='^\*{3}C\*{3}$'
cat $F | sed -rn "/$A/,/$B/p" | grep '^a#'
## N.B. the change to double quotes in the sed expression
Making the variables solution still more general -- you can give any letter range:
Code:
for LL in {A..C} # LL==LabelLetter
do eval $LL='^\\*{3}'"$LL"'\\*{3}$'
done
cat $F | sed -rn "/$A/,/$B/p" | grep '^a#'
## N.B. still double quotes in the sed expression
You cannot further generalize this by changing "C" in the brace expansion to a variable; bash doesn't allow that, because brace expansion is performed before variable expansion. You must change "C" to the new end of the range. For a longer explanation, do a case-sensitive search for "Brace Expansion" in the bash man page.
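For comparison, in a language without brace expansion the end of the range can simply be a variable, and the asterisks can be escaped by a library call instead of by hand. A minimal Python sketch, where section_patterns is a hypothetical helper and headers are assumed to be single capital letters:

```python
import re
import string


def section_patterns(last_letter):
    """Map each letter from 'A' through `last_letter` to its header regex."""
    letters = string.ascii_uppercase
    end = letters.index(last_letter) + 1
    return {
        # re.escape puts the backslashes in front of the asterisks for us.
        letter: "^" + re.escape("***" + letter + "***") + "$"
        for letter in letters[:end]
    }


patterns = section_patterns("C")
print(patterns["A"])
print(bool(re.match(patterns["B"], "***B***")))
```

The resulting strings play the same role as $A, $B and $C above, and changing the range is a matter of passing a different letter.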
Last edited by archtoad6; 09-24-2008 at 09:10 AM.
Reason: add anchors, eliminate (accidentally) duplicated text
The Perl Haters Community grows beyond the limits of imagination
I mean, each time an OP wants to parse a file with Perl, there is always someone who suggests using another tool like awk or Python. That must stop now!
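Given that the file is divided into sections by clearly distinguishable delimiters, it seems prudent to use these as the record separators, and let Perl do the work of separating records. Setting the input record separator $/ to the empty string selects paragraph mode, in which blank lines separate the records (this assumes the sections in the real file are separated by blank lines):
Code:
$/ = "";   # paragraph mode; note this is $/ (input separator), not $\ (output separator)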
Now, you can iterate over each section. It seems logical that the original poster's file may actually extend beyond paragraph '***C***', way out to paragraph '***Z***', maybe even beyond. We'd like to avoid proliferating 'if - elsif' clauses.
Code:
while(<>){
    if( $_ =~ m/A/ ){
        # We've found the record of interest, split it into fields, by line
        my @fields = split /\n/, $_;
        foreach $field( @fields ){
            if( $field =~ m/a/ ){
                print "$field\n";
            }
        }
    }
}
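For readers following along in Python, the same record-at-a-time idea can be sketched as below. It rests on the same assumption as paragraph mode, namely that a blank line separates the sections of the real file. The names paragraphs and fields_in_record are invented here, and unlike the Perl loop this version looks for the record tag only in the first line of each record and treats whitespace-only lines as blank.

```python
import re


def paragraphs(text):
    """Split text into blank-line-separated records (roughly Perl's $/ = "")."""
    return [p for p in re.split(r"\n\s*\n", text.strip()) if p]


def fields_in_record(text, record_tag, field_tag):
    """Return lines containing `field_tag` from records headed by `record_tag`."""
    hits = []
    for record in paragraphs(text):
        lines = record.split("\n")
        if record_tag in lines[0]:  # the header line names the record
            hits.extend(line for line in lines[1:] if field_tag in line)
    return hits


# Assumed layout: the sample data with one blank line between sections.
sample = (
    "***A***\na#\nb#\nc#\na#\nc#\n\n"
    "***B***\na#\nb#\na#\n\n"
    "***C***\nc#\nb#\nc#\na#\n"
)
print(fields_in_record(sample, "A", "a"))
```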
It seems reasonable that one might want to change the record of interest to '***G***' or '***W***' some day, and from here it is easy to make that a commandline parameter. If we expect that there is some connection between the '***A***', and the 'a#', we can accommodate that in a generalized way, too.
Code:
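#! /usr/bin/perl -w
# Arguments: the data file first, then the section letter (e.g. A)
use strict;
$/ = "";    # paragraph mode: blank lines separate records
# shift removes the file name from @ARGV, so $ARGV[0] is now the section letter
open(LQSUOMALI, shift ) or die "Cannot open data file: $!\n";
while(<LQSUOMALI>){
    if( $_ =~ m/$ARGV[0]/ ){
        # We've found the record of interest, so split it into fields, by line
        my @fields = split /\n/, $_;
        foreach my $field( @fields ){
            my $fieldName = lc($ARGV[0] );
            if( $field =~ m/$fieldName/ ){
                print "$field\n";
            }
        }
    }
}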