regex on multilines

Felipe · 10-10-2008, 02:01 AM

Hello,

I'm trying to do a regular expression search across multiple lines.

Example:

...
{
....
name = name1;
....
value =value1;
....
code = code1;
....
}

{
....
name = name2;
....
value =value2;
....
code = code1;
....
}
...
I have many structures like this.
If I search for, say, "name = name2", structure 2 should be returned (everything between { and }). If I search for "code = code1", both structures should be returned.

I found this sed command:
sed -n -e "/{/,/}/p" file
but it returns all of the structures.

With:
sed -e "/./{H;$!d;}" -e "x;/name = name2/!d;" file
I get what I'm looking for, but the records are separated by blank lines instead of { }.

Any ideas?

Thanks

jschiwal · 10-10-2008, 03:21 AM

Having a blank line between records makes things easier, because you can use a range ending in /^$/, which is common practice.
Using "{" and "}" as record delimiters makes things harder, because sed itself also uses braces to group commands.

Code:

jschiwal@qosmio:~> sed -n '/{/,/^$/{
                                     /{/,/name = name1;/H
                                     /name = name1/,/}/{
                                                         /name = name1/n;H;
                                                         /}/{g;p}
                                                       }
                                   }' testfile

{
....
name = name1;
....
value =value1;
....
code = code1;
....
}

/{/,/^$/ is a range of a single record.

There are two subranges which contain the match you are looking for:
/{/,/name = name1;/ is a subrange which includes the test, from the first line of the record to the match.
/name = name1/,/}/ is a subrange from the matching line to the end of the record.

Because the matching line falls in both subranges, the "n" command skips it the second time, and the H that follows holds the next line:
/name = name1/n;H;
/}/ matches the end of the record (the one with the match), so "g" gets the held lines and "p" prints them. Note that the two commands are grouped in braces so that both are executed on the same line.

Another approach could be to design a state machine using :labels and "b" or "t" branches.
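A minimal sketch of that label-and-branch idea, not the poster's own code. It assumes GNU sed, that each record's "{" and "}" sit alone at the start of a line, and a made-up sample file built in the same layout as the first post:

```shell
#!/bin/sh
# Hypothetical sample file in the layout from the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
...
{
....
name = name1;
....
code = code1;
....
}

{
....
name = name2;
....
code = code1;
....
}
...
EOF

# /^{/             : a record starts on a line beginning with "{"
# :a ... N ... ba  : loop, appending input lines until the buffer ends in "\n}"
# /name = name2;/p : print the collected record only if it contains the pattern
out=$(sed -n '/^{/{
:a
/\n}$/!{
N
ba
}
/name = name2;/p
}' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

Since the whole record sits in the pattern space before the test runs, changing the search only means changing the one pattern before "p".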

jschiwal · 10-10-2008, 04:01 AM

I noticed that I forgot to clear out the Hold buffer after the "}" character.

Code:

sed -n '/{/,/^$/{
                  /{/,/code = code1;/H
                  /code = code1/,/}/{
                                      /code = code1/n;H;
                                      /}/{ g;p }
                                      /}/{ s/.*//;x }
                                    }
                }' testfile

Felipe · 10-10-2008, 06:51 AM

YES, it works fine!

I will have to study how....

Thanks

Felipe

sundialsvcs · 10-10-2008, 08:56 AM

(Shrug...) "It's just me, but" I'd use awk for that.

An awk "program" is really drop-dead simple: it consists of one or more blocks that look like this:

Code:

  /pattern/ {
      # the opening brace must be on the same line as the pattern
      block of code to be executed if this pattern was matched,
      written in a "vaguely 'C'-like" language.
  }

There are also "pseudo-patterns," like BEGIN (which executes before the first line is read) and END (which executes after the last line has been read).

"And that's about it."

What I really like about it is that it's obviously designed for tasks just like the one you are facing. Furthermore, the solution remains easy to read and easy to change.

I mean... yes, I could "figure out" that sed script that you've shown, and a few weeks from now I could "figure it out all over again," but (a) I would indeed have to do that, and (b) it would require about the same amount of mental effort each time. And the same would be true if I had written it myself. Also, (c) if the input format subsequently changed, however slightly, I fear that I would have to figure it out all over again!

I therefore find that awk, and its "honorary big-brother" perl (which is a full-fledged programming language), are much more suited to these common tasks.

As the Perl community likes to say, TMTOWTDI = "there's more than one way to do it." That's especially true in Unix/Linux. It's definitely worth your while to spend some time poking around your system.
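For the question at the top of the thread, the pattern-plus-block structure described above might look something like the following. This is only a sketch under stated assumptions, not code from any poster: the sample file is made up, the variable names (want, rec, inrec) are illustrative, and each "{" and "}" is assumed to start its own line.

```shell
#!/bin/sh
# Hypothetical sample file in the layout from the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
...
{
....
name = name1;
....
code = code1;
....
}

{
....
name = name2;
....
code = code1;
....
}
...
EOF

# want is the literal text to look for; index() does a plain substring
# test, so no regex characters in it need escaping.
out=$(awk -v want='code = code1;' '
  /^[{]/ { rec = ""; inrec = 1 }       # opening brace: start a fresh record
  inrec  { rec = rec $0 "\n" }         # collect every line of the record
  /^[}]/ && inrec {                    # closing brace: decide whether to print
      if (index(rec, want)) printf "%s", rec
      inrec = 0
  }
' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

With want set to 'code = code1;' both sample records are printed; with 'name = name2;' only the second one should be.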

jschiwal · 10-10-2008, 11:52 PM

Feel free to submit your awk solution. The poster had tried sed. I was trying to show some things with sed that might be useful.

Use address pattern ranges to enclose the range to work with.
Use braces to enclose subranges.
Use braces to enclose a group of commands on a pattern. (This can prevent losing lines saved with the N command, etc.)
The end of the first subrange was the desired pattern.
The beginning of the second subrange was the desired pattern. Doing this, I prevented any actions on a record that didn't have a match.

The sed command looks worse than it might have. Braces were used to mark the boundaries of each record, and I couldn't swap them for different characters the way one can pick another delimiter in place of forward slashes. I did use indentation and newlines to show the boundaries better. Perhaps I should have annotated the program itself with comments. I worked it out in an interactive shell and cut and pasted it into the post. I did try to explain what I did, however: I used subranges joined by the desired pattern. Having to escape characters does make things messy. If one learns to look past them, it doesn't look as bad. But I agree that regular expressions are easier to write than to read.

All I needed to do to match the second pattern was substitute "code = code1". That's when I noticed my mistake.

Gawk is almost 5 times as large as sed. It works wonders for text databases in regular fields. I wouldn't have considered sed in that case.

Perl is 25 times the size of sed, not counting modules you might load. And the OP probably doesn't know it. Recommending that someone learn an entire language to solve one particular problem sounds a bit like an RTFM response. Don't get me wrong. I'm not blasting Perl, but to someone who isn't a proficient Perl programmer (and I'm not), say a Python or Ruby programmer, Perl can look a lot like the way you describe sed.

===

I confess that I could have taken more time to describe my strategy. The address range matches a record. Both of the subranges only match a part of a record with the desired line in them. As each line is read in, it is saved in the Hold register until the end of record when it is printed out.

An alternative method could have been to save each line of every record until the closing brace, then retrieve the multiline record from the hold space and test for the pattern: '{.*name = name1.*}'.
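A rough sketch of that alternative, assuming GNU sed, a made-up sample file, and records whose "{" and "}" each start a line. Here the test is simply the search text, applied to the whole record once it has been pulled back from the hold space:

```shell
#!/bin/sh
# Hypothetical sample file in the layout from the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
...
{
....
name = name1;
....
code = code1;
....
}

{
....
name = name2;
....
code = code1;
....
}
...
EOF

# /^{/,/^}/ : work only on lines that belong to a record
# h;d       : on the opening brace, overwrite the hold space and start the next cycle
# H         : append every other line of the record to the hold space
# x; /../p  : on the closing brace, fetch the whole record and print it on a match
out=$(sed -n '/^{/,/^}/{
/^{/{
h
d
}
H
/^}/{
x
/name = name1;/p
}
}' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

Because the hold space is overwritten at every opening brace, lines from a record that did not match are not carried over into the next one.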