sed: one or more occurrences of a pattern

David the H. · 11-06-2012, 08:58 AM

You appear to be running into exactly the problem I warned you about. Regular expressions have a very hard time dealing with nested content. The kind used in sed, at least*, has no way to "look ahead" or to count nesting levels, so it cannot precisely determine which ending tag matches which starting one.

(*The perl-based regex flavor might be able to, since it has look-ahead, look-behind and recursive-pattern features built in, but none of that is supported by sed.)
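
To see the problem in miniature, here is a quick sketch on a made-up one-line snippet (an assumption for illustration, not your actual file). Both a greedy match and a "stop at the next tag" match return the wrong thing as soon as a div is nested inside the one you want:

```shell
# Made-up sample: the "entry" div contains a nested div and is followed by another div.
html='<div class="entry">a<div>b</div>c</div><div>d</div>'

# Greedy .* runs to the LAST </div> on the line, swallowing too much.
greedy=$(printf '%s\n' "$html" | sed -n 's/.*<div class="entry">\(.*\)<\/div>.*/\1/p')

# [^<]* stops at the first tag of any kind, capturing too little.
cautious=$(printf '%s\n' "$html" | sed -n 's/.*<div class="entry">\([^<]*\)<.*/\1/p')

printf 'greedy:   %s\n' "$greedy"
printf 'cautious: %s\n' "$cautious"
```

What you actually want here is `a<div>b</div>c`, and neither expression can get there without knowing how deep the nesting goes.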

xmlstarlet, however, has one more trick up its sleeve for you. It can convert XML into a line-based format called PYX, which puts every start tag, attribute, text node and end tag on its own line and is specifically designed to make parsing data easier with tools like grep and sed.

Code:

xmlstarlet fo -H -Q -R file.html | xmlstarlet pyx | sed -n '/^Aclass entry/,/^)div/ { /^-/ s///p }'

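The address range runs from the line holding the class="entry" attribute to the next closing div, and inside it every text line (the ones starting with "-") is printed with that marker stripped off. As I hope you can see, it usually makes your sed expressions much cleaner and easier to write, although for maximum benefit you do need to know how to do multi-line editing.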
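
If you want to experiment without xmlstarlet in the loop, here is a small sketch that feeds hand-written PYX into the same sed expression. The sample lines are my own guess at what a simple page would look like in this format, not real output from your file:

```shell
# Hand-written PYX sample (an assumption for illustration, not real xmlstarlet output).
# "(" opens an element, "A" is an attribute, "-" is text, ")" closes an element.
pyx_sample='(html
(body
(div
Aclass entry
-First entry
(b
-bold text
)b
)div
(div
Aclass footer
-Not wanted
)div
)body
)html'

# Same sed expression as above: inside the range, print text lines minus the leading "-".
entry_text=$(printf '%s\n' "$pyx_sample" | sed -n '/^Aclass entry/,/^)div/ { /^-/ s///p }')
printf '%s\n' "$entry_text"
```

Only the text between the "Aclass entry" line and the first ")div" comes out; the footer div is never inside the range.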

There can still be a few issues with matching nested tags, but since they are now cleanly spread out over multiple lines rather than potentially squashed up, it becomes more a question of setting up proper address ranges than of building complex regexes.
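
As a rough sketch of what that looks like when a div is nested inside the one you want: the plain range stops at the first ")div" it sees, so the tail of the outer div is lost. One way around it is to count depth, and for that I have swapped in awk here rather than sed, simply because counting is easier there. The sample data is again hand-written and hypothetical:

```shell
# Hand-written PYX with a nested div inside the target div (illustrative assumption).
nested_pyx='(div
Aclass entry
-outer start
(div
-inner
)div
-outer end
)div
(div
-unrelated
)div'

# The plain sed range ends at the inner ")div", so "outer end" is missed.
range_text=$(printf '%s\n' "$nested_pyx" | sed -n '/^Aclass entry/,/^)div/ { /^-/ s///p }')

# awk variant: start at depth 1 on the attribute line, go up on "(div",
# down on ")div", and stop when the depth returns to 0.
depth_text=$(printf '%s\n' "$nested_pyx" | awk '
  !inside && /^Aclass entry/ { inside = 1; depth = 1; next }
  inside && /^\(div$/        { depth++ }
  inside && /^\)div$/        { depth--; if (depth == 0) inside = 0 }
  inside && /^-/             { print substr($0, 2) }
')

printf 'sed range:\n%s\n' "$range_text"
printf 'awk depth count:\n%s\n' "$depth_text"
```

Because each tag sits on its own line, the depth counter only has to look at whole lines, which is the kind of bookkeeping a single regex cannot do.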

It also appears to be a bit more robust than using the XML parser directly: it doesn't give up completely if there are minor syntax errors.

Here are a few useful sed references. See the first one in particular for more on how to use its multi-line features:
http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/grabbag/
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt