You appear to be running into exactly the problem I warned you about. Regular expressions have a very hard time dealing with nested content. The kind used in sed, at least*, has no way to "look ahead" to precisely determine which ending tag matches which starting one.
(*The perl-based regex flavor might be able to, since it has look-ahead and look-behind features built into it, but that's not supported by sed.)
xmlstarlet, however, has one more trick up its sleeve for you. It can convert XML into a line-based format called pyx, which is specifically designed to make parsing data easier with tools like grep and sed.
Code:
xmlstarlet fo -H -Q -R file.html | xmlstarlet pyx | sed -n '/^Aclass entry/,/^)div/ { /^-/ s///p }'
As I hope you can see, it usually makes your sed expressions much cleaner and easier, although for maximum benefit you do need to know how to do multi-line editing.
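If you want to see what the sed part is doing without involving your real file, here is a small sketch. The pyx lines are a hand-written sample that I made up for illustration (not actual xmlstarlet output from your page), and the function names sample_pyx and extract_entry_text are likewise just placeholders. In pyx, each line starts with a one-character marker: "(" opens an element, "A" is an attribute, "-" is text, and ")" closes an element.

```shell
#!/bin/sh
# Hand-written pyx sample (hypothetical data, not real xmlstarlet output).
sample_pyx() {
cat <<'EOF'
(div
Aclass entry
-Hello
(span
-world
)span
)div
(div
Aclass other
-skip me
)div
EOF
}

# Within the range that starts at the "class entry" attribute and ends at
# the next closing div, print only the text lines, minus their "-" marker.
extract_entry_text() {
    sed -n '/^Aclass entry/,/^)div/ { /^-/ s/^-//p; }'
}

sample_pyx | extract_entry_text
```

This should print "Hello" and "world" on separate lines and leave out "skip me". I spelled out s/^-//p here instead of the shorter s///p; the empty pattern just reuses the last regex that was matched, so both forms do the same thing in this command.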
There can still be a few issues with matching nested tags, but since they are now cleanly spread out over multiple lines rather than potentially squashed up, it becomes more a question of setting up proper address ranges than of building complex regexes.
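To illustrate the nesting issue: a plain sed range ends at the first ")div" it meets, so a div nested inside the entry would cut the output short. One possible workaround is to count nesting depth. The sketch below swaps in awk for that, since counting is awkward in sed; it is my own illustration rather than anything xmlstarlet provides, and the sample data and the function names nested_pyx and extract_nested_entry are made up.

```shell
#!/bin/sh
# Hypothetical pyx sample with a div nested inside the "entry" div.
nested_pyx() {
cat <<'EOF'
(div
Aclass entry
-outer text
(div
-inner text
)div
-more outer text
)div
(div
-unrelated
)div
EOF
}

# Print text from the "entry" div up to its matching ")div",
# counting nested divs so an inner ")div" does not end the block early.
extract_nested_entry() {
    awk '
        /^\(div$/ {
            if (inside) depth++      # nested div inside the entry
            else candidate = 1       # may be the entry div; check its attributes
            next
        }
        /^A/ {
            if (candidate && $0 == "Aclass entry") { inside = 1; depth = 1 }
            next
        }
        { candidate = 0 }            # attribute lines only follow the opening tag
        /^\)div$/ {
            if (inside && --depth == 0) inside = 0
            next
        }
        inside && /^-/ { print substr($0, 2) }
    '
}

nested_pyx | extract_nested_entry
```

On this sample it should print "outer text", "inner text" and "more outer text", whereas the simple range /^Aclass entry/,/^)div/ would stop after "inner text".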
It also appears to be a bit more robust than using the XML parser directly, in that it doesn't hang up completely if there are minor syntax errors.
Here are a few useful sed references. See the first one in particular for more on how to use its multi-line features:
http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/grabbag/
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt