Parse text between HTML tags
Hi gurus, is there an elegant way to get rid of paired HTML tags together with the text inside them?
Or to remove just the text inside the tags and preserve the tags themselves? (I can remove the tags afterwards, so that would not be a problem.) For example:
Code:
<tag>text to be removed with or without tags</tag>
Code:
<[^>]*>[^<]*</[^>]*>
Code:
<tag><nested>text to be removed with or without tags</nested></tag>
I think using sed's back-references to remember "<tag>" and then match "</tag>" could be the way, but I am not sure whether a back-reference can be used in the match part and not only in the replacement. Something like this:
Code:
sed -n 's/<\([^>]*>\)[^<]*<\/\1//gp'
PS: Just to be clear, <br /> tags should not be treated as a pair, because that would remove a lot of text. (I know <br> is not a paired tag; it could also be replaced with a placeholder such as $$$$$ in a first step.) Sorry, I don't have a Linux box so I can't test it, but I hope you understand what I am looking for. Thank you |
You have already learned the key lesson: don't parse HTML (or similar markup or data structures that allow arbitrarily nested items) with regular expressions. Use a proper parser designed for the specific format you're parsing (HTML, XML, whatever).
There are loads of good HTML parsers for the three big scripting languages (Perl, Ruby and Python). Don't reinvent the wheel if you don't have to and definitely don't try to force this through sed. |
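To make the point concrete, here is a small sketch of how the "any tag, text, any closing tag" pattern from the question behaves. It is written in Python only for illustration (the thread itself is about sed), and `strip_pairs` is a made-up helper name. Applying the pattern repeatedly does cope with simple nesting, but it never checks that the opening and closing tags match, so a lone <br> gets paired with the next closing tag.

```python
import re

# The pattern from the question: any tag, text without '<', any closing tag.
PAIR = re.compile(r"<[^>]*>[^<]*</[^>]*>")


def strip_pairs(html: str) -> str:
    """Apply the pattern repeatedly until the string stops changing."""
    previous = None
    while previous != html:
        previous = html
        html = PAIR.sub("", html)
    return html


nested = "<tag><nested>text</nested></tag>"
with_br = "<p>keep this<br>second line</p>"

# Simple nesting collapses from the inside out.
print(repr(strip_pairs(nested)))
# <br> is treated as an opening tag and paired with </p>,
# which leaves a dangling <p> and removes only part of the text.
print(repr(strip_pairs(with_br)))
```

The second case is the <br /> problem mentioned in the question: the regex has no notion of which tags belong together.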
Thanks, can you recommend your favorite HTML parser?
|
Quote:
Perl HTML parser - there are a number of interrelated ones. |
For Perl, HTML::Parser is very robust. I recommend it highly. If you know Ruby, hpricot is very popular, though I haven't used it myself.
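For anyone who wants to see what the parser approach looks like, below is a minimal sketch. It uses Python's standard-library `html.parser` rather than the Perl or Ruby modules named above, simply because it needs no extra install; that swap is my assumption, not something recommended in this thread. The names `TagTextStripper`, `strip_tag_text` and `VOID` are hypothetical, and the list of non-pairing tags is deliberately incomplete. The idea is to track nesting depth and drop any text seen while inside a paired tag.

```python
from html.parser import HTMLParser

# Tags that never have a closing partner (partial list, an assumption).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TagTextStripper(HTMLParser):
    """Drop text found inside paired tags; optionally keep the tags."""

    def __init__(self, keep_tags=True):
        super().__init__(convert_charrefs=True)
        self.keep_tags = keep_tags
        self.depth = 0   # how many paired tags are currently open
        self.out = []    # pieces of the result

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.depth += 1
        if self.keep_tags:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br />: never changes the depth.
        if self.keep_tags:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.depth = max(0, self.depth - 1)
        if self.keep_tags:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep only text that is outside every paired tag.
        if self.depth == 0:
            self.out.append(data)


def strip_tag_text(html, keep_tags=True):
    parser = TagTextStripper(keep_tags)
    parser.feed(html)
    parser.close()  # flush any text still buffered
    return "".join(parser.out)


sample = "a <tag><nested>gone</nested></tag> b<br />c"
print(strip_tag_text(sample))
print(strip_tag_text(sample, keep_tags=False))
```

With `keep_tags=True` the tags survive and only the enclosed text goes; with `keep_tags=False` both go. Text following a <br /> is left alone because void tags do not change the depth.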
|