parse text between html

wakatana · 10-27-2009, 06:20 AM

Hi gurus, is there an elegant way to remove paired HTML tags together with the text inside them?
Or just to remove the text inside the tags and keep the tags themselves? (I can strip the tags afterwards, so that would not be a problem.)

For example:

Code:

<tag>text to be removed with or without tags</tag>

I tried this regular expression:

Code:

<[^>]*>[^<]*</[^>]*>

That works fine until I have nested tags:

Code:

<tag><nested>text to be removed with or without tags</nested></tag>

That only matches the inner <nested> element, not the whole <tag>.
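To make the failure concrete, here is a minimal sketch that runs the same pattern through Python's `re` module (Python is my choice for illustration, not something tested with sed). The variable names are made up for the example.

```python
import re

# The pattern from the post: an opening tag, text containing no "<",
# then a closing tag.
PATTERN = re.compile(r"<[^>]*>[^<]*</[^>]*>")

flat = "<tag>text to be removed with or without tags</tag>"
nested = "<tag><nested>text to be removed with or without tags</nested></tag>"

# Flat input: the whole element matches and is removed.
flat_result = PATTERN.sub("", flat)

# Nested input: "[^<]*" cannot cross the inner "<nested>", so only the
# innermost element matches and the outer pair is left behind.
nested_result = PATTERN.sub("", nested)

print(repr(flat_result))
print(repr(nested_result))
```

The outer `<tag>` pair survives because a single pass can only match an element whose content contains no further tags.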

I think using sed's back-references to remember "<tag>" and then match "</tag>" could be the way, but I am not sure whether a back-reference can be used in the match part or only in the replacement. Something like this:

Code:

sed -n 's/<\([^>]*>\)[^<]*<\/\1//gp'
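A back-reference can be used inside the match itself, at least in Python's `re`, which is what this rough sketch uses instead of sed. It is a variation on the idea above, not the exact sed expression: the hypothetical `PAIR` pattern captures only the tag name, so attributes do not break the match, and the hypothetical helper `strip_pairs` repeats the substitution until nothing changes, so nested pairs are peeled off from the inside out. It assumes well-formed, simple markup.

```python
import re

# Capture the tag name, allow attributes, then require a closing tag
# with the same name via the back-reference \1.
PAIR = re.compile(r"<(\w+)\b[^>]*>[^<]*</\1>")


def strip_pairs(html: str) -> str:
    """Remove innermost tag pairs repeatedly until no pair is left."""
    previous = None
    while previous != html:
        previous = html
        html = PAIR.sub("", html)
    return html


print(repr(strip_pairs("<tag><nested>text</nested></tag>")))
print(repr(strip_pairs("keep <b>drop</b> this")))
# An unpaired <br /> inside an element blocks the match, so nothing is removed.
print(repr(strip_pairs("<p>line one<br />line two</p>")))
```

The last call shows why unpaired tags such as `<br />` need separate handling: the surrounding element is never matched while the `<br />` is still inside it.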

PS: Just to be clear, <br /> tags should be left alone, because treating them as paired tags would remove a lot of text. (I know <br> is not a paired tag; as a first step it could also be replaced with a placeholder such as $$$$$.)

Sorry, I don't have a Linux box at hand, so I can't test it, but I hope you understand what I am looking for. Thank you.

Telemachos · 10-27-2009, 06:45 AM

You have already learned the key lesson: don't parse HTML (or any similar markup or data structure that allows arbitrarily nested items) with regular expressions. You want to use a proper parser designed for the specific format you're parsing (HTML, XML, whatever).

There are loads of good HTML parsers for the three big scripting languages (Perl, Ruby and Python). Don't reinvent the wheel if you don't have to and definitely don't try to force this through sed.
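As a rough illustration of the parser approach, here is a minimal sketch using Python's standard-library `html.parser`. The class name `OutsideTextCollector`, the function `strip_paired_elements`, and the `VOID_TAGS` set are made up for this example, and the list of unpaired tags is an assumption you would extend as needed. It keeps only the text that sits outside every paired element, however deeply the tags are nested.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag; they must not change the nesting depth.
# (Assumed, incomplete list for illustration.)
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class OutsideTextCollector(HTMLParser):
    """Collect only the text that is not inside any paired element."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many paired elements are currently open
        self.kept = []   # text chunks seen at depth 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.kept.append(data)


def strip_paired_elements(html):
    parser = OutsideTextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.kept)


sample = "keep <tag><nested>drop</nested> also drop</tag> this<br />too"
print(repr(strip_paired_elements(sample)))
```

Note that this sketch also drops the `<br />` tag itself from the output; if the tags should be preserved, the handlers would need to record them as well.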

wakatana · 10-27-2009, 07:36 AM

Thanks, can you paste your favorite HTML parser?

Sergei Steshenko · 10-27-2009, 07:46 AM

Quote:

Originally Posted by wakatana

Thanks, can you paste your favorite HTML parser?

Look up

Perl HTML parser

- there are a number of interrelated ones.

Telemachos · 10-27-2009, 08:12 AM

For Perl, HTML::Parser is very robust. I recommend it highly. If you know Ruby, Hpricot is very popular, though I haven't used it myself.