LinuxQuestions.org


wakatana 10-27-2009 06:20 AM

parse text between html
 
Hi gurus, is there an elegant way to get rid of paired HTML tags together with the text inside them?
Or to just remove the text inside the tags and preserve the HTML tags? (I can remove the tags afterwards, so that would not be a problem.)

For example:

Code:

<tag>text to be removed with or without tags</tag>
I tried this regular expression:
Code:

<[^>]*>[^<]*</[^>]*>
That works fine until I have nested tags:

Code:

<tag><nested>text to be removed with or without tags</nested></tag>
That only matches the string inside <nested>, not the whole <tag> element.
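The behaviour described above can be reproduced with a short Python sketch. Python is used here only for illustration, and the sample strings and variable names are made up for the example; the pattern itself is the one from the post.

```python
import re

# The pattern from the post: an opening tag, text containing no "<",
# then a closing tag.
PAIR = re.compile(r"<[^>]*>[^<]*</[^>]*>")

flat = "keep <tag>text to be removed</tag> keep"
nested = "keep <tag><nested>text to be removed</nested></tag> keep"

flat_result = PAIR.sub("", flat)
nested_result = PAIR.sub("", nested)

print(repr(flat_result))    # the whole <tag>...</tag> pair is gone
print(repr(nested_result))  # only the inner <nested> pair is gone
```

In the nested case the match cannot start at <tag>, because the next character after it is another "<", so the outer pair is left behind as an empty <tag></tag>.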


I think using sed's back-references to remember "<tag>" and then match "</tag>" could be the way, but I am not sure whether a back-reference can be used in the match part or only in the replacement. Something like this:

Code:

sed -n 's/<\([^>]*>\)[^<]*<\/\1//gp'

PS: Just to be clear, <br /> tags should not be treated as pairs, because that would remove a lot of text. (I know <br> is not a paired tag; as a first step, <br> could also be replaced with a placeholder such as $$$$$.)

Sorry, I don't have a Linux box at hand, so I can't test it, but I hope you understand what I am looking for. Thank you.

Telemachos 10-27-2009 06:45 AM

You have already learned the key lesson: don't parse HTML (or similar markup languages or data structures that allow for arbitrarily nested items) with regular expressions. You want to use a proper parser designed for the specific format you're parsing (HTML, XML, whatever).

There are loads of good HTML parsers for the three big scripting languages (Perl, Ruby and Python). Don't reinvent the wheel if you don't have to and definitely don't try to force this through sed.
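As a rough illustration of the parser-based approach, here is a minimal sketch using `html.parser` from Python's standard library. This is one possible way to do it, not a recommendation from the thread: the names `PairedTagStripper`, `strip_paired`, and `VOID_TAGS` are hypothetical, the list of unpaired tags is an assumption, and keeping top-level `<br>` tags as markers is a guess at what the original poster wants.

```python
from html.parser import HTMLParser

# Tags that never have a closing partner. Assumed list; extend as needed.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PairedTagStripper(HTMLParser):
    """Collect only the text that sits outside every paired element."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of currently open paired tags
        self.kept = []   # pieces of output text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self.depth == 0:
                self.kept.append(f"<{tag}>")  # keep top-level <br> as a marker
        else:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.kept.append(data)


def strip_paired(html):
    parser = PairedTagStripper()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.kept)


sample = "before<br />middle <tag><nested>gone</nested> also gone</tag> after"
print(repr(strip_paired(sample)))
```

Because the parser tracks nesting depth instead of matching text, nested pairs are handled the same way as flat ones. One caveat of this sketch: a paired tag that is never closed in the input (an unclosed <p>, say) would swallow all text after it.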

wakatana 10-27-2009 07:36 AM

Thanks, can you paste your favorite HTML parser?

Sergei Steshenko 10-27-2009 07:46 AM

Quote:

Originally Posted by wakatana (Post 3734031)
Thanks, can you paste your favorite HTML parser?

Look up "Perl HTML parser"; there are a number of interrelated ones.

Telemachos 10-27-2009 08:12 AM

For Perl, HTML::Parser is very robust. I recommend it highly. If you know Ruby, hpricot is very popular, though I haven't used it myself.


All times are GMT -5.