LinuxQuestions.org
Old 10-27-2009, 07:20 AM   #1
wakatana
Member
 
Registered: Jul 2009
Location: Slovakia
Posts: 133

Rep: Reputation: 16
parse text between html


Hi gurus, is there an elegant way to remove paired HTML tags together with the text inside them?
Or to remove just the text inside the tags and keep the tags themselves? (I can strip the tags afterwards, so that would not be a problem.)

For example:

Code:
<tag>text to be removed with or without tags</tag>
I tried this regular expression:
Code:
<[^>]*>[^<]*</[^>]*>
That works fine until I have nested tags:

Code:
<tag><nested>text to be removed with or without tags</nested></tag>
That only matches the string inside <nested>, not the whole <tag> element.
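
The behaviour described above can be reproduced with a short sketch. It uses Python's `re` module rather than sed, purely for illustration; the pattern is the one from the post and the sample strings are shortened versions of the examples.

```python
import re

# The pattern from the post: an opening tag, text containing no "<",
# then a closing tag.
PATTERN = re.compile(r"<[^>]*>[^<]*</[^>]*>")

flat = "<tag>text to be removed</tag>"
nested = "<tag><nested>text to be removed</nested></tag>"

flat_result = PATTERN.sub("", flat)
nested_result = PATTERN.sub("", nested)
nested_match = PATTERN.search(nested).group(0)

print(repr(flat_result))    # the flat element is removed entirely
print(repr(nested_result))  # the outer <tag></tag> pair is left behind
print(nested_match)         # only the innermost element matched
```

Because `[^<]*` cannot cross another `<`, the match can never start at the outer `<tag>`; it always lands on the innermost element.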


I think using sed's back-references to remember "<tag>" and then match "</tag>" could be the way, but I am not sure whether a back-reference can be used in the match part of the expression or only in the replacement. Something like this:

Code:
sed -n 's/<\([^>]*>\)[^<]*<\/\1//gp'
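
A back-reference can appear in the match part of a pattern, not only in the replacement. The sketch below shows the idea with Python's `re` module instead of sed. The pattern and sample strings are made up for illustration, and it only handles tags without attributes. It copes with differently named nested tags but still breaks when a tag is nested inside a tag of the same name.

```python
import re

# \1 must repeat whatever tag name group 1 captured.
# re.DOTALL lets ".*?" span line breaks inside the element.
PAIRED = re.compile(r"<(\w+)>.*?</\1>", re.DOTALL)

nested = "<tag><nested>text</nested></tag> keep me"
same_name = "<div><div>inner</div> tail</div> keep me"

nested_result = PAIRED.sub("", nested)
same_name_result = PAIRED.sub("", same_name)

print(repr(nested_result))     # whole <tag> element removed
print(repr(same_name_result))  # stops at the first </div>, leaving debris
```

The second result shows why a regular expression alone is not enough: the lazy match ends at the first `</div>` it sees, so part of the outer element survives.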

PS: Just to be clear, <br /> tags should not be treated as paired tags, because that would remove a lot of text. (I know <br> is not a paired tag; as a first step, <br> could also be replaced with a placeholder such as $$$$$.)

Sorry, I don't have a Linux box at hand so I can't test it, but I hope you understand what I am looking for. Thank you.
 
Old 10-27-2009, 07:45 AM   #2
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 59
You have already learned the key lesson: don't parse HTML (or any similar markup or data format that allows arbitrarily nested items) with regular expressions. Use a proper parser designed for the specific format you're parsing (HTML, XML, whatever).

There are loads of good HTML parsers for the three big scripting languages (Perl, Ruby and Python). Don't reinvent the wheel if you don't have to and definitely don't try to force this through sed.
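
As a rough illustration of the parser approach, here is a minimal sketch using `html.parser` from Python's standard library. The class name `PairedTagStripper`, the helper `strip_paired`, and the `VOID_TAGS` list are assumptions made for this example, not something from the thread. It keeps only the text that sits outside every paired element and treats `<br>` and similar tags as non-pairing.

```python
from html.parser import HTMLParser

# Assumption: tags that never have a closing partner; extend as needed.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PairedTagStripper(HTMLParser):
    """Collect only the text that sits outside every paired element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0   # number of paired elements currently open
        self.kept = []   # text fragments seen at depth 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.kept.append(data)


def strip_paired(html):
    """Return html with paired elements and their contents removed."""
    parser = PairedTagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.kept)


sample = "before <tag><nested>x</nested> y</tag>after<br /> end"
print(strip_paired(sample))  # prints "before after end"
```

Tracking a nesting depth is what the regular expression could not do: text is kept only while no paired element is open, no matter how deeply the tags nest.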
 
Old 10-27-2009, 08:36 AM   #3
wakatana
Member
 
Registered: Jul 2009
Location: Slovakia
Posts: 133

Original Poster
Rep: Reputation: 16
Thanks, can you post your favorite HTML parser?
 
Old 10-27-2009, 08:46 AM   #4
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
Quote:
Originally Posted by wakatana
Thanks, can you post your favorite HTML parser?
Look up "Perl HTML parser"; there are a number of interrelated ones.
 
Old 10-27-2009, 09:12 AM   #5
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 59
For Perl, HTML::Parser is very robust. I recommend it highly. If you know Ruby, hpricot is very popular, though I haven't used it myself.
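
For the other variant asked about in the first post (keep the tags, drop the text inside them), an event-driven parser can do that as well. The sketch below uses Python's standard `html.parser` rather than the Perl or Ruby libraries named above. The names `TextInsideTagsRemover`, `remove_inner_text`, and `VOID_TAGS` are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser

# Assumption: tags that never have a closing partner; extend as needed.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TextInsideTagsRemover(HTMLParser):
    """Re-emit every tag but drop text nested inside paired tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as written.
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br />: emit it, depth unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)


def remove_inner_text(html):
    """Return html with text inside paired tags removed, tags kept."""
    parser = TextInsideTagsRemover()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = "<tag><nested>text</nested></tag> kept<br />more"
print(remove_inner_text(sample))
```

For the sample above this prints `<tag><nested></nested></tag> kept<br />more`; the leftover tags can then be stripped in a second pass if desired.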
 
  

