LinuxQuestions.org
Old 05-02-2012, 05:14 AM   #1
georgi
LQ Newbie
 
Registered: Apr 2012
Location: Bulgaria
Distribution: openSuSE
Posts: 15

Rep: Reputation: Disabled
Remove html tags with particular string inside


Hi everybody,

I would like to remove some tags from the "head" of multiple html documents across the web site. They look like

Code:
<link rel="alternate" type="application/rss+xml"
    title="Business and Investment in the Philippines" href="http://mydomain.com/rss/business.rss">
or
Code:
<link rel="alternate" type="text/x-opml"
    title="OPML all News needs from mydomain.com" href="http://mydomain.com/rss/god.opml">
The two strings that are always present are

Code:
<link rel="alternate" type="text/x-opml"
and
Code:
<link rel="alternate" type="application/rss+xml"
The rest of the text within <link ...> may vary. My goal is to remove all <link> tags containing the two strings above.

Any help is greatly appreciated.
 
Old 05-02-2012, 05:44 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,253

Rep: Reputation: 2686
And what have you tried?
 
Old 05-02-2012, 08:01 AM   #3
georgi
LQ Newbie
 
Registered: Apr 2012
Location: Bulgaria
Distribution: openSuSE
Posts: 15

Original Poster
Rep: Reputation: Disabled
Code:
sed 's/<link rel="alternate" type="application/rss.*>//g' *.html
 
Old 05-03-2012, 02:29 AM   #4
georgi
LQ Newbie
 
Registered: Apr 2012
Location: Bulgaria
Distribution: openSuSE
Posts: 15

Original Poster
Rep: Reputation: Disabled
Solution

I found the following command, which does almost everything I needed:

Code:
sed -i.bak 's/<link rel="alternate"[^>]*>//g' *.htm
 
Old 05-03-2012, 09:46 AM   #5
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
The problem with the above is that sed is line-based, and HTML is not. Your commands will only work if the whole tag sits on a single line. On top of that, HTML has nested tags, which regular expressions handle poorly.

Let's look at a quick example:

Code:
cat file.html
<html>
<body>

<a href="www.example.com">
This is a link to <i>example.com></i>
</a>

</body>
</html>
If we run (a modified version of) the above:

Code:
$ sed 's/<a [^>]*>//g' file.html
<html>
<body>

This is a link to <i>example.com></i>
</a>

</body>
</html>
Only the opening tag on the first line is removed; the link text and the closing </a> remain.
OK, so let's use a more robust multi-line expression:

Code:
$ sed '\|<a | { :x ; \|</a>|! { N ; bx } ; s|<a.*</a>|| }' file.html
<html>
<body>



</body>
</html>
So far so good. It leaves a few extra blank lines behind, but those can be cleaned up later if needed.
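
The same join-lines-with-N idea can be applied to the <link> tags from the original question. What follows is only a sketch under assumptions: the function name strip_feed_links is made up for illustration, a sed that accepts -E is assumed (GNU or BSD), and the attributes are assumed to appear in exactly the order shown in the first post. Because <link> has no closing tag, the single-character terminator "[^>]*>" is enough here.

```shell
# Hypothetical helper (name invented for this sketch).
# Prints the file named by $1 with the two kinds of feed <link> tags removed.
#
# :join ... bjoin  keeps appending the next line (N) for as long as the
#                  pattern space ends inside an unclosed "<link ..." tag.
# s#...##g         then deletes tags whose type is application/rss+xml
#                  or text/x-opml; "#" is the delimiter so "/" needs no escaping.
strip_feed_links() {
    sed -E \
        -e ':join' \
        -e '/<link[^>]*$/{' \
        -e 'N' \
        -e 'bjoin' \
        -e '}' \
        -e 's#<link rel="alternate" type="(application/rss\+xml|text/x-opml)"[^>]*>##g' \
        "$1"
}
```

Called as strip_feed_links index.html, it writes the result to standard output and leaves the file untouched, so the output can be checked before anything is overwritten. It still depends on the markup being as regular as in the question.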

But what happens if we change the file up a little?

Code:
$ cat file.html
<html>
<body>

<a href="www.example.com">
This is a link to <i>example.com></i>
</a><a href="http://www.example2.com">This is a link to <i>example2.com</i></a>

</body>
</html>

$ sed '\|<a | { :x ; \|</a>|! { N ; bx } ; s|<a.*</a>|| }' file.html
<html>
<body>



</body>
</html>
Rut roh! The second link is lost as well. This is due to a weakness in the regex flavor sed uses: there's no way to stop the greediness of "*" when the end target is a multi-character string such as "</a>". And if you use a negated single-character class like "[^>]*", as before, it will either stop at the first tag it encounters or fail to match entirely.

With a bit of work, we might be able to pre-process the line, or even the whole file, to split tags more evenly, but that gets ever more complex. A Perl-style lookahead or non-greedy expression could handle it more easily, but sed doesn't support them.

The short of it is that unless the input is very regular and unvarying, and you tweak your expressions just right, line- and regex-based tools like sed just aren't safe for HTML/XML. You need a tool with a parser dedicated to reading those formats.
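
As one possible illustration of the parser route, the sketch below hands the file to the HTML parser in Python's standard library from a shell function. Everything about it is an assumption rather than a recommendation from this thread: python3 must be installed, and the names strip_feed_links_parsed, FeedLinkFinder, and FEED_TYPES are invented for the example. The parser reports each start tag with its attributes, whatever their order or line breaks, and the raw text of each matching tag is then cut out of the source.

```shell
# Hypothetical helper (all names invented for this sketch); assumes python3.
# Prints the file named by $1 with feed <link> tags removed.
strip_feed_links_parsed() {
    python3 -c '
import sys
from html.parser import HTMLParser

# The two type values named in the original question.
FEED_TYPES = {"application/rss+xml", "text/x-opml"}

class FeedLinkFinder(HTMLParser):
    """Collect the raw source text of every feed <link> tag."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "link"
                and attributes.get("rel") == "alternate"
                and attributes.get("type") in FEED_TYPES):
            # get_starttag_text() returns the tag exactly as written.
            self.raw_tags.append(self.get_starttag_text())

source = sys.stdin.read()
finder = FeedLinkFinder()
finder.feed(source)
# Remove each collected tag once, in document order.
for raw in finder.raw_tags:
    source = source.replace(raw, "", 1)
sys.stdout.write(source)
' < "$1"
}
```

Unlike the sed one-liners, this does not care whether the tag is on one line or how its attributes are ordered, and it leaves other rel="alternate" links alone. One caveat: removal is by raw text, so identical text appearing earlier inside a comment would be removed instead.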
 