LinuxQuestions.org
03-16-2012, 12:55 PM   #1
Micky12345
Member
 
Registered: Feb 2012
Posts: 58

Rep: Reputation: Disabled
Using tr to strip away HTML tags


I have an HTML file:

<html>
<title>Hello</title>
</html>

I want to remove the tags from this file.

So I ran:
Code:
cat filename | tr -d '</>'
but it deletes only the characters <, > and /, not the words inside the tags.

For example, from the tag <html>, only < and > are removed; html remains as it is.

So I tried:
Code:
cat filename | tr '<[a-b]>' '[a-b]'
but it's not giving me the expected answer.

Can anyone help me solve this?
Thanks in advance.
 
03-16-2012, 01:12 PM   #2
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,500

Rep: Reputation: 415
Try this...
Code:
echo "<html>
<title>Hello</title>
</html>"  \
|sed -e 's/<[^>]*>//g'
Daniel B. Martin
 
1 members found this post helpful.
03-16-2012, 01:23 PM   #3
Andrew Benton
Senior Member
 
Registered: Aug 2003
Location: Birkenhead/Britain
Distribution: Linux From Scratch
Posts: 2,073

Rep: Reputation: 64
Code:
cat foo.html | w3m -dump -T text/html > foo.txt
 
03-16-2012, 04:33 PM   #4
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,396
Blog Entries: 2

Rep: Reputation: 908
The solution offered by Andrew Benton is a specific instance of a more general approach: using a proper HTML parser. Trying to manipulate HTML with basic command-line text tools is an exercise in futility. It takes a fair bit of code to create a robust XML/HTML parser, unless you can make a lot of significant assumptions (and we all know where that leads...).
Another solution that can be robust and complete is to use a ready-made Perl package such as HTML::Parser. Other languages such as Python, Ruby, etc., probably have equivalent packages.

--- rod.
 
1 members found this post helpful.
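To illustrate the kind of assumption that breaks a regex-based approach, here is a minimal sketch. The input is hypothetical (not from the original post): a tag split across two lines. sed works one line at a time, so the earlier substitution misses it. Joining the lines with tr first is one possible workaround, but it throws away the line structure, and it would still be fooled by things like a > inside an attribute value or a comment.

```shell
#!/bin/sh
# Hypothetical input: the <a ...> tag is split across two lines.
html='<p>Click <a
href="page.html">here</a></p>'

# sed applies the substitution to each line separately, so the
# split tag is never matched as a whole and leaves debris behind.
line_out=$(printf '%s\n' "$html" | sed -e 's/<[^>]*>//g')

# Possible workaround: turn newlines into spaces first, so every tag
# sits on one line, then strip tags and any trailing spaces.
joined_out=$(printf '%s\n' "$html" | tr '\n' ' ' \
    | sed -e 's/<[^>]*>//g' -e 's/ *$//')

printf 'line by line:\n%s\n\njoined first:\n%s\n' "$line_out" "$joined_out"
```

The line-by-line result still contains fragments of the <a> tag, while the joined version gives "Click here". A real parser (w3m, HTML::Parser) does not need this kind of patching.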
03-17-2012, 10:14 AM   #5
Micky12345
Member
 
Registered: Feb 2012
Posts: 58

Original Poster
Rep: Reputation: Disabled
Code:
sed -e 's/<[^>]*>//g' filename
worked for me.

But I want to do it using the tr command. Is that possible with tr?


Thanks, danielbmartin, for your reply.
 
03-17-2012, 02:03 PM   #6
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,396
Blog Entries: 2

Rep: Reputation: 908
tr operates on individual characters and classes of characters, not patterns. If you use tr to delete characters, it cannot distinguish those that are part of the tag from those that are part of the literal data. There is a reason that regular expressions are part of many text manipulation tools.
--- rod.
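A small sketch of that difference, using the file from the first post with a "/" added inside the text (that extra character is an assumption, there only to make the point visible). tr -d treats its argument as a set of characters and deletes each one wherever it occurs; sed matches the pattern "a <, then anything that is not a >, then a >".

```shell
#!/bin/sh
# Sample from the first post; the "/" in "Hello/World" is added for illustration.
html='<html>
<title>Hello/World</title>
</html>'

# tr -d removes every <, / and > in the input, inside tags or not.
# The tag names survive, and the "/" in the text is lost.
tr_out=$(printf '%s\n' "$html" | tr -d '</>')

# sed removes whole <...> sequences and leaves the text alone.
sed_out=$(printf '%s\n' "$html" | sed -e 's/<[^>]*>//g')

printf 'tr:\n%s\n\nsed:\n%s\n' "$tr_out" "$sed_out"
```

With tr the output still contains html and title, and Hello/World has become HelloWorld. With sed only Hello/World is left (plus the now-empty lines where the html tags were).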
 
  


All times are GMT -5.