LinuxQuestions.org
Old 11-03-2013, 12:27 PM   #1
robertjinx
Member
 
Registered: Oct 2007
Location: Prague, CZ
Distribution: RedHat / CentOS / Ubuntu / SUSE / Debian
Posts: 749

Rep: Reputation: 73
How to use sed or awk to drop html tags


Hello, I've got the following HTML code:

Code:
<html><head><title>Current IP Check</title></head><body>Current IP Address: 10.10.2.1</body></html>
and I would like to use sed and/or awk to drop all the html tags, like <html>, <title>, etc.

Can someone help?
 
Old 11-03-2013, 12:39 PM   #2
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2405
Using sed or awk to remove HTML tags is rather tricky; you're better off using a dedicated program to do that.

html2text comes to mind, and if you are familiar with Perl, there are some specific modules that can help you.
 
2 members found this post helpful.
Old 11-04-2013, 07:28 AM   #3
robertjinx
Member
 
Registered: Oct 2007
Location: Prague, CZ
Distribution: RedHat / CentOS / Ubuntu / SUSE / Debian
Posts: 749

Original Poster
Rep: Reputation: 73
Your idea works, but it means having html2text installed. I would like something that doesn't need an extra package to be installed.
 
Old 11-04-2013, 07:51 AM   #4
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,850

Rep: Reputation: 7309
You can try sed -e 's/<[^<>]*>//g' filename, but as was mentioned, it is not really safe and may drop other parts as well.
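For illustration, here is a minimal sketch of that one-liner applied to the HTML from the first post. The function name strip_tags and the variable html are made up for this example, and the awk line is just an equivalent substitution, not something posted in the thread. It assumes every tag sits on a single line and contains no stray < or > characters.

```shell
#!/bin/sh
# strip_tags is a hypothetical helper name: it reads HTML on stdin
# and deletes anything that looks like <...> on a single line.
strip_tags() {
    sed -e 's/<[^<>]*>//g'
}

# Sample input taken from the first post of the thread.
html='<html><head><title>Current IP Check</title></head><body>Current IP Address: 10.10.2.1</body></html>'

# sed version: title and body text end up joined, because the tags
# between them are simply deleted.
printf '%s\n' "$html" | strip_tags

# Equivalent substitution with awk (gsub edits the whole line in place).
printf '%s\n' "$html" | awk '{ gsub(/<[^<>]*>/, ""); print }'

# If only the address is wanted, cut away everything up to the label.
printf '%s\n' "$html" | strip_tags | sed -e 's/.*Current IP Address: *//'
```

The first two commands print "Current IP CheckCurrent IP Address: 10.10.2.1" and the last one prints "10.10.2.1". That is good enough for a fixed, known page like this one, but it is not a general HTML-to-text converter.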
 
Old 11-04-2013, 10:47 AM   #5
linosaurusroot
Member
 
Registered: Oct 2012
Distribution: OpenSuSE,RHEL,Fedora,OpenBSD
Posts: 982
Blog Entries: 2

Rep: Reputation: 244
There's a Stack Overflow FAQ on why HTML is not a regular language and is best not handled with regular expressions. You might get away with it in limited cases though.
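To make the "limited cases" concrete, here is a small sketch of inputs where the simple tag-stripping regex from earlier in the thread goes wrong. The helper name strip_tags and all four sample inputs are invented for this example; they are not from the original posts.

```shell
#!/bin/sh
# strip_tags is a hypothetical helper wrapping the sed one-liner
# suggested earlier in the thread.
strip_tags() {
    sed -e 's/<[^<>]*>//g'
}

# 1. A ">" inside an attribute value ends the match too early,
#    so part of the tag leaks into the output.
printf '%s\n' '<img alt="a > b" src="x.png">text' | strip_tags

# 2. sed works line by line, so a tag split across two lines is
#    never matched as a whole and its pieces are left behind.
printf '<a\nhref="x">link</a>\n' | strip_tags

# 3. Only the tags are removed; script source is kept as if it
#    were visible text.
printf '%s\n' '<script>var a = 1;</script>Hello' | strip_tags

# 4. Character entities are not decoded.
printf '%s\n' '<p>Tom &amp; Jerry</p>' | strip_tags
```

None of these outputs is clean text: case 1 leaves a fragment of the img tag, case 2 leaves the opening tag in two pieces, case 3 keeps the JavaScript, and case 4 keeps the literal &amp;. A real parser or a text browser handles all four.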
 
Old 11-04-2013, 12:46 PM   #6
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,309
Blog Entries: 3

Rep: Reputation: 3721
There are too many variations that can cause regular expressions to fail with HTML. You do need a real parser. XHTML, being XML, is a little better, but even there you need a real parser. However, it does not have to be anything fancy. If you have lynx, you can use that.

Code:
lynx -nolist -dump http://www.example.com/
 
1 member found this post helpful.
Old 11-06-2013, 06:47 AM   #7
tombelcher7
Member
 
Registered: Feb 2008
Location: Surrey
Distribution: Debian
Posts: 214

Rep: Reputation: 5
I'm a novice here, but is there any possibility of using JavaScript to walk through the Document Object Model and grab the text nodes?

Just an idea?
 
Old 11-06-2013, 06:54 AM   #8
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,850

Rep: Reputation: 7309
Have you checked my sed one-liner? You can implement something similar in Java(Script) too, but you would probably do better to try a real HTML parser. http://ejohn.org/blog/pure-javascript-html-parser/
http://stackoverflow.com/questions/4...-in-javascript
 
Old 11-07-2013, 02:01 AM   #9
robertjinx
Member
 
Registered: Oct 2007
Location: Prague, CZ
Distribution: RedHat / CentOS / Ubuntu / SUSE / Debian
Posts: 749

Original Poster
Rep: Reputation: 73
Thank you all for your help. It's not exactly what I was looking for, but it does the job.
 
  

