[SOLVED] How to use sed or awk to drop html tags

robertjinx · 11-03-2013, 12:27 PM

Hello, I've got the following HTML code:

Code:

<html><head><title>Current IP Check</title></head><body>Current IP Address: 10.10.2.1</body></html>

and I would like to use sed and/or awk to drop all the html tags, like <html>, <title>, etc.

Can someone help?

druuna · 11-03-2013, 12:39 PM

Using sed or awk to remove HTML tags is rather tricky; you're better off using a dedicated program to do that.

html2text comes to mind and if you are familiar with perl then there are some specific modules that can help you.

robertjinx · 11-04-2013, 07:28 AM

Your idea works, but it means having html2text installed. I would like something that doesn't need an extra package to be installed.

pan64 · 11-04-2013, 07:51 AM

You can try sed -e 's/<[^<>]*>//g' filename, but as was mentioned, it is not really safe and may drop other parts of the text as well.
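
For illustration, here is a rough sketch of that one-liner run against the sample from the first post, with an awk equivalent next to it. The function names strip_tags and strip_tags_awk are made up for this example, and the input is fed through a pipe instead of a file.

```shell
# Sample input copied from the first post.
html='<html><head><title>Current IP Check</title></head><body>Current IP Address: 10.10.2.1</body></html>'

# strip_tags (hypothetical helper name): delete every <...> run on each line.
strip_tags() {
    sed -e 's/<[^<>]*>//g'
}

# The same substitution done with awk's gsub().
strip_tags_awk() {
    awk '{ gsub(/<[^<>]*>/, ""); print }'
}

printf '%s\n' "$html" | strip_tags
# prints: Current IP CheckCurrent IP Address: 10.10.2.1

printf '%s\n' "$html" | strip_tags_awk
# prints: Current IP CheckCurrent IP Address: 10.10.2.1
```

Note that the title and the body text run together, because removing the tags leaves nothing between them.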

linosaurusroot · 11-04-2013, 10:47 AM

There's a Stack Overflow FAQ on why HTML is not a regular language and is best not handled with regular expressions. You might get away with it in limited cases though.
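
To give an idea of where the limited cases end, here is a small sketch with two made-up inputs that trip up a simple tag-stripping regex of the form s/<[^<>]*>//g. The helper name strip_tags and both HTML snippets are only for illustration.

```shell
# strip_tags (hypothetical helper name): a simple per-line tag-stripping regex.
strip_tags() {
    sed -e 's/<[^<>]*>//g'
}

# Case 1: a '>' inside an attribute value ends the match too early,
# so part of the tag is left behind.
printf '%s\n' '<a title="a > b">link</a>' | strip_tags
# prints:  b">link

# Case 2: a tag split across two lines is never matched as a whole,
# because sed processes the input one line at a time.
printf '%s\n' '<p class="x"' '   id="y">text</p>' | strip_tags
# prints:
# <p class="x"
#    id="y">text
```

Both inputs are legal HTML, which is why a real parser is the safer choice for anything beyond simple, known input.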

Turbocapitalist · 11-04-2013, 12:46 PM

There are too many variations that can cause regular expressions to fail with HTML. You do need a real parser. XHTML, being XML, is a little better, but even there you need a real parser. However, it does not have to be anything fancy. If you have lynx, you can use that.

Code:

lynx -nolist -dump http://www.example.com/

tombelcher7 · 11-06-2013, 06:47 AM

I'm a novice here, but is there any possibility of using JavaScript to walk through the Document Object Model and grab the text nodes?

Just an idea?

pan64 · 11-06-2013, 06:54 AM

Have you checked my sed one-liner? You can implement something similar in Java(Script) too, but you could probably try a real HTML parser instead:
http://ejohn.org/blog/pure-javascript-html-parser/
http://stackoverflow.com/questions/4...-in-javascript

robertjinx · 11-07-2013, 02:01 AM

Thank you all for the help. It's not exactly what I was looking for, but it does the job.