Strip HTML tags

rblampain · 06-18-2005, 08:48 AM

Linux learner.
I have been searching the net without success for something to strip the HTML tags from a file. I only want to keep what's between > and <.
Any suggestions? Perhaps someone has a bash script.

Thank you for your help.

druuna · 06-18-2005, 09:00 AM

Hi,

This could be a good candidate: html2txt

http://www.icewalkers.com/Linux/Soft.../html2txt.html

or

http://rpmfind.net/linux/RPM/suse/9....1-73.i586.html

Hope this helps.

Harmaa Kettu · 06-18-2005, 09:10 AM

The Perl Cookbook suggests using lynx:

Code:

lynx -dump file.html > file.txt
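For more than one file, the same command can be wrapped in a small loop. This is only a sketch: html_to_txt is a made-up helper name, and the two extra flags are optional. -nolist drops the numbered list of links that lynx appends to the dump, and -force_html tells lynx to treat the input as HTML. The latter may help if lynx echoes a file back unchanged because it did not recognise it as HTML (for example, no .html suffix), but that diagnosis is a guess.

```shell
# html_to_txt is a hypothetical helper name, not part of lynx.
# For each file given, write the rendered text next to it:
#   page.html -> page.txt, page -> page.txt
html_to_txt() {
  for f in "$@"; do
    # -dump       render the page to standard output and exit
    # -nolist     omit the trailing list of link URLs
    # -force_html interpret the file as HTML regardless of its name
    lynx -dump -nolist -force_html "$f" > "${f%.html}.txt"
  done
}
```

Called as html_to_txt *.html, it leaves the original files untouched and writes one .txt file per input.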

druuna · 06-18-2005, 09:15 AM

Hi,

@Harmaa Kettu: Nice solution.

Another day that I learned something new.

rblampain · 06-19-2005, 02:33 AM

Thank you all. druuna's solution worked well, and I think my otherwise unsuccessful search on the net had also raised the possibility of doing it with lynx.
Top advice.

mad_juno · 08-01-2005, 03:19 AM

Sorry for bringing up this old topic, but I have a similar problem: I need HTML tags stripped from .html files.

The lynx -dump option is nice and tempting (html2text doesn't suit my intentions), but time after time there are files it doesn't work on! Unfortunately I'm no HTML expert and it is almost impossible to determine what goes wrong with lynx. It just outputs the .html file unaltered.
Is there really no better option than writing my own tag stripper in C++ (I don't know Perl)? Any piece of advice? Please? Anybody?

eddiebaby1023 · 08-07-2005, 06:22 AM

Code:

sed 's/<[^>]*>//g' file >newfile

will do it for a single file. It requires each tag's opening and closing brackets to be on the same line. I'll leave you to tailor it to your circumstances.
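If tags do span lines, one possible variation is to have sed keep appending the next line whenever an unclosed < is left over, then try the substitution again. A minimal sketch, with strip_tags as a made-up function name:

```shell
# strip_tags is a hypothetical helper name.
# Reads the named files (or standard input) and writes the text
# with anything between < and > removed, even across line breaks.
strip_tags() {
  sed -e ':a' \
      -e 's/<[^>]*>//g' \
      -e '/</{' \
      -e '$!{' \
      -e 'N' \
      -e 'ba' \
      -e '}' \
      -e '}' "$@"
  # :a               label to loop back to
  # s/<[^>]*>//g     delete every complete tag in the pattern space
  # /</{ $!{ N; ba } }
  #                  if a "<" is still left and this is not the last
  #                  line, append the next line and try again
}
```

Usage would be along the lines of strip_tags file >newfile. It is still a plain text substitution, not an HTML parser: a literal < in the text makes it swallow the following lines until a > turns up, the contents of script and style elements are kept, and entities such as &amp; are not decoded.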