rblampain 06-18-2005 09:48 AM

strip html tags
Linux learner.
I have been searching the net without success for something that strips the HTML tags from a file. I only want to keep what's between > and <.
Any suggestions? Perhaps someone has a bash script.

Thank you for your help.

druuna 06-18-2005 10:00 AM


This could be a good candidate: html2txt


Hope this helps.

Harmaa Kettu 06-18-2005 10:10 AM

The Perl Cookbook suggests using lynx:

lynx -dump file.html > file.txt

druuna 06-18-2005 10:15 AM


@Harmaa Kettu: Nice solution.

Another day that I learned something new :)

rblampain 06-19-2005 03:33 AM

Thank you all. Druuna's solution worked well, and my unsuccessful search on the net had also hinted at the possibility of doing it with lynx.
Top advice.

mad_juno 08-01-2005 04:19 AM

Sorry for bringing up this old topic, but I have a similar problem -- I need HTML tags stripped from .html files :(
The lynx -dump option is nice and tempting (html2text doesn't suit my intentions), but time after time there are files it doesn't work on! Unfortunately I'm no HTML expert, and it is almost impossible for me to determine what goes wrong with lynx. It just outputs the .html file unaltered.
Is there really no better option than writing my own tag stripper in C++ (I don't know Perl)? Any piece of advice? Please? Anybody?

eddiebaby1023 08-07-2005 07:22 AM


sed 's/<[^>]*>//g' file >newfile
will do it for a single file. It requires the opening and closing brackets of each tag to be on the same line. I'll leave you to tailor it for your personal circumstances. :)
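
For tags that are split across lines, one possible variation is to have sed pull in the next line whenever a '<' is left over after the substitution, then try again. This is only a sketch: the function name strip_tags is made up for illustration, it was tried against GNU sed in mind (other seds may treat an N on the last line differently), and a literal '<' in the text that is not part of a tag will make it keep joining lines until the end of the file. Like the one-liner above, it only removes the tags, so the contents of script and style elements stay in the output.

```shell
#!/bin/sh
# strip_tags: hypothetical helper name, not something from the thread.
# Reads HTML on stdin and writes the text with tags removed to stdout.
strip_tags() {
    # :a              label marking the start of the loop
    # s/<[^>]*>//g    delete every complete <...> tag in the pattern space
    # /</{ N; ba; }   if a '<' is still left, the tag is unfinished:
    #                 append the next input line and go back to :a
    sed -e ':a' -e 's/<[^>]*>//g' -e '/</{' -e 'N' -e 'ba' -e '}'
}
```

Usage would be along the lines of: strip_tags < file > newfile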

