LinuxQuestions.org


rblampain 06-18-2005 08:48 AM

strip html tags
 
Linux learner.
I have been searching the net without success for something to strip the HTML tags from a file. I only want to keep what's between > and <.
Any suggestions? Perhaps someone has a bash script.

Thank you for your help.

druuna 06-18-2005 09:00 AM

Hi,

This could be a good candidate: html2txt

http://www.icewalkers.com/Linux/Soft.../html2txt.html

or

http://rpmfind.net/linux/RPM/suse/9....1-73.i586.html

Hope this helps.

Harmaa Kettu 06-18-2005 09:10 AM

The Perl Cookbook suggests using lynx:
Code:

lynx -dump file.html > file.txt
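To run the same idea over a whole directory, something like the following might do. It is only a sketch: html_to_text and convert_dir are made-up helper names, and the extra flags are my own additions to the command above. -nolist drops the numbered link list that lynx appends, and -force_html asks lynx to parse the input as HTML even if it does not recognise the file as such.

```shell
# html_to_text FILE: print FILE rendered as plain text.
# (Hypothetical helper name; only the lynx flags are real.)
html_to_text() {
    # -dump       write the rendered page to stdout and exit
    # -nolist     omit the list of links at the end of the dump
    # -force_html treat the input as HTML whatever its name says
    lynx -dump -nolist -force_html "$1"
}

# convert_dir DIR: write DIR/name.txt next to every DIR/name.html.
convert_dir() {
    for f in "$1"/*.html; do
        [ -e "$f" ] || continue    # no matches: the glob stays literal
        html_to_text "$f" > "${f%.html}.txt"
    done
}
```

Calling convert_dir . converts every .html file in the current directory and leaves the originals alone.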

druuna 06-18-2005 09:15 AM

Hi,

@Harmaa Kettu: Nice solution.

Another day that I learned something new :)

rblampain 06-19-2005 02:33 AM

Thank you all. Druuna's solution worked well, and I think my unsuccessful search on the net had also turned up the possibility of doing it with lynx.
Top advice.

mad_juno 08-01-2005 03:19 AM

Sorry for bringing up this old topic, but I have a similar problem: I need HTML tags stripped from .html files :(
The lynx -dump option is nice and tempting (html2text doesn't suit my purposes), but every so often there are files it doesn't work on! Unfortunately I'm no HTML expert, and it is almost impossible to determine what goes wrong with lynx. It just outputs the .html file unaltered.
Is there really no better option than writing my own tag stripper in C++? (I don't know Perl.) Any advice? Please? Anybody?

eddiebaby1023 08-07-2005 06:22 AM

Code:

sed 's/<[^>]*>//g' file >newfile
will do it for a single file. It requires the opening and closing brackets to be on the same line. I'll leave you to tailor it for your personal circumstances. :)
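
If a tag is split across lines, one possible tailoring is to let sed pull in the next line whenever a "<" is still left over after the substitution. This is a rough sketch, not a real HTML parser: strip_tags is a made-up function name, a stray "<" in ordinary text will make it keep joining lines until a ">" turns up, and it does nothing about entities such as &amp; or the contents of script and style blocks.

```shell
# strip_tags [FILE...]: delete <...> tags, including tags that span
# several lines. Reads stdin when no FILE is given.
# (Hypothetical helper name; the sed commands are standard.)
strip_tags() {
    sed '
        :a
        # remove every complete tag currently in the pattern space
        s/<[^>]*>//g
        # an unclosed "<" remains: append the next line and retry,
        # unless this is already the last line of input
        /</{
            $!{
                N
                ba
            }
        }
    ' "$@"
}
```

Used as strip_tags file >newfile, or in a loop such as for f in *.html; do strip_tags "$f" >"${f%.html}.txt"; done to handle more than one file.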


All times are GMT -5.