LinuxQuestions.org
06-18-2005, 08:48 AM   #1
rblampain
Member
 
Registered: Aug 2004
Location: Western Australia
Distribution: Debian 7
Posts: 833

Rep: Reputation: 35
strip html tags


Linux learner.
I have been searching the net without success for something to strip the HTML tags from a file. I only want to keep what's between > and <.
Any suggestions? Perhaps someone has a bash script?

Thank you for your help.
 
06-18-2005, 09:00 AM   #2
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2371
Hi,

This could be a good candidate: html2txt

http://www.icewalkers.com/Linux/Soft.../html2txt.html

or

http://rpmfind.net/linux/RPM/suse/9....1-73.i586.html

Hope this helps.
 
06-18-2005, 09:10 AM   #3
Harmaa Kettu
Member
 
Registered: Apr 2005
Location: Finland
Posts: 196

Rep: Reputation: 30
The Perl Cookbook suggests using lynx:
Code:
lynx -dump file.html > file.txt
 
06-18-2005, 09:15 AM   #4
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2371
Hi,

@Harmaa Kettu: Nice solution.

Another day that I learned something new
 
06-19-2005, 02:33 AM   #5
rblampain
Member
 
Registered: Aug 2004
Location: Western Australia
Distribution: Debian 7
Posts: 833

Original Poster
Rep: Reputation: 35
Thank you all. Druuna's solution worked well, and I think my unsuccessful search on the net had also hinted at the possibility of doing it with lynx.
Top advice.
 
08-01-2005, 03:19 AM   #6
mad_juno
LQ Newbie
 
Registered: Jul 2005
Posts: 3

Rep: Reputation: 0
Sorry for bringing up this old topic, but I have a similar problem: I need HTML tags stripped from .html files.
The lynx -dump option is nice and tempting (html2text doesn't suit my purposes), but time and again there are files it doesn't work on! Unfortunately I'm no HTML expert, and it is almost impossible to work out what goes wrong with lynx. It just outputs the .html file unaltered.
Is there really no better option than writing my own tag stripper in C++ (I don't know Perl)? Any piece of advice? Please? Anybody?
 
08-07-2005, 06:22 AM   #7
eddiebaby1023
Member
 
Registered: May 2005
Posts: 378

Rep: Reputation: 33
Code:
sed 's/<[^>]*>//g' file >newfile
will do it for a single file. It requires the opening and closing brackets of each tag to be on the same line. I'll leave you to tailor it for your personal circumstances.
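
For anyone whose tags wrap across lines, here is a minimal sketch that extends the one-liner above. The function name strip_tags is made up for illustration, and the approach assumes that every "<" in the input opens a tag: a literal "<" in the text would make sed keep pulling in lines until it finds a ">". It is a regex hack rather than an HTML parser, so entities such as &amp; are not decoded and the contents of script and style blocks are kept as text.

```shell
#!/bin/sh
# strip_tags: hypothetical helper name, not something from the thread.
# Reads the named files (or stdin if none are given) and writes the
# text with <...> tags removed, including tags that span several lines.
strip_tags() {
    # :a             label marking the top of the loop
    # s/<[^>]*>//g   delete every complete tag currently in the pattern space
    # /</{ N; ba }   if an unclosed "<" is left over, append the next input
    #                line and jump back to :a to try the deletion again
    sed -e ':a' -e 's/<[^>]*>//g' -e '/</{' -e 'N' -e 'ba' -e '}' "$@"
}

# Demo: the opening <p ...> tag is split over two lines.
printf '<p\nclass="intro">Hello <b>world</b></p>\n' | strip_tags
# prints: Hello world
```

For a real file it would be called the same way as the original command, for example strip_tags file >newfile.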
 
  


All times are GMT -5.
