LinuxQuestions.org
02-27-2011, 08:55 AM   #1
roBuntu1967
LQ Newbie
 
Registered: Feb 2011
Location: Dallas
Distribution: Ubuntu
Posts: 2

Rep: Reputation: 0
Need help extracting text from .htm files


I downloaded (using wget) almost 3000 .htm files from a dictionary web site. Now I want to write a script that will extract the text from these .htm files. I'm a total newbie with awk/sed/perl/grep. Any suggestions?
 
02-27-2011, 09:00 AM   #2
arizonagroovejet
Senior Member
 
Registered: Jun 2005
Location: England
Distribution: openSUSE, Fedora, CentOS
Posts: 1,078

Rep: Reputation: 195
There's a utility called html2text. It's probably available in the repos of whatever distro you're using, most likely in a package that is also called html2text. You might even have it installed already. To check:


Code:
$ which html2text
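
If it is installed, a loop along these lines could convert the whole download in one go. This is only a minimal sketch under a few assumptions that are not from the thread: the function name `convert_all` and the directory names are made up for illustration, and it assumes your html2text takes an input file as its argument and writes plain text to standard output. Several different programs ship under that name, so check `man html2text` on your system first.

```shell
#!/bin/sh
# Sketch: convert every .htm file in one directory to a .txt file in another.
# Assumption: the converter reads the file named as its first argument and
# prints plain text on stdout. Override CONVERTER to try a different tool.
CONVERTER="${CONVERTER:-html2text}"

# convert_all SRC_DIR OUT_DIR  (hypothetical helper, not part of html2text)
convert_all() {
    src_dir=$1
    out_dir=$2

    # Stop early if the converter is not available.
    if ! command -v "$CONVERTER" >/dev/null 2>&1; then
        echo "converter not found: $CONVERTER" >&2
        return 1
    fi

    mkdir -p "$out_dir" || return 1

    for f in "$src_dir"/*.htm; do
        # If nothing matched, the pattern stays unexpanded; skip it.
        [ -e "$f" ] || continue
        name=$(basename "$f" .htm)
        if ! "$CONVERTER" "$f" > "$out_dir/$name.txt"; then
            echo "failed: $f" >&2
        fi
    done
}

# Example call (directory names are placeholders):
# convert_all ./dictionary ./text
```

Each page ends up as a .txt file with the same base name, and any file the converter fails on is reported on standard error instead of stopping the run.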
 
03-07-2011, 06:51 AM   #3
roBuntu1967
LQ Newbie
 
Registered: Feb 2011
Location: Dallas
Distribution: Ubuntu
Posts: 2

Original Poster
Rep: Reputation: 0
Thanks

OK, I will try html2text. Thanks!
 
03-07-2011, 07:24 AM   #4
knudfl
LQ 5k Club
 
Registered: Jan 2008
Location: Copenhagen, DK
Distribution: pclos2016, Slack14.1 Deb Jessie, + 50+ other Linux OS, for test only.
Posts: 16,276

Rep: Reputation: 3154
This version of html2txt works perfectly (html2text doesn't):

http://www.linuxquestions.org/questi...5&d=1269459223
 
  


All times are GMT -5.