LinuxQuestions.org
Old 11-18-2004, 10:44 AM   #1
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1946
html to text + encoding?


I have a bunch of html etexts that I want to convert to simple text format so that I can read them on my pda text reader. Is there a simple program or script that can strip the codes from multiple html files at once then concatenate them in order into a single text file?

Second, I need to make sure that the text is in a character encoding the reader can handle. I'm using an older, pre-Linux, Japanese Sharp Zaurus, and the text reader it has seems very limited. UTF-8 doesn't work so well in it; it has problems with punctuation characters. Western ISO-8859-1 encoding would probably work, but I haven't tried that yet. The best would be a Japanese encoding like Shift_JIS, since then I could convert Japanese texts as well.

Any advice here, or should I just do it all by hand?
 
Old 11-18-2004, 11:38 AM   #2
ahh
Member
 
Registered: May 2004
Location: UK
Distribution: Gentoo
Posts: 293

Rep: Reputation: 31
How about opening them in a browser, then copying and pasting the text into a text editor and saving it in the required character encoding?
 
Old 11-18-2004, 12:18 PM   #3
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Original Poster
Rep: Reputation: 1946
Well, yes. That's what I was talking about when I mentioned doing it manually. But some of these texts span a dozen or more separate html pages. I was hoping for a way to do them all in batches so I don't have to open them all up individually and manually C&P them.
 
Old 11-18-2004, 12:34 PM   #4
ahh
Member
 
Registered: May 2004
Location: UK
Distribution: Gentoo
Posts: 293

Rep: Reputation: 31
Maybe an alternative approach would be to create one large html file from all the separate ones, then you would only have to copy and paste once.

If they are just text all you would have to do is remove the html, head and body tags from all but the first and last page, although I expect modern browsers could cope even if you didn't do this.
 
Old 11-19-2004, 03:19 AM   #5
theYinYeti
Senior Member
 
Registered: Jul 2004
Location: France
Distribution: Arch Linux
Posts: 1,897

Rep: Reputation: 61
Let's assume you're in the directory under which all the .html files are located, and you want them all in one .txt file:
Code:
find . \( -type f -o -type l \) \( -iname "*.html" -o -iname "*.htm" \) -exec links -dump {} \; >allHTML.txt
If you don't have links, lynx should do too.
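That fallback can be made explicit in a script. A rough sketch, where the function name pick_dumper is invented for illustration and the only options relied on are the -dump flags of links and lynx:

```shell
# pick_dumper: print the first available text-dump command, or fail.
# The function name is hypothetical; it only checks what is installed.
pick_dumper() {
  if command -v links >/dev/null 2>&1; then
    echo "links -dump"
  elif command -v lynx >/dev/null 2>&1; then
    echo "lynx -dump"
  else
    # Neither browser is installed.
    return 1
  fi
}
```

It could then be used as `dumper=$(pick_dumper) && $dumper page.html`, leaving $dumper unquoted on purpose so that the command and its option are split into separate words.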

Yves
 
Old 11-19-2004, 12:58 PM   #6
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Original Poster
Rep: Reputation: 1946
YinYeti, thanks. That looks like it might work. I'm unable to get into my Linux system just yet to try it out, but I will as soon as possible. Could you explain the commands in a bit more detail? I want to understand what I'm putting in.

On the other hand, I just found html2text, which looks like it might be more what I'm looking for. I've decided I want to write a bash script that I can call up any time I need it. I'm going to set it to cat the html files together then run them through this program (or vice versa, whichever works best), and finally write to a text file.

At least this will take care of the conversion to text. If I can only find a command line encoding converter that would do what I want, I could add it to the script as well. But at least I can convert it through OpenOffice or something.
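One candidate for the command-line converter is iconv, which nobody in this thread has mentioned but which ships with most Linux systems. A minimal sketch, assuming the text coming out of the HTML converter is UTF-8; the function name to_encoding is made up for illustration:

```shell
# to_encoding TARGET [SOURCE]
# Reads text on stdin and writes it to stdout re-encoded as TARGET.
# SOURCE defaults to UTF-8, which is an assumption about the input.
to_encoding() {
  iconv -f "${2:-UTF-8}" -t "$1"
}
```

Something like `to_encoding SHIFT_JIS <allHTML.txt >allHTML-sjis.txt` would then write a Shift_JIS copy. `iconv -l` lists the encoding names the local build accepts, and iconv stops with an error when a character has no equivalent in the target encoding, so some punctuation may still need replacing by hand.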

I also want to try to do the same with some pdf files I have. Any recommendations on a good pdf to text converter?

Last edited by David the H.; 11-19-2004 at 01:00 PM.
 
Old 11-22-2004, 05:10 AM   #7
theYinYeti
Senior Member
 
Registered: Jul 2004
Location: France
Distribution: Arch Linux
Posts: 1,897

Rep: Reputation: 61
Code:
find . \( -type f -o -type l \) \( -iname "*.html" -o -iname "*.htm" \) -exec links -dump {} \; >allHTML.txt
Find, starting here (find .), all regular files (-type f) or (\( ... -o ... \)) symbolic links (-type l) whose name, regardless of character case (-iname), ends in .html (*.html) or .htm (*.htm), and for each item found that fulfills those requirements, execute (-exec ... \;) the command "links -dump" with the found item ({}) as its argument. The output of all this is redirected to the file "allHTML.txt".
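To see which names the expression accepts, a throwaway test along these lines can help. The scratch directory and the three file names are invented for the demonstration:

```shell
# Build a scratch directory with three files; only two have HTML names.
demo=$(mktemp -d)
touch "$demo/Page1.HTML" "$demo/page2.htm" "$demo/notes.txt"

# Same tests as the command above, but only listing the matches.
matches=$(find "$demo" \( -type f -o -type l \) \( -iname "*.html" -o -iname "*.htm" \) -print)
printf '%s\n' "$matches"

# Clean up the scratch directory.
rm -rf "$demo"
```

Page1.HTML is listed because -iname ignores case, while notes.txt is skipped because it matches neither pattern.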

"links -dump file.html" writes the text equivalent of the HTML page to the console. You can just as well use another command inside the -exec ... \; to get the same result. Or you could write it this way, which is more flexible:
Code:
find . \( -type f -o -type l \) \( -iname "*.html" -o -iname "*.htm" \) -print | while IFS= read -r file; do
  # do something with "$file"; here, dump it as plain text
  links -dump "$file"
done
Yves.
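Since find does not promise any particular order, a reusable version of that loop might sort the file list first so the pages are concatenated in name order. The sketch below is only one way to do it: the function name html_to_text and its arguments are hypothetical, and the converter defaults to the links -dump command from this thread.

```shell
# html_to_text DIR OUT [CONVERTER...]
# Hypothetical helper: converts every .html/.htm file under DIR, in
# sorted path order, and concatenates the results into OUT.
html_to_text() {
  html_dir=$1
  out_file=$2
  shift 2
  # Default to the converter used in the thread; any command that takes
  # a file name and prints text on stdout can be passed instead.
  [ "$#" -gt 0 ] || set -- links -dump
  find "$html_dir" \( -type f -o -type l \) \( -iname "*.html" -o -iname "*.htm" \) -print |
    LC_ALL=C sort |
    while IFS= read -r file; do
      # Keep the converter from consuming the file list on stdin.
      "$@" "$file" </dev/null
    done >"$out_file"
}
```

Called as `html_to_text . allHTML.txt` it should behave like the one-liner above, apart from the ordering. Names are sorted as plain strings, so page10.html comes before page2.html unless the numbers are zero-padded, and file names containing newlines are not handled.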
 
  

