LinuxQuestions.org - html to text + encoding?


David the H.

11-18-2004 10:44 AM

html to text + encoding?

I have a bunch of html etexts that I want to convert to simple text format so that I can read them on my pda text reader. Is there a simple program or script that can strip the codes from multiple html files at once then concatenate them in order into a single text file?

Second, I need to make sure that the text is in a character encoding the reader can handle. I'm using an older, pre-Linux, Japanese Sharp Zaurus, and the text reader it has seems very limited. UTF-8 doesn't work so well in it; it has problems with punctuation characters. Western ISO-8859-1 encoding would probably work, but I haven't tried that yet. The best would be a Japanese encoding like Shift_JIS, so that I could convert Japanese texts as well.

Any advice here, or should I just do it all by hand?

ahh

11-18-2004 11:38 AM

How about opening them in a browser, then just copying and pasting the text into a text editor and saving it in the required character encoding?

David the H.

11-18-2004 12:18 PM

Well, yes. That's what I was talking about when I mentioned doing it manually. But some of these texts span a dozen or more separate html pages. I was hoping for a way to do them all in batches so I don't have to open them all up individually and manually C&P them.

ahh

11-18-2004 12:34 PM

Maybe an alternative approach would be to create one large html file from all the separate ones; then you would only have to copy and paste once.

If they are just text, all you would have to do is remove the html, head and body tags from all but the first and last page, although I expect modern browsers could cope even if you didn't do this.

theYinYeti

11-19-2004 03:19 AM

Let's assume you're in the directory under which all the .html files are located, and you want to have them all in one .txt file:

Code:

find . \( -type f -o -type l \) \( -iname "*.html" -o -iname "*.htm" \) -exec links -dump {} \; >allHTML.txt

If you don't have links, lynx should do too.

Yves

David the H.

11-19-2004 12:58 PM

YinYeti, thanks. That looks like it might work. I'm unable to get into my Linux system just yet to try it out, but I will as soon as possible. Could you explain the commands in a bit more detail? I want to understand what I'm putting in.

On the other hand, I just found html2text, which looks like it might be more what I'm looking for. I've decided I want to write a bash script that I can call up any time I need it. I'm going to set it to cat the html files together then run them through this program (or vice versa, whichever works best), and finally write to a text file.

At least this will take care of the conversion to text. If I can only find a command line encoding converter that would do what I want, I could add it to the script as well. But at least I can convert it through OpenOffice or something.
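For the command-line encoding converter mentioned above, iconv is a common choice on Linux systems. What follows is only a minimal sketch, assuming the extracted text arrives as UTF-8. The helper name to_encoding is made up for illustration, and the exact spelling of encoding names can differ between systems (iconv -l lists the ones available).

```shell
# Sketch: re-encode text read from standard input with iconv.
# to_encoding is a hypothetical helper name, not part of any tool.
#   $1 = target encoding, e.g. ISO-8859-1 or SHIFT_JIS
#   $2 = source encoding (assumed to be UTF-8 when omitted)
to_encoding() {
    # -c omits characters that cannot be represented in the target
    # encoding instead of aborting the conversion.
    iconv -c -f "${2:-UTF-8}" -t "$1"
}
```

It could then sit at the end of a pipeline, for example: links -dump page.html | to_encoding SHIFT_JIS > page.txt. Whether the dump really is UTF-8 depends on how the converter is configured, so the source encoding may need to be passed explicitly.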

I also want to try to do the same with some pdf files I have. Any recommendations on a good pdf to text converter? :)

theYinYeti

11-22-2004 05:10 AM

Code:

find . \( -type f -o -type l \) \( -iname "*.html" -o -iname "*.htm" \) -exec links -dump {} \; >allHTML.txt

Find here (find .) all files (-type f) or (\( ... -o ... \)) symbolic links (-type l) whose name (-iname), regardless of character case, ends in .html (*.html) or .htm (*.htm), and for each item found that fulfills those requirements, execute (-exec ... \;) the command "links -dump" with the found item ({}) as an argument. The output of all this is redirected to the file "allHTML.txt".

"links -dump file.html" writes the text equivalent of the HTML page to the console. You can just as well use another command inside the -exec ... \; to get the same result. Or you could write this command, which is more flexible:

Code:

find . \( -type f -o -type l \) \( -iname "*.html" -o -iname "*.htm" \) -print | while IFS= read -r file; do
  # do something with "$file" (keep the quotes so names containing spaces survive)
done

Yves.
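The loop skeleton above could be filled in roughly as follows. This is a sketch under stated assumptions, not a tested recipe for any particular set of etexts: the function name html_dir_to_text and the HTML_DUMP variable are hypothetical, file names are assumed to contain no newline characters, and a plain sort is assumed to give the wanted page order (it puts page10 before page2, so zero-padded names work best).

```shell
# Sketch: convert every .html/.htm file under a directory to text, in
# sorted order, and write the concatenated result to standard output.

# HTML_DUMP is a hypothetical variable holding the converter command.
# "links -dump" is the command used in this thread.
HTML_DUMP=${HTML_DUMP:-links -dump}

html_dir_to_text() {
    # $1 = directory to search (current directory when omitted)
    find "${1:-.}" \( -type f -o -type l \) \
        \( -iname "*.html" -o -iname "*.htm" \) -print |
        sort |
        while IFS= read -r file; do
            # $HTML_DUMP is deliberately unquoted so that it splits into
            # the command and its options. </dev/null stops the converter
            # from consuming the list of file names on standard input.
            $HTML_DUMP "$file" </dev/null
        done
}
```

A call such as html_dir_to_text . > allHTML.txt would then produce a single text file, much like the find ... -exec one-liner, but with the pages in sorted order.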

All times are GMT -5.