html2txt ?

Xeratul · 07-29-2017, 06:15 AM

Hello,

I am writing a Wikipedia fetcher in C that works from a list of pages. It can already build HTML books (the goal is ebooks).

However, I would like to convert the HTML to text for simplicity.

1. elinks can convert HTML to text, but what other alternatives do you know of?
2. The gumbo library plus html2txt, which uses it: https://github.com/lecram/html2txt
3. lynx -dump page.html >> merge.txt

Thank you!

Turbocapitalist · 07-29-2017, 06:55 AM

1. The text-based web browser, lynx, has a -dump option which will interpret and lay out the web page for you as plain text:

Code:

lynx -dump http://www.example.com/

Or are you looking for some C library?

Xeratul · 07-29-2017, 08:01 AM

Quote:

Originally Posted by Turbocapitalist

1. The text-based web browser, lynx, has a -dump option which will interpret and lay out the web page for you as plain text:

Code:

lynx -dump http://www.example.com/

Or are you looking for some C library?

The problem with the gumbo library is that its output is unreadable as a book because of the numerous links and image references (jpg, png, svg and so on).

I tried to do it in C, reading from stdin, but it is not easy. There is gumbo, but I want something static.
I can now parse the HTML a little, but it is not easy to write a parser from scratch.

If you know of something static that parses HTML and does the same as html2txt, I would gladly take it. I'd like something portable that works even on Windows with just the basics: stdio, string, stdlib (minimum).
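For reference, the "from scratch" approach can be sketched as a tiny state machine that only tracks whether the current character is inside a tag. This is a minimal sketch, not a replacement for gumbo or html2txt: the function name strip_tags_simple is hypothetical, and it assumes that every '<' opens a tag and the next '>' closes it.

```c
#include <stddef.h>

/* Hypothetical helper: copy `in` to `out`, dropping everything between
   '<' and the next '>'. Writes at most outsz-1 characters plus a
   terminating NUL (outsz must be at least 1). Returns the number of
   characters written, not counting the NUL. */
size_t strip_tags_simple(const char *in, char *out, size_t outsz)
{
    size_t n = 0;
    int in_tag = 0;            /* 1 while we are between '<' and '>' */

    for (; *in != '\0' && n + 1 < outsz; in++) {
        if (*in == '<')
            in_tag = 1;        /* a tag starts: stop copying */
        else if (*in == '>' && in_tag)
            in_tag = 0;        /* the tag ends: resume copying */
        else if (!in_tag)
            out[n++] = *in;    /* ordinary text, including a lone '>' */
    }
    out[n] = '\0';
    return n;
}
```

Called on `<p>Hello <b>world</b>!</p>` it leaves `Hello world!` in the buffer. It does not decode entities such as `&amp;`, and it keeps the text inside script and style elements, so real Wikipedia pages would still need more work.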

I have this main list for an ebook:
https://pastebin.com/raw/84YFd4VA

For each line, the program fetches the rendered HTML of the wiki page:

Code:

    // wiki format (raw wikitext)
    // https://en.wikipedia.org/w/index.php?action=raw&title=Linux

    // HTML format (rendered page, saved as an HTML file)
    // https://en.wikipedia.org/w/index.php?action=render&title=Linux
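Building these URLs from a line of the list could look roughly like the sketch below. The function name wiki_url is hypothetical. It assumes one page title per line, turns spaces into underscores (which Wikipedia treats as equivalent in titles), and percent-encodes every byte that is not an unreserved URL character. It needs a C99 compiler for snprintf.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the index.php URL for `title` with the given
   action ("raw" or "render"). Returns 0 on success, -1 if `out` is too
   small to hold the whole URL. */
int wiki_url(const char *title, const char *action, char *out, size_t outsz)
{
    static const char hex[] = "0123456789ABCDEF";
    const unsigned char *p;
    size_t pos;
    int n = snprintf(out, outsz,
                     "https://en.wikipedia.org/w/index.php?action=%s&title=",
                     action);

    if (n < 0 || (size_t)n >= outsz)
        return -1;                        /* prefix alone did not fit */
    pos = (size_t)n;

    for (p = (const unsigned char *)title; *p != '\0'; p++) {
        unsigned char c = (*p == ' ') ? '_' : *p;

        if (isalnum(c) || strchr("-._~", c) != NULL) {
            if (pos + 1 >= outsz)         /* need room for c and the NUL */
                return -1;
            out[pos++] = (char)c;
        } else {
            if (pos + 3 >= outsz)         /* need room for %XX and the NUL */
                return -1;
            out[pos++] = '%';
            out[pos++] = hex[c >> 4];
            out[pos++] = hex[c & 15];
        }
    }
    out[pos] = '\0';
    return 0;
}
```

With the title `Linux` and the action `render` this gives back the second URL shown above. The resulting string can then be handed to whatever does the download.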

So far, I have built an ebook from a wiki book with this:

Code:

for i in *php* ; do elinks -dump -no-numbering -no-references "$i" >> merge4.txt ; done

The result is readable enough for an ebook, but it could be better.

Maybe it would be nice to have an index.html that links to the HTML or TXT pages.

TXT is better since it loads faster on an ebook reader. A reader has limited resources and takes time to render an HTML document, so TXT is much better.

AwesomeMachine · 08-02-2017, 02:39 AM

You can try Perl: http://search.cpan.org/dist/HTML-Strip/Strip.pm

Teufel · 08-02-2017, 02:48 AM

If PHP is available:
strip_tags

Xeratul · 08-02-2017, 05:14 AM

So far I have used a C program which is simple and efficient.

here

Still experimenting, but this C program looks reliable.