Quote:
Originally Posted by Turbocapitalist
1. The text-based web browser, lynx, has a -dump option which will interpret and lay out the web page for you as plain text:
Code:
lynx -dump http://www.example.com/
Or are you looking for some C library?

The problem with the gumbo library is that the output is unreadable because of the numerous links / jpg / png / svg entries and other stuff that makes it impossible to read like a book.
I tried to write this in C, reading from stdin... it is not easy. There is gumbo, but I want something static.
I can now parse the HTML a bit, but it is not easy to write from scratch.
If you have a static parser that does the same as html2txt, I will gladly take it... I'd like something portable, working even on Windows, with just the basics: stdio, string, stdlib (at minimum).
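Something like this minimal sketch is the kind of thing I mean. It is not a real parser (it assumes reasonably well-formed HTML, does not skip <script>/<style> contents or comments, and only knows a handful of entities), but it needs nothing beyond stdio and string:
Code:
/* crude html2txt sketch: strip tags, decode a few common entities */
#include <stdio.h>
#include <string.h>

int main(void)
{
    int c, in_tag = 0;
    char ent[8];

    while ((c = getchar()) != EOF) {
        if (c == '<') { in_tag = 1; continue; }
        if (c == '>') { in_tag = 0; continue; }
        if (in_tag)
            continue;                 /* drop everything inside <...> */
        if (c == '&') {               /* decode the most common entities */
            size_t n = 0;
            while (n < sizeof ent - 1 && (c = getchar()) != EOF && c != ';')
                ent[n++] = (char)c;
            ent[n] = '\0';
            if (strcmp(ent, "amp") == 0)       putchar('&');
            else if (strcmp(ent, "lt") == 0)   putchar('<');
            else if (strcmp(ent, "gt") == 0)   putchar('>');
            else if (strcmp(ent, "nbsp") == 0) putchar(' ');
            /* other entities are simply dropped */
            continue;
        }
        putchar(c);
    }
    return 0;
}
Usage would be something like: html2txt < page.html > page.txt (the program name is just an example).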
I have this main page list for an ebook:
https://pastebin.com/raw/84YFd4VA
From each line, one can fetch the HTML (rendered) wiki page:
Code:
// wiki format (raw wikitext):
// https://en.wikipedia.org/w/index.php?action=raw&title=Linux
// html format (rendered page):
// https://en.wikipedia.org/w/index.php?action=render&title=Linux
// the render action returns the page as an html file
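If the list holds the raw-format links, a small filter like this sketch could rewrite them into render links (it assumes action=raw appears at most once per line; the program and file names below are just examples):
Code:
/* sketch: turn action=raw urls from the list into action=render urls */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[1024];

    while (fgets(line, sizeof line, stdin)) {
        char *p = strstr(line, "action=raw");
        if (p) {
            *p = '\0';    /* cut the line at the parameter */
            printf("%saction=render%s", line, p + strlen("action=raw"));
        } else {
            fputs(line, stdout);   /* pass other lines through unchanged */
        }
    }
    return 0;
}
Then raw2render < list.txt > render-list.txt gives the render URLs.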
So far, I have made an ebook from the wiki pages with this:
Code:
ls -1 *php* | while read -r i ; do elinks -dump -no-numbering -no-references "$i" >> merge4.txt ; done
It still looks readable enough for an ebook, but it could be better.
Maybe it would be nice to have an index.html that links to the html or txt pages; see the sketch at the end of this post.
TXT is better since it opens faster on an ebook reader. An ebook reader is low-resource hardware and takes time to render an HTML document, so plain txt is much better.
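For the index idea, a sketch like this could build index.html from a file listing (the program name and the ls pattern are just examples):
Code:
/* sketch: write an index.html with one link per file name read on stdin */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char name[512];

    puts("<html><body><ul>");
    while (fgets(name, sizeof name, stdin)) {
        name[strcspn(name, "\r\n")] = '\0';   /* strip the newline */
        if (name[0] == '\0')
            continue;                         /* skip blank lines */
        printf("<li><a href=\"%s\">%s</a></li>\n", name, name);
    }
    puts("</ul></body></html>");
    return 0;
}
For example: ls -1 *.txt | mkindex > index.html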