LinuxQuestions.org
07-29-2017, 06:15 AM   #1
Xeratul
Senior Member
 
Registered: Jun 2006
Location: UNIX
Distribution: FreeBSD
Posts: 2,657

Rep: Reputation: 255
html2txt ?


Hello,

I am writing a Wikipedia fetcher that works from a list of page titles. I can already build HTML books (intended for ebook readers) in C.

However, I would like to convert the HTML to plain text for simplicity.

1. elinks can convert HTML to text, but which alternatives do you know of?
2. The gumbo library, plus html2txt, which uses it: https://github.com/lecram/html2txt
3. lynx -dump page.html >> merge.txt

thank you!

Last edited by Xeratul; 07-29-2017 at 08:05 AM.
 
07-29-2017, 06:55 AM   #2
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,309
Blog Entries: 3

Rep: Reputation: 3721
1. The text-based web browser, lynx, has a -dump option which will interpret and lay out the web page for you as plain text:

Code:
lynx -dump http://www.example.com/
Or are you looking for some C library?
 
07-29-2017, 08:01 AM   #3
Xeratul
Senior Member
 
Registered: Jun 2006
Location: UNIX
Distribution: FreeBSD
Posts: 2,657

Original Poster
Rep: Reputation: 255
Quote:
Originally Posted by Turbocapitalist
1. The text-based web browser, lynx, has a -dump option which will interpret and lay out the web page for you as plain text:

The problem with the gumbo library is that its output is unreadable: the numerous links and image references (jpg, png, ... svg and so on) make it impossible to read as a book.

Code:
lynx -dump http://www.example.com/
Or are you looking for some C library?
I tried to do it in C, reading from stdin ... it is not easy: there is gumbo, but I want something static.
I can now parse the HTML a little, but writing a parser from scratch is not easy.

If you know of something static that parses HTML and does the same job as html2txt, I will gladly take it... I'd like something portable that works even on Windows, using only the basics: stdio, string, stdlib (at minimum).
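A minimal sketch of what such a dependency-free converter could look like, assuming that dropping everything between '<' and '>' and decoding a handful of entities is good enough. The function name html_strip and the entity table are hypothetical, not taken from html2txt or gumbo, and the code does not handle comments, <script>/<style> bodies, or numeric entities.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Small, deliberately incomplete table of entities to decode. */
static const struct { const char *name; char ch; } entities[] = {
    { "&amp;", '&' }, { "&lt;", '<' }, { "&gt;", '>' }, { "&quot;", '"' }
};

/* Hypothetical helper: copies html into out, skipping tags and decoding
 * the entities listed above. Output is truncated to fit outsize and is
 * always NUL-terminated. Returns the number of characters written. */
size_t html_strip(const char *html, char *out, size_t outsize)
{
    size_t n = 0, i;
    int in_tag = 0;

    assert(html != NULL && out != NULL && outsize > 0);

    while (*html != '\0') {
        size_t step = 1;      /* input bytes consumed this round */
        char c = *html;
        int emit = 0;

        if (in_tag) {
            if (c == '>')
                in_tag = 0;   /* end of tag, resume copying */
        } else if (c == '<') {
            in_tag = 1;       /* start of tag, stop copying */
        } else {
            emit = 1;
            if (c == '&') {
                for (i = 0; i < sizeof entities / sizeof entities[0]; i++) {
                    size_t len = strlen(entities[i].name);
                    if (strncmp(html, entities[i].name, len) == 0) {
                        c = entities[i].ch;
                        step = len;
                        break;
                    }
                }
            }
        }
        if (emit && n + 1 < outsize)
            out[n++] = c;
        html += step;
    }
    out[n] = '\0';
    return n;
}
```

Unknown entities are copied through unchanged. A filter that reads stdin in chunks would have to keep the in_tag state between reads, because a tag can span a chunk boundary.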

I have this main list for an ebook:
https://pastebin.com/raw/84YFd4VA

For each line, one fetches the rendered HTML of the wiki page:
Code:
    // raw wikitext:
    // https://en.wikipedia.org/w/index.php?action=raw&title=Linux

    // rendered HTML (this is what gets saved as the HTML file):
    // https://en.wikipedia.org/w/index.php?action=render&title=Linux
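The two URL forms above differ only in the action parameter, so one helper can build both from a line of the list. A rough sketch, assuming each line holds a plain page title; the name wiki_url is hypothetical, and only spaces are replaced (by '_', which MediaWiki treats as a space in titles). Other characters that need percent-encoding are not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the index.php URL for the given action
 * ("raw" or "render") and page title into dst.
 * Returns 0 on success, -1 if dst is too small. */
int wiki_url(char *dst, size_t dstsize, const char *action, const char *title)
{
    static const char base[] = "https://en.wikipedia.org/w/index.php";
    int len;
    char *p;

    assert(dst != NULL && dstsize > 0 && action != NULL && title != NULL);

    len = snprintf(dst, dstsize, "%s?action=%s&title=%s", base, action, title);
    if (len < 0 || (size_t)len >= dstsize)
        return -1;            /* output was truncated or failed */

    /* The title is the tail of the string: replace its spaces with '_'. */
    for (p = dst + (size_t)len - strlen(title); *p != '\0'; p++)
        if (*p == ' ')
            *p = '_';
    return 0;
}
```

snprintf is standard since C99; with an older compiler a length check followed by sprintf would do the same job.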
So far, I have produced an ebook from the wiki pages with this:
Code:
# loop over the glob directly instead of parsing ls output; -dump made explicit
for i in *php* ; do elinks -dump -no-numbering -no-references "$i" >> merge4.txt ; done
The result is still readable as an ebook, but it could be better.

It might be nice to have an index.html that links to the HTML or TXT pages.
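Such an index page needs nothing beyond stdio and string. A small sketch, assuming the page file names are already known and contain no characters that are special in HTML; write_index is a hypothetical name, and the link label is simply the file name without its extension.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes a bare index page with one link per file.
 * Returns the number of links written. */
size_t write_index(FILE *out, const char *const *files, size_t count)
{
    size_t i;

    assert(out != NULL && files != NULL);

    fputs("<html><body>\n<ul>\n", out);
    for (i = 0; i < count; i++) {
        /* Label = file name up to the last '.', or the whole name. */
        const char *dot = strrchr(files[i], '.');
        int label_len = dot ? (int)(dot - files[i]) : (int)strlen(files[i]);

        fprintf(out, "<li><a href=\"%s\">%.*s</a></li>\n",
                files[i], label_len, files[i]);
    }
    fputs("</ul>\n</body></html>\n", out);
    return count;
}
```

Passing a stream opened with fopen("index.html", "w") as out would produce the page next to the converted files.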

TXT is better because it loads faster on an ebook reader. Such a low-resource device takes time to render an HTML document, so TXT is much better.

Last edited by Xeratul; 07-29-2017 at 08:47 AM.
 
08-02-2017, 02:39 AM   #4
AwesomeMachine
LQ Guru
 
Registered: Jan 2005
Location: USA and Italy
Distribution: Debian testing/sid; OpenSuSE; Fedora; Mint
Posts: 5,524

Rep: Reputation: 1015
You can try Perl: http://search.cpan.org/dist/HTML-Strip/Strip.pm
 
08-02-2017, 02:48 AM   #5
Teufel
Member
 
Registered: Apr 2012
Distribution: Gentoo
Posts: 616

Rep: Reputation: 142
If PHP is available:
strip_tags
 
08-02-2017, 05:14 AM   #6
Xeratul
Senior Member
 
Registered: Jun 2006
Location: UNIX
Distribution: FreeBSD
Posts: 2,657

Original Poster
Rep: Reputation: 255
So far I have used a C programme, which is simple and efficient:
here
I am still experimenting, but this C programme looks reliable.
 
  

