05-25-2014, 03:08 AM   #1
slacker_
Is there a way to download an html/php book with wget or something?


As an example, I want to download this CentOS book. I found an old thread here that uses wget as follows:

Code:
wget -r -np http://www.tldp.org/LDP/abs/html/index.html
1) That works well enough, but I would like it to sort by chapters if at all possible,

and

2) It doesn't work for sites like the one linked above (the CentOS book); I think because it is PHP? I'm not sure.

Does anyone have a solution for this?
 
05-25-2014, 08:33 AM   #2
teckk
There are a dozen ways to do that.

Code:
lynx -dump -listonly http://www.techotopia.com/index.php/CentOS_6_Essentials
You can put the chapter links into a text file, for example:

getfile.txt
Code:
http://www.techotopia.com/index.php/Performing_a_CentOS_6_Network_Installation
http://www.techotopia.com/index.php/Installing_CentOS_6_with_Windows_in_a_Dual_Boot_Environment
http://www.techotopia.com/index.php/Allocating_a_Windows_Disk_Partition_to_CentOS_6
http://www.techotopia.com/index.php/Configuring_CentOS_6_GNOME_Screen_Resolution_and_Multiple_Monitors
Then, to get the pages:
Code:
wget -i getfile.txt
Then convert from HTML to PDF or whatever format you wish.
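
For example, one way to do the HTML-to-PDF step for a single chapter (just a sketch, and it assumes the wkhtmltopdf package is installed; it renders the page from its URL rather than from the copy wget saved):
Code:
# Render one chapter straight to a PDF file.
wkhtmltopdf http://www.techotopia.com/index.php/Performing_a_CentOS_6_Network_Installation chapter3.pdf
Any other HTML-to-PDF tool could stand in for wkhtmltopdf here.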

Or you could get a formatted .txt of a page with
Code:
lynx -dump http://www.techotopia.com/index.php/Performing_a_CentOS_6_Network_Installation > chapter3.txt
Or use your web browser's ability to print to .pdf for each page.

Or write a script to automate it.
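
For instance, a rough sketch of such a script, stringing the steps above together (the grep pattern used to pick out the chapter links is an assumption and may need tuning; it will also catch a few non-chapter links on that page):
Code:
#!/usr/bin/env bash
# Sketch: list the links on the book's contents page, keep the
# techotopia index.php ones, then dump each page to its own text
# file named after the last part of its URL.

toc="http://www.techotopia.com/index.php/CentOS_6_Essentials"

lynx -dump -listonly "$toc" \
    | grep -o 'http://www\.techotopia\.com/index\.php/[^ ]*' \
    | sort -u > getfile.txt

while read -r url; do
    lynx -dump "$url" > "$(basename "$url").txt"
done < getfile.txt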
 
05-25-2014, 12:24 PM   #3
slacker_ (Original Poster)
Brilliant!

I know I could go page by page and just "save as PDF", but that is way too slow. Will lynx -dump also pull each subsequent or linked-to page? If not, what is the best automated method to do that?

If I was going to make a getfile, is there also an automated method for grabbing the URL of each link on a page like the one linked in the OP? Each chapter and section is its own link, which would be quite a lot of copy/pasting.
 
05-25-2014, 03:18 PM   #4
teckk
A basic tutorial for you. Do each step and look at the results to get an understanding.

Code:
lynx -dump -listonly http://www.techotopia.com/index.php/CentOS_6_Essentials > dump.txt
Now take a look at dump.txt
Code:
cat dump.txt | head -n 280 > dump2.txt
Now look at dump2.txt
Code:
cat dump2.txt | tail -n 263 > dump3.txt
Now look at dump3.txt
Code:
# strip the first 6 characters of each line (the "  1. " style numbering lynx adds); sed is using a space as its delimiter here
cat dump3.txt | sed 's .\{6\}  ' > dump4.txt
Look at dump4.txt
Code:
cat dump4.txt | sort -u > dump5.txt
Look at dump5.txt
Code:
wget -U Mozilla/5.0 -i dump5.txt
You'll get about 38 pages.
If you want to convert them on the fly, write a script using wget, curl, lynx, html2pdf, html2text, etc. to do what you wish.
There are also apps like htmldoc.
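
As one on-the-fly example (a sketch; it assumes the html2text utility is installed):
Code:
# Fetch a chapter and convert it to plain text in one step,
# without keeping the HTML copy on disk.
curl -s -A Mozilla/5.0 http://www.techotopia.com/index.php/Performing_a_CentOS_6_Network_Installation |
    html2text > chapter3.txt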

Look at:
man cat
man wget
man head
man tail
man sort
man lynx
man sed

After understanding the above,
Code:
a=1; for i in $(cat dump5.txt); do lynx -dump $i > CentOS$a.txt; let a++; done
Which is the same as
Code:
#!/usr/bin/env bash

a=1
for i in $(cat dump5.txt); do
    lynx -dump "$i" > "CentOS$a.txt"
    let a++
done
That will convert those 38 pages to separate text files. If you want .pdf, .ps, .jpg, etc., then use something other than lynx.
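For example, the same loop with lynx swapped out for an HTML-to-PDF renderer (a sketch only; it assumes wkhtmltopdf is installed, and any similar tool could be used instead):
Code:
#!/usr/bin/env bash

# Same structure as the script above, but render each page to a PDF.
a=1
for i in $(cat dump5.txt); do
    wkhtmltopdf "$i" "CentOS$a.pdf"
    let a++
done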
Edit: Fixed wget syntax and added example script.

 
05-27-2014, 10:47 PM   #5
slacker_ (Original Poster)
Between dump2 and dump3, there is no difference in the files. Is that supposed to happen? Am I missing something?

I know the commands have different contexts, and I get that "head" or "tail" is supposed to be pulling info from list entries 280 and 263 respectively, but the files "dump2.txt" and "dump3.txt" are exactly the same...?

 
05-28-2014, 09:58 AM   #6
teckk
dump2.txt will have the first 280 lines.
dump3.txt will have the last 263 lines.

Those were examples of how to edit unwanted lines from a .txt file that you can then use with wget -i
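
A tiny worked example of the same idea, just to see what head and tail keep (any small sample file will do):
Code:
# Make a 5-line sample file, then trim it the same way.
printf '%s\n' one two three four five > sample.txt
head -n 4 sample.txt > first4.txt   # keeps lines 1-4: one, two, three, four
tail -n 2 first4.txt > last2.txt    # keeps the last 2 of those: three, four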
 
05-29-2014, 12:36 AM   #7
slacker_ (Original Poster)
Quote:
Originally Posted by teckk
dump2.txt will have the first 280 lines.
dump3.txt will have the last 263 lines.

Those were examples of how to edit unwanted lines from a .txt file that you can then use with wget -i
Ah, OK. When I compared them side by side, I started at the bottom and scrolled up. When I got into the 90s I thought, "Yeah, these are identical," but that was a bad assumption. Thanks.

So I went through the rest and, wow... very quick and streamlined, very nice! I will need to read up on these tools to better understand what's going on. Thanks!
 
05-29-2014, 02:02 AM   #8
TenTenths
lol, all that effort to save $10
 
05-29-2014, 02:26 AM   #9
slacker_ (Original Poster)
Quote:
Originally Posted by TenTenths
lol, all that effort to save $10
Close; all that effort to learn a new skill. Plus, ten bucks is a lot when you only have ten cents to your name.
 
05-29-2014, 02:29 AM   #10
TenTenths
Quote:
Originally Posted by slacker_
Close; all that effort to learn a new skill. Plus, ten bucks is a lot when you only have ten cents to your name.
True! $10 is 1.5 beers here!
 
06-02-2014, 01:45 PM   #11
slacker_ (Original Poster)
Quote:
Originally Posted by TenTenths
True! $10 is 1.5 beers here!
About 1.2 beers where I'm at. It's crazy!
 
  

