LinuxQuestions.org - Linux - Newbie
Is there a way to download an html/php book with wget or something? (https://www.linuxquestions.org/questions/linux-newbie-8/is-there-a-way-to-download-an-html-php-book-with-wget-or-something-4175506025/)

slacker_ 05-25-2014 03:08 AM

Is there a way to download an html/php book with wget or something?
 
As an example, I want to try to download this CentOS book. I found an old thread here that uses wget as follows:

Code:

wget -r -np http://www.tldp.org/LDP/abs/html/index.html
1) that works well enough, but I would like it to sort by chapters if at all possible

and

2) it doesn't work for sites like the one linked above (the CentOS book); I think because it is PHP? I'm not sure.

Does anyone have a solution for this?

teckk 05-25-2014 08:33 AM

There are a dozen ways to do that.

Code:

lynx -dump -listonly http://www.techotopia.com/index.php/CentOS_6_Essentials
You can put the chapter links into a text file, for example:

getfile.txt
Code:

http://www.techotopia.com/index.php/Performing_a_CentOS_6_Network_Installation
http://www.techotopia.com/index.php/Installing_CentOS_6_with_Windows_in_a_Dual_Boot_Environment
http://www.techotopia.com/index.php/Allocating_a_Windows_Disk_Partition_to_CentOS_6
http://www.techotopia.com/index.php/Configuring_CentOS_6_GNOME_Screen_Resolution_and_Multiple_Monitors

Then, to get the pages:
Code:

wget -i getfile.txt
Then convert the HTML to PDF, or to whatever format you wish.

Or you could get a formatted .txt of a page with
Code:

lynx -dump http://www.techotopia.com/index.php/Performing_a_CentOS_6_Network_Installation > chapter3.txt
Or use your web browser's ability to print each page to .pdf.

Or write a script to automate it.
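A minimal sketch of such a script (untested; it assumes getfile.txt holds one URL per line, with no spaces in the URLs):

Code:

#!/usr/bin/env bash
# Sketch: render every URL listed in getfile.txt to a numbered text file.
n=1
while read -r url; do
    lynx -dump "$url" > "chapter$n.txt"   # fetch one page as plain text
    n=$((n+1))
done < getfile.txt

Swap the lynx call for another converter if you want .pdf instead of .txt.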

slacker_ 05-25-2014 12:24 PM

Brilliant!

I know I could go page by page and just "save as pdf", but that is wayyy too slow. Is lynx -dump a method that will pull each subsequent or linked-to page? If not, then what is the best automated method to do that?

If I was going to make a getfile, is there also an automated method for grabbing the URL of each link on a page like the one linked in the OP? Each chapter and section is its own link, which can mean quite a lot of copy/pasting.

teckk 05-25-2014 03:18 PM

A basic tutorial for you. Do each step and look at the results to get an understanding.

Code:

lynx -dump -listonly http://www.techotopia.com/index.php/CentOS_6_Essentials > dump.txt
Now take a look at dump.txt
Code:

cat dump.txt | head -n 280 > dump2.txt   # keep only the first 280 lines
Now look at dump2.txt
Code:

cat dump2.txt | tail -n 263 > dump3.txt   # keep the last 263 lines (i.e. lines 18-280 of dump.txt)
Now look at dump3.txt
Code:

cat dump3.txt | sed  's .\{6\}  ' > dump4.txt   # strip the first 6 characters of each line (lynx's leading numbering)
Look at dump4.txt
Code:

cat dump4.txt | sort -u > dump5.txt   # sort the URLs and drop duplicates
Look at dump5.txt
Code:

wget -U Mozilla/5.0 -i dump5.txt   # fetch every URL in the list, sending a browser User-Agent
You'll get about 38 pages.
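Once each step makes sense, the whole edit collapses into one pipeline (same result, assuming the chapter links sit on lines 18-280 of the dump as above):

Code:

lynx -dump -listonly http://www.techotopia.com/index.php/CentOS_6_Essentials \
    | sed -n '18,280p' | cut -c7- | sort -u > dump5.txt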
If you want to convert them on the fly, write a script using wget, curl, lynx, html2pdf, html2text, etc. to do what you wish.
There are also apps like htmldoc.
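For example, htmldoc can bind a set of saved pages into a single PDF (a sketch; it assumes the downloaded pages were saved with an .html extension):

Code:

htmldoc --webpage -f CentOS_6_Essentials.pdf *.html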

Look at:
man cat
man wget
man head
man tail
man sort
man lynx
man sed

After understanding the above,
Code:

a=1; for i in $(cat dump5.txt); do lynx -dump "$i" > "CentOS$a.txt"; let a++; done
Which is the same as
Code:

#!/usr/bin/env bash

# Render every URL in dump5.txt to a numbered text file.
a=1
for i in $(cat dump5.txt); do
    lynx -dump "$i" > "CentOS$a.txt"   # fetch one page and convert it to text
    let a++                            # bump the file number
done

This will convert those 38 pages to separate text files. If you want .pdf, .ps, .jpg, etc., then use something other than lynx.
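For example, here is a variant of the same loop (a sketch; it assumes htmldoc is installed) that saves each chapter as a .pdf instead:

Code:

a=1
for i in $(cat dump5.txt); do
    wget -q -U Mozilla/5.0 -O "page$a.html" "$i"        # fetch one page
    htmldoc --webpage -f "CentOS$a.pdf" "page$a.html"   # convert it to PDF
    let a++
done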
Edit: Fixed wget syntax and added example script.

slacker_ 05-27-2014 10:47 PM

Between dump2 and dump3, there is no difference in the files. Is that supposed to happen? Am I missing something?

I know the commands have different context, and I get that "head" or "tail" is supposed to be pulling info from list entries 280 and 263 respectively, but the files "dump2.txt" and "dump3.txt" are exactly the same...?

teckk 05-28-2014 09:58 AM

dump2.txt will have the first 280 lines.
dump3.txt will have the last 263 lines.

Those were examples of how to edit unwanted lines out of a .txt file that you can then use with wget -i.
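You can see the overlap with a quick test, using numbered lines as a stand-in for the real dump:

Code:

seq 297 > dump.txt                  # stand-in file with 297 numbered lines
head -n 280 dump.txt > dump2.txt    # lines 1-280
tail -n 263 dump2.txt > dump3.txt   # lines 18-280
diff dump2.txt dump3.txt            # only the first 17 lines differ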

slacker_ 05-29-2014 12:36 AM

Quote:

Originally Posted by teckk (Post 5178112)
dump2.txt will have the first 280 lines.
dump3.txt will have the last 263 lines.

Those were examples of how to edit unwanted lines out of a .txt file that you can then use with wget -i.

Ah, OK. When I compared them side by side, I started at the bottom and scrolled up. When I got into the 90s I was like "yeah, these are identical", but that was a bad assumption. Thanks.

So I went through the rest and, wow... very quick and streamlined, very nice! I will need to read up on these tools to better understand what's going on. Thanks!

TenTenths 05-29-2014 02:02 AM

lol, all that effort to save $10 :)

slacker_ 05-29-2014 02:26 AM

Quote:

Originally Posted by TenTenths (Post 5178524)
lol, all that effort to save $10 :)

Close; all that effort to learn a new skill. Plus, ten bucks is a lot when you only have ten cents to your name.

TenTenths 05-29-2014 02:29 AM

Quote:

Originally Posted by slacker_ (Post 5178531)
Close; all that effort to learn a new skill. Plus, ten bucks is a lot when you only have ten cents to your name.

True! $10 is 1.5 beers here! :)

slacker_ 06-02-2014 01:45 PM

Quote:

Originally Posted by TenTenths (Post 5178532)
True! $10 is 1.5 beers here! :)

About 1.2 beers where I'm at. It's crazy!

