LinuxQuestions.org - Linux - Newbie
Is there a way to download an html/php book with wget or something? (https://www.linuxquestions.org/questions/linux-newbie-8/is-there-a-way-to-download-an-html-php-book-with-wget-or-something-4175506025/)

slacker_ 05-25-2014 03:08 AM

Is there a way to download an html/php book with wget or something?
 
As an example, I want to try to download this CentOS book. I found an old thread here that uses wget as follows:

Code:

wget -r -np http://www.tldp.org/LDP/abs/html/index.html
1) that works well enough, but I would like it to sort by chapters if at all possible

and

2) it doesn't work for sites like the one linked above (the CentOS book); I think because it is PHP? I'm not sure.

Does anyone have a solution for this?

teckk 05-25-2014 08:33 AM

There are a dozen ways to do that.

Code:

lynx -dump -listonly http://www.techotopia.com/index.php/CentOS_6_Essentials
You can put the chapter links into a text file, for example:

getfile.txt
Code:

http://www.techotopia.com/index.php/Performing_a_CentOS_6_Network_Installation
http://www.techotopia.com/index.php/Installing_CentOS_6_with_Windows_in_a_Dual_Boot_Environment
http://www.techotopia.com/index.php/Allocating_a_Windows_Disk_Partition_to_CentOS_6
http://www.techotopia.com/index.php/Configuring_CentOS_6_GNOME_Screen_Resolution_and_Multiple_Monitors

Then, to get the pages:
Code:

wget -i getfile.txt
Then convert the HTML to PDF, or to whatever format you wish.

Or you could get a formatted .txt of a page with
Code:

lynx -dump http://www.techotopia.com/index.php/Performing_a_CentOS_6_Network_Installation > chapter3.txt
Or use your web browser's ability to print each page to .pdf.

Or write a script to automate it.
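A minimal sketch of such a script (untested; it assumes getfile.txt holds one URL per line, with no spaces in the URLs):

Code:

#!/usr/bin/env bash
# Sketch: render every URL listed in getfile.txt to a numbered text file.
n=1
while read -r url; do
    lynx -dump "$url" > "chapter$n.txt"   # fetch one page as plain text
    n=$((n+1))
done < getfile.txt

Swap the lynx call for another converter if you want .pdf instead of .txt.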

slacker_ 05-25-2014 12:24 PM

Brilliant!

I know I could go page by page and just "save as pdf", but that is wayyy too slow. Is lynx -dump a method that will pull each subsequent or linked-to page? If not, then what is the best automated method to do that?

If I was going to make a getfile, is there also an automated method for grabbing the URL of each link on a page like the one linked in the OP? Each chapter and section is its own link, which can mean quite a lot of copy/pasting.

teckk 05-25-2014 03:18 PM

A basic tutorial for you. Do each step and look at the results to get an understanding.

Code:

lynx -dump -listonly http://www.techotopia.com/index.php/CentOS_6_Essentials > dump.txt
Now take a look at dump.txt
Code:

cat dump.txt | head -n 280 > dump2.txt   # keep only the first 280 lines
Now look at dump2.txt
Code:

cat dump2.txt | tail -n 263 > dump3.txt   # keep the last 263 lines (i.e. lines 18-280 of dump.txt)
Now look at dump3.txt
Code:

cat dump3.txt | sed  's .\{6\}  ' > dump4.txt   # strip the first 6 characters of each line (lynx's leading numbering)
Look at dump4.txt
Code:

cat dump4.txt | sort -u > dump5.txt   # sort the URLs and drop duplicates
Look at dump5.txt
Code:

wget -U Mozilla/5.0 -i dump5.txt   # fetch every URL in the list, sending a browser User-Agent
You'll get about 38 pages.
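Once each step makes sense, the whole edit collapses into one pipeline (same result, assuming the chapter links sit on lines 18-280 of the dump as above):

Code:

lynx -dump -listonly http://www.techotopia.com/index.php/CentOS_6_Essentials \
    | sed -n '18,280p' | cut -c7- | sort -u > dump5.txt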
If you want to convert them on the fly, write a script using wget, curl, lynx, html2pdf, html2text, etc. to do what you wish.
There are also apps like htmldoc.
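For example, htmldoc can bind a set of saved pages into a single PDF (a sketch; it assumes the downloaded pages were saved with an .html extension):

Code:

htmldoc --webpage -f CentOS_6_Essentials.pdf *.html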

Look at:
man cat
man wget
man head
man tail
man sort
man lynx
man sed

After understanding the above,
Code:

a=1; for i in $(cat dump5.txt); do lynx -dump "$i" > "CentOS$a.txt"; let a++; done
Which is the same as
Code:

#!/usr/bin/env bash

# Render every URL in dump5.txt to a numbered text file.
a=1
for i in $(cat dump5.txt); do
    lynx -dump "$i" > "CentOS$a.txt"   # fetch one page and convert it to text
    let a++                            # bump the file number
done

This will convert those 38 pages to separate text files. If you want .pdf, .ps, .jpg, etc., then use something other than lynx.
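For example, here is a variant of the same loop (a sketch; it assumes htmldoc is installed) that saves each chapter as a .pdf instead:

Code:

a=1
for i in $(cat dump5.txt); do
    wget -q -U Mozilla/5.0 -O "page$a.html" "$i"        # fetch one page
    htmldoc --webpage -f "CentOS$a.pdf" "page$a.html"   # convert it to PDF
    let a++
done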
Edit: Fixed wget syntax and added example script.

slacker_ 05-27-2014 10:47 PM

Between dump2 and dump3, there is no difference in the files. Is that supposed to happen? Am I missing something?

I know the commands have different context, and I get that "head" or "tail" is supposed to be pulling info from list entries 280 and 263 respectively, but the files "dump2.txt" and "dump3.txt" are exactly the same...?

teckk 05-28-2014 09:58 AM

dump2.txt will have the first 280 lines.
dump3.txt will have the last 263 lines.

Those were examples of how to edit unwanted lines out of a .txt file that you can then use with wget -i.
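You can see the overlap with a quick test, using numbered lines as a stand-in for the real dump:

Code:

seq 297 > dump.txt                  # stand-in file with 297 numbered lines
head -n 280 dump.txt > dump2.txt    # lines 1-280
tail -n 263 dump2.txt > dump3.txt   # lines 18-280
diff dump2.txt dump3.txt            # only the first 17 lines differ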

slacker_ 05-29-2014 12:36 AM

Quote:

Originally Posted by teckk (Post 5178112)
dump2.txt will have the first 280 lines.
dump3.txt will have the last 263 lines.

Those were examples of how to edit unwanted lines out of a .txt file that you can then use with wget -i.

Ah, OK. When I compared them side by side, I started at the bottom and scrolled up. When I got into the 90s I was like "yeah, these are identical", but that was a bad assumption. Thanks.

So I went through the rest and, wow... very quick and streamlined, very nice! I will need to read up on these tools to better understand what's going on. Thanks!

TenTenths 05-29-2014 02:02 AM

lol, all that effort to save $10 :)

slacker_ 05-29-2014 02:26 AM

Quote:

Originally Posted by TenTenths (Post 5178524)
lol, all that effort to save $10 :)

Close; all that effort to learn a new skill. Plus, ten bucks is a lot when you only have ten cents to your name.

TenTenths 05-29-2014 02:29 AM

Quote:

Originally Posted by slacker_ (Post 5178531)
Close; all that effort to learn a new skill. Plus, ten bucks is a lot when you only have ten cents to your name.

True! $10 is 1.5 beers here! :)

slacker_ 06-02-2014 01:45 PM

Quote:

Originally Posted by TenTenths (Post 5178532)
True! $10 is 1.5 beers here! :)

About 1.2 beers where I'm at. It's crazy!

