LinuxQuestions.org (/questions/)
-   General (http://www.linuxquestions.org/questions/general-10/)
-   How to download a whole site? (http://www.linuxquestions.org/questions/general-10/how-to-download-a-whole-site-333197/)

Hosiah 06-13-2005 03:43 PM

How to download a whole site?
 
I have multiple computers in the household, only one of which gets the internet. What I need to do is be able to take a whole site (usually an FAQ, a guide, a free online book, etc.) and mirror the whole thing on my hard drive, where I can tarball and gzip it, copy it to floppy, and move it to the machine where I'll be needing it on hand. Preferably quickly, because everybody in the house uses the internet machine.
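(For reference, the packaging step might look something like the following; the directory name "slackware-book" and the part prefix are just placeholders, and the 1440k split size assumes standard 1.44 MB floppies.)

# tarball and gzip the mirrored directory
tar -czf slackware-book.tar.gz slackware-book/
# optionally split the archive into floppy-sized chunks
split -b 1440k slackware-book.tar.gz book-part.
# on the target machine, reassemble and unpack
cat book-part.* > slackware-book.tar.gz
tar -xzf slackware-book.tar.gz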

Now, I tried using the "getlinks" script from Wicked Cool Shell Scripts (here: http://www.intuitive.com/wicked/), combined with a script I wrote (posted at my blog: http://hackersnest.modblog.com/?show...blog_id=615755), and it actually worked for a couple of sites. Unfortunately, the internet features hundreds of different site-indexing methods, each incompatible with this method in its own unique way, and I'm constantly rewriting this script to deal with each site's quirks. It seems like every time I find a more general-purpose solution, three more exceptions turn up which break it!
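(Roughly, this is the kind of pipeline being described; the getlinks script itself isn't reproduced in the thread, so lynx's own -dump -listonly output stands in for it here, and the URL is only an example.)

# dump the numbered link list from a page, pull out the URLs, and fetch each one
lynx -dump -listonly http://www.slackware.com/book/ \
  | awk '/^ *[0-9]+\. http/ {print $2}' \
  | while read url; do wget --no-clobber "$url"; done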

Now, I have http://www.slackware.com/book/ , which uses some kind of scheme so that even the lynx -source | getlinks script combo doesn't work. Has anybody ever found an all-purpose, one-shot tool for Linux to do this?

My distros? I use Red Hat 9.0, Slackware 10.1, Debian 3.1 (barely), D*mn Small Linux version-I-forget, Knoppix Live CD 3.7 and Mepis Zeddy. The one with the internet connection is Red Hat/dual booted with Lose^H^H^H^H Win98.

PS: I don't care about getting pictures/whistles/bells/etc. Just the plain ol' text would be fine.

PPS edit: I got lucky and found the .org site's link to download the tarball, so that specific case is solved... but I still need the general-case solution!

trickykid 06-13-2005 03:58 PM

You can use wget to mirror a site; it will basically pull down everything from it.

man wget
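(For the archives, a minimal mirroring command looks something like the following; the URL is the Slackware book from the first post, and the extra flags are optional niceties taken from the wget man page.)

# recursive mirror that stays below the starting directory,
# rewrites links for offline browsing, and waits 2 seconds between requests
wget --mirror --no-parent --convert-links --wait=2 http://www.slackware.com/book/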

AlexV 06-13-2005 06:57 PM

But be careful! Some admins may not take kindly to someone ripping their entire site! It eats up a lot of their bandwidth and other resources.

You have been warned! ;)

Hosiah 06-13-2005 08:12 PM

hey! wget is working! (-:

I guess the old saying is true: those who know not of wget are doomed to reinvent it - poorly!

Quote:

But be careful! Some admins may not take kindly to someone ripping their entire site! It eats up a lot of their bandwidth and other resources.
*nervous gulp* Uh... (as the script runs on another desktop) ...isn't it sufficient to use the "-w" option to wait a set number of seconds between requests so the site isn't slammed?
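(For reference, the throttling options under discussion look roughly like this; the particular values are arbitrary and assume a reasonably recent wget.)

# wait 5 seconds between retrievals, vary the wait randomly,
# and cap the transfer rate at 20 KB/s
wget --mirror --no-parent --wait=5 --random-wait --limit-rate=20k http://www.slackware.com/book/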

AlexV 06-13-2005 08:39 PM

Quote:

Originally posted by Hosiah

*nervous gulp* Uh... (as the script runs on another desktop) ...isn't it sufficient to use the "-w" option to wait a set number of seconds between requests so the site isn't slammed?

I wouldn't worry about it too much. I've downloaded many sites without complaint from their owners. Some specifically ask visitors not to, in which case you should (of course) not do it.

For instance, if some of our fellow members decided that it would be cool to try downloading LQ, Jeremy probably would not be amused ;)
(and you would almost certainly run out of disk space)

trickykid 06-14-2005 07:47 AM

Quote:

Originally posted by AlexV
For instance, if some of our fellow members decided that it would be cool to try downloading LQ, Jeremy probably would not be amused ;)
(and you would almost certainly run out of disk space)

Run out of disk space? Pah... The last time Jeremy checked and told us the size of the database where 90% of the content here is stored, it was only a few hundred megs... it'll be a while before you'd run out of space if you tried it... perhaps in another 10 years... ;)

bulliver 06-14-2005 11:01 AM

Quote:

But be careful! Some admins may not take kindly to someone ripping their entire site! It eats up a lot of their bandwidth and other resources.
So use --user-agent="googlebot" and blame it on google :)
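(For completeness, that is wget's --user-agent option, which just changes the client string reported to the server; whether spoofing it is wise is another question.)

# report a different client name in the server's logs
wget --mirror --user-agent="googlebot" http://www.slackware.com/book/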

lowpingnoob 06-15-2005 02:03 AM

Quote:

Originally posted by bulliver
So use --user-agent="googlebot" and blame it on google :)
They could trace it... If my site was brought down by a loser, I would go to extreme lengths to track them down, find their physical address, and beat the bejeesus out of them. I hosted an FTP on my computer at one point, and someone downloaded a 2 GB install straight from my computer. Slowed everything down to a crawl for... a whole day. Traced his IP, and let's just say his network wasn't having any luck finding bandwidth to connect to the internet for a week or so. Mean, but I think what he did was intentional, so oh well.

