06-13-2005, 03:43 PM
|
#1
|
|
Member
Registered: Sep 2004
Location: Des Moines, Iowa
Distribution: Slackware, Mandriva, Debian derivatives, +BSD/ Solaris/ Minix/ plan9/ GNU/HURD...
Posts: 185
Rep:
|
How to download a whole site?
I have multiple computers in the household, only one of which gets the internet. What I need to do is be able to take a whole site (usually an FAQ, a guide, a free online book, etc.), mirror the whole thing on my hard drive, where I can tarball and gzip it, copy it to floppy, and move it to the machine where I'll be needing it on hand. Preferably quickly, because everybody in the house uses the internet machine.
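The tarball, gzip, and floppy steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the directory name mirror, the sample file, and the site.tar.gz. prefix are made up for the example, and the 1400k piece size assumes a standard 1.44 MB floppy and an archive too large for a single disk.

```shell
#!/bin/sh
set -e
# Hypothetical layout: "mirror" stands in for the downloaded site directory.
work=$(mktemp -d)
cd "$work"
mkdir -p mirror/book
printf 'chapter one\n' > mirror/book/ch1.html

# Pack and compress, then cut the archive into floppy-sized pieces
# named site.tar.gz.aa, site.tar.gz.ab, and so on.
tar czf - mirror | split -b 1400k - site.tar.gz.

# On the target machine: join the pieces in order and unpack them.
mkdir restored
cat site.tar.gz.* | tar xzf - -C restored

# Confirm the restored file is byte-for-byte identical to the original.
cmp mirror/book/ch1.html restored/mirror/book/ch1.html && echo "round trip ok"
```

Each piece goes on its own floppy; the shell glob sorts the suffixes alphabetically, so cat reassembles them in the right order.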
Now, I tried using the "getlinks" script from Wicked Cool Shell Scripts (here: http://www.intuitive.com/wicked/ ), combined with a script I wrote (posted at my blog: http://hackersnest.modblog.com/?show...blog_id=615755 ), and it actually worked for a couple of sites. Unfortunately, the entire internet features hundreds of different site-indexing methods, each incompatible with this method in its own unique way, and I'm constantly rewriting this script to deal with each site's quirks. It seems like every time I find a more general-purpose solution, three more exceptions are discovered which break it!
Now, I have http://www.slackware.com/book/ , which uses some kind of scheme so that even the lynx -source | getlinks script combo doesn't work. Has anybody ever found an all-purpose, one-shot tool for Linux to do this?
My distros? I use Red Hat 9.0, Slackware 10.1, Debian 3.1 (barely), D*mn Small Linux version-I-forget, Knoppix Live CD 3.7 and Mepis Zeddy. The one with the internet connection is Red Hat/dual booted with Lose^H^H^H^H Win98.
PS I don't care about getting pictures/whistles/bells/etc. Just the plain ol' text would be fine.
PPS edit: I got lucky and found the .org site's link to download the tarball, so the specific case is over...but I still need the general case solution!
Last edited by Hosiah; 06-13-2005 at 03:57 PM.
|
|
|
|
06-13-2005, 03:58 PM
|
#2
|
|
Guru
Registered: Jan 2001
Posts: 24,128
Rep: 
|
You can use wget to mirror a site, basically pulling down everything from it.
man wget
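As a rough sketch of what such a mirror command might look like for the text-only case asked about in this thread: mirror_cmd below is a hypothetical helper, not part of wget, and it only prints the command so it can be inspected before anything is downloaded. The choice of flags and the list of rejected image extensions are assumptions, not something the posters specified.

```shell
#!/bin/sh
# mirror_cmd is a hypothetical helper: it prints a wget command line
# for mirroring one URL, without running it.
mirror_cmd() {
    url=$1
    # --mirror         recursive download with timestamping, unlimited depth
    # --no-parent      never climb above the starting directory
    # --convert-links  rewrite links so pages work from the local copy
    # --reject         skip common image types, since only the text is wanted
    printf "wget --mirror --no-parent --convert-links --reject=jpg,jpeg,png,gif '%s'\n" "$url"
}

mirror_cmd http://www.slackware.com/book/
```

Once the printed command looks right, it can be pasted into a shell or piped to sh. The simple quoting here assumes the URL itself contains no single quotes.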
|
|
|
|
06-13-2005, 06:57 PM
|
#3
|
|
Member
Registered: May 2004
Location: New Lenox, IL
Distribution: Fedora Core 4; Ubuntu 5.10 (Breezy Preview); CentOS 4
Posts: 81
Rep:
|
But be careful! Some admins may not take kindly to someone ripping their entire site! It eats up a lot of their bandwidth and other resources.
You have been warned! 
|
|
|
|
06-13-2005, 08:12 PM
|
#4
|
|
Member
Registered: Sep 2004
Location: Des Moines, Iowa
Distribution: Slackware, Mandriva, Debian derivatives, +BSD/ Solaris/ Minix/ plan9/ GNU/HURD...
Posts: 185
Original Poster
Rep:
|
hey! wget is working! (-:
I guess the old saying is true: those who know not of wget are doomed to reinvent it - poorly!
Quote:
|
But be careful! Some admins may not take kindly to someone ripping their entire site! It eats up a lot of their bandwidth and other resources.
|
*nervous gulp* -uh... (as the script runs in another desktop) ...isn't it sufficient to use the "-w" option to wait the integer amount of time so the site isn't slammed?
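For reference, "-w" is the short form of wget's --wait option, and it can be combined with --random-wait and --limit-rate to spread the load further. The sketch below assumes illustrative defaults (2 seconds, 20k) that are not taken from this thread, and polite_mirror_cmd is a hypothetical helper that only prints the command rather than running it.

```shell
#!/bin/sh
# polite_mirror_cmd is a hypothetical helper: it prints a throttled wget
# command line without running it.
#   $1 = URL, $2 = seconds between requests, $3 = bandwidth cap
polite_mirror_cmd() {
    url=$1
    wait_secs=${2:-2}   # assumed default pause between requests
    rate=${3:-20k}      # assumed default download cap (bytes/s, k suffix)
    # --wait         pause this many seconds between retrievals
    # --random-wait  vary each pause around the --wait value
    # --limit-rate   cap the download speed
    printf "wget --mirror --no-parent --wait=%s --random-wait --limit-rate=%s '%s'\n" \
        "$wait_secs" "$rate" "$url"
}

polite_mirror_cmd http://www.slackware.com/book/ 5 50k
```

Whether this is sufficient depends on the site; a longer wait and a lower rate cap make the mirror slower but gentler on the server.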
|
|
|
|
06-13-2005, 08:39 PM
|
#5
|
|
Member
Registered: May 2004
Location: New Lenox, IL
Distribution: Fedora Core 4; Ubuntu 5.10 (Breezy Preview); CentOS 4
Posts: 81
Rep:
|
Quote:
Originally posted by Hosiah
*nervous gulp* -uh... (as the script runs in another desktop) ...isn't it sufficient to use the "-w" option to wait the integer amount of time so the site isn't slammed?
|
I wouldn't worry about it too much. I've downloaded many sites without complaint from their owners. Some specifically ask visitors not to, in which case you should (of course) not do it.
For instance, if some of our fellow members decided that it would be cool to try downloading LQ, Jeremy probably would not be amused
(and you would almost certainly run out of disk space)
Last edited by AlexV; 06-13-2005 at 08:40 PM.
|
|
|
|
06-14-2005, 07:47 AM
|
#6
|
|
Guru
Registered: Jan 2001
Posts: 24,128
Rep: 
|
Quote:
Originally posted by AlexV
For instance, if some of our fellow members decided that it would be cool to try downloading LQ, Jeremy probably would not be amused
(and you would almost certainly run out of disk space)
|
Run out of disk space? Pah. I think the last time Jeremy checked and told us the size of the database where 90% of the content here is stored, it was only a few hundred megs. It'll be a while before you'd run out of space if you tried it... perhaps in another 10 years.
|
|
|
|
06-14-2005, 11:01 AM
|
#7
|
|
Senior Member
Registered: Nov 2002
Location: Edmonton AB, Canada
Distribution: Gentoo x86; Gentoo PPC; Gentoo Sparc64; FreeBSD; OS X; Solaris
Posts: 3,731
Rep:
|
Quote:
|
But be careful! Some admins may not take kindly to someone ripping their entire site! It eats up a lot of their bandwidth and other resources.
|
So use --user-agent="googlebot" and blame it on google 
|
|
|
|
06-15-2005, 02:03 AM
|
#8
|
|
Member
Registered: Jun 2005
Distribution: Fedora Core 3, soon DSL (DSL backwards is LSD hahahaha)
Posts: 245
Rep:
|
Quote:
Originally posted by bulliver
So use --user-agent="googlebot" and blame it on google
|
They could trace it... if my site was brought down by a loser, I would go to extreme lengths to track them down, find their physical address, and beat the bejeesus out of them. I hosted an ftp on my computer at one point, and someone installed a (2 GB install) straight from my computer. Slowed me down to a crawl in everything for... 1 whole day. Traced his IP, and let's just say his network wasn't having any luck finding bandwidth to connect to the internet for a week or so. Mean, but I think it was intentional what he did, oh well.
|
|
|
|
All times are GMT -5.