Old 06-13-2005, 04:43 PM   #1
Hosiah
Member
 
Registered: Sep 2004
Location: Des Moines, Iowa
Distribution: Slackware, Mandriva, Debian derivatives, +BSD/ Solaris/ Minix/ plan9/ GNU/HURD...
Posts: 185

Rep: Reputation: 30
How to download a whole site?


I have multiple computers in the household, only one of which gets the internet. What I need is a way to take a whole site (usually an FAQ, a guide, a free online book, etc.), mirror the whole thing on my hard drive, tar and gzip it, copy it onto floppies, and move it to the machine where I'll need it on hand. Preferably quickly, because everybody in the house uses the internet machine.
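
The sneakernet half of it I've already got down to roughly the following (the directory and file names here are just placeholders); it's the mirroring half I'm stuck on:

    # the mirrored site sits in, say, ./site/
    tar czf site.tar.gz site/
    # split the tarball into floppy-sized pieces (1.44 MB disks)
    split -b 1400k site.tar.gz site.part.
    # copy site.part.* to floppies, then on the other machine:
    cat site.part.* > site.tar.gz
    tar xzf site.tar.gz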

Now, I tried using the "getlinks" script from Wicked Cool Shell Scripts (here: http://www.intuitive.com/wicked/ ), combined with a script I wrote (posted at my blog: http://hackersnest.modblog.com/?show...blog_id=615755 ), and it actually worked for a couple of sites. Unfortunately, the internet features hundreds of different site-indexing methods, each incompatible with this approach in its own unique way, and I'm constantly rewriting the script to deal with each site's quirks. It seems like every time I find a more general-purpose solution, three more exceptions turn up that break it!

Now I have http://www.slackware.com/book/ , which uses some kind of scheme that defeats even the lynx -source | getlinks combo. Has anybody ever found an all-purpose, one-shot tool for Linux to do this?

My distros? I use Red Hat 9.0, Slackware 10.1, Debian 3.1 (barely), D*mn Small Linux version-I-forget, Knoppix Live CD 3.7, and Mepis Zeddy. The one with the internet connection is the Red Hat box, dual-booted with Lose^H^H^H^H Win98.

PS: I don't care about getting pictures/whistles/bells/etc. Just the plain ol' text would be fine.

PPS edit: I got lucky and found the .org site's link to download the tarball, so the specific case is solved... but I still need the general-case solution!

Last edited by Hosiah; 06-13-2005 at 04:57 PM.
 
Old 06-13-2005, 04:58 PM   #2
trickykid
Guru
 
Registered: Jan 2001
Posts: 24,133

Rep: Reputation: 199Reputation: 199
You can use wget to mirror a site; it will basically pull down everything from it.

man wget
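
Something along these lines should do it (untested off the top of my head, so check the man page on your version; the URL is just the one from your post):

    # recursive mirror, rewrite links for offline browsing, don't climb above the start directory
    wget --mirror --convert-links --no-parent http://www.slackware.com/book/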
 
Old 06-13-2005, 07:57 PM   #3
AlexV
Member
 
Registered: May 2004
Location: New Lenox, IL
Distribution: Fedora Core 4; Ubuntu 5.10 (Breezy Preview); CentOS 4
Posts: 81

Rep: Reputation: 15
But be careful! Some admins may not take kindly to someone ripping their entire site! It eats up a lot of their bandwidth and other resources.

You have been warned!
 
Old 06-13-2005, 09:12 PM   #4
Hosiah
Member
 
Registered: Sep 2004
Location: Des Moines, Iowa
Distribution: Slackware, Mandriva, Debian derivatives, +BSD/ Solaris/ Minix/ plan9/ GNU/HURD...
Posts: 185

Original Poster
Rep: Reputation: 30
hey! wget is working! (-:

I guess the old saying is true: those who know not of wget are doomed to reinvent it - poorly!

Quote:
But be careful! Some admins may not take kindly to someone ripping their entire site! It eats up a lot of their bandwidth and other resources.
*nervous gulp* -uh... (as the script runs on another desktop) ...isn't it sufficient to use the "-w" option to wait a given number of seconds between requests, so the site isn't slammed?
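
What I'm actually running is more or less this (the two-second wait and the rate cap are just my guesses at what counts as polite):

    # -w 2 waits 2 seconds between retrievals, --limit-rate caps the bandwidth used
    wget --mirror --convert-links --no-parent -w 2 --limit-rate=50k http://www.slackware.com/book/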
 
Old 06-13-2005, 09:39 PM   #5
AlexV
Member
 
Registered: May 2004
Location: New Lenox, IL
Distribution: Fedora Core 4; Ubuntu 5.10 (Breezy Preview); CentOS 4
Posts: 81

Rep: Reputation: 15
Quote:
Originally posted by Hosiah

*nervous gulp* -uh... (as the script runs on another desktop) ...isn't it sufficient to use the "-w" option to wait a given number of seconds between requests, so the site isn't slammed?
I wouldn't worry about it too much. I've downloaded many sites without complaint from their owners. Some sites specifically ask visitors not to, in which case you should (of course) not do it.

For instance, if some of our fellow members decided that it would be cool to try downloading LQ, Jeremy probably would not be amused
(and you would almost certainly run out of disk space)

Last edited by AlexV; 06-13-2005 at 09:40 PM.
 
Old 06-14-2005, 08:47 AM   #6
trickykid
Guru
 
Registered: Jan 2001
Posts: 24,133

Rep: Reputation: 199Reputation: 199
Quote:
Originally posted by AlexV
For instance, if some of our fellow members decided that it would be cool to try downloading LQ, Jeremy probably would not be amused
(and you would almost certainly run out of disk space)
Run out of disk space? Pah.. the last time Jeremy checked and told us the size of the database where 90% of the content here is stored, it was only a few hundred megs.. it'll be a while before you'd run out of space if you tried it.. perhaps in another 10 years..
 
Old 06-14-2005, 12:01 PM   #7
bulliver
Senior Member
 
Registered: Nov 2002
Location: Edmonton AB, Canada
Distribution: Gentoo x86_64; Gentoo PPC; FreeBSD; OS X 10.9.4
Posts: 3,760
Blog Entries: 4

Rep: Reputation: 77
Quote:
But be careful! Some admins may not take kindly to someone ripping their entire site! It eats up a lot of their bandwidth and other resources.
So use --user-agent="googlebot" and blame it on google
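
Something like this (it only changes the User-Agent header wget sends, so it's cosmetic rather than any real disguise):

    wget --mirror --user-agent="googlebot" http://www.slackware.com/book/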
 
Old 06-15-2005, 03:03 AM   #8
lowpingnoob
Member
 
Registered: Jun 2005
Distribution: Fedora Core 3, soon DSL (DSL backwards is LSD hahahaha)
Posts: 245

Rep: Reputation: 30
Quote:
Originally posted by bulliver
So use --user-agent="googlebot" and blame it on google
They could trace it.... If my site was brought down by a loser, I would go to extreme lengths to track them down, find their physical address, and beat the bejeesus out of them. I hosted an FTP server on my computer at one point, and someone downloaded a 2 GB install straight from it. Slowed everything down to a crawl for.... one whole day. I traced his IP, and let's just say his network wasn't having any luck finding bandwidth to connect to the internet for a week or so. Mean, but I think what he did was intentional, oh well.
 
  

