LinuxQuestions.org
08-03-2010, 08:44 AM   #1
exscape
wget (or other util): how do I mirror parts of this site?


I want to download all the comics directly linked to from this page:
http://disneycomics.free.fr/index_rosa.php
... but NOT any parent links, links to other authors, etc. Only those in the middle, and all their (scanned) pages.
The reason is, of course, that the site might go down some day, and then the comics would no longer be available.

Ideally I'd end up with that index page and be able to browse the comics offline just as I do online, with wget rewriting the links so they point at the local copies (the -k / --convert-links option, IIRC).

The PROBLEM I'm having is that even if I try to download one comic at a time, wget finds the link back to the index (upper left corner when viewing a comic page) and starts downloading the rest of the site. Since I don't want the rest, that's a giant waste of bandwidth for the site owner (it doesn't matter to ME, as I don't have a GB/month limit, but I'm trying to be as nice as possible here).
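For what it's worth, the "don't wander back to the index" rule can be written down as a small predicate that a home-made crawler would call before following any link. This is only a sketch in Python: the function name `should_follow` and the `excluded_paths` parameter are made up for illustration, and the only path taken from this thread is `/index_rosa.php`. It assumes that blocking the index page (plus off-site links) is enough to stay inside one comic, which would need checking against the real pages.

```python
from urllib.parse import urljoin, urlsplit


def should_follow(link, page_url, excluded_paths=("/index_rosa.php",)):
    """Decide whether a crawler sitting on page_url may follow link.

    Hypothetical helper: rejects non-HTTP links, links to other hosts,
    and any path listed in excluded_paths (by default the index page).
    """
    origin = urlsplit(page_url)
    # Resolve relative links such as "../index_rosa.php" against the page.
    target = urlsplit(urljoin(page_url, link))
    if target.scheme not in ("http", "https"):
        return False
    if target.netloc != origin.netloc:
        return False
    return target.path not in excluded_paths
```

If other index pages (other authors, for example) are linked from a comic page as well, their paths would have to be added to `excluded_paths`.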

A solution for either downloading them all via wget, or downloading one at a time (e.g. http://disneycomics.free.fr/Ducks/Ro...?loc=D2002-033 - I'll grab the URLs using regexes) would be very welcome.

Of course, if I have to download them "manually", that might cause problems with directory naming instead. Still, that too should be extractable with an ugly perl-regex hack.
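As a rough idea of how the URL grabbing and the directory naming could go together, here is a Python sketch instead of a perl-regex hack. It assumes that every comic link on the index carries a `loc=` query parameter, as the example URL above does, and that nothing else on the page does; I have not verified that. The names `LinkCollector`, `comic_links` and `safe_dirname` are invented for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def comic_links(index_html, index_url):
    """Return {loc: absolute_url} for links that have a loc= parameter.

    ASSUMPTION: only comic links carry loc=; check the real index page.
    """
    parser = LinkCollector()
    parser.feed(index_html)
    found = {}
    for href in parser.hrefs:
        url = urljoin(index_url, href)
        loc = parse_qs(urlsplit(url).query).get("loc")
        if loc:
            # Keep the first URL seen for each comic; skip duplicates.
            found.setdefault(loc[0], url)
    return found


def safe_dirname(loc):
    """Turn a loc value into a name that is safe to use as a directory."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", loc)
```

Using the `loc` value (run through `safe_dirname`) as the directory name would give one directory per comic without having to guess titles from the page.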
 
08-04-2010, 11:31 AM   #2
David the H.
You might check out httrack, which is a proper website mirrorer. You can set up filters so that it only downloads certain file-types and follows certain link patterns. There's also a web interface you can use with it (webhttrack). It's a bit complex to figure out at first, but it'll give you a lot more fine-grained control than wget offers.
 
08-04-2010, 12:03 PM   #3
exscape
I tried mucking around with httrack earlier with little success: I only ever got it to download the first page (01.jpg). If only there were an index of all the pages of a comic, this would be easy. As it is, though, the download has to go:
Index -> comic page 1, save image, follow link to page 2 -> comic page 2, save image, follow link to page 3 -> ...
for every comic. I just can't figure out how to make it do that.
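The page-by-page chain described above can be sketched as a small loop in Python. Big caveat: nothing in this thread says what the site's markup looks like, so the two regexes below (`IMAGE_RE`, `NEXT_RE`) are pure placeholders that assume a `.jpg` image tag and a link whose text is "next". They would have to be adapted after looking at the real page source. All function names are invented for the example.

```python
import re
import time
from urllib.parse import urljoin
from urllib.request import urlopen

# ASSUMPTION: placeholder patterns, not taken from the real site.
IMAGE_RE = re.compile(r'<img[^>]+src="([^"]+\.jpg)"', re.I)
NEXT_RE = re.compile(r'<a[^>]+href="([^"]+)"[^>]*>\s*next\s*</a>', re.I)


def parse_page(html):
    """Return (image_src, next_href); either one may be None."""
    image = IMAGE_RE.search(html)
    nxt = NEXT_RE.search(html)
    return (image.group(1) if image else None,
            nxt.group(1) if nxt else None)


def fetch_http(url, delay=1.0):
    """Download one page as text, pausing first to go easy on the server."""
    time.sleep(delay)
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def walk_comic(first_url, fetch=fetch_http, parse=parse_page, max_pages=500):
    """Yield (page_url, image_url) pairs by following 'next' links.

    Stops when there is no next link, when a page repeats (a loop),
    or after max_pages pages.
    """
    seen = set()
    url = first_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        image_src, next_href = parse(fetch(url))
        if image_src:
            yield url, urljoin(url, image_src)
        url = urljoin(url, next_href) if next_href else None
```

Each yielded image URL could then be saved with `urllib.request.urlretrieve(image_url, filename)`. Because only the "next" link is ever followed, the link back to the index is simply never visited, and the `seen` set keeps the loop from running forever if the last page links back to the first.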
 
  

