wget (or other util): how do I mirror parts of this site?

exscape · 08-03-2010, 08:44 AM

I want to download all the comics directly linked to from this page:
http://disneycomics.free.fr/index_rosa.php
... but NOT any parent links, links to other authors, etc. Only the comics listed in the middle of the page, along with all of their (scanned) pages.
The reason is, of course, that the site might go down some day, at which point the comics would no longer be available anywhere.

The optimal scenario would be that I get that index page and can browse the comics locally just as I do online, with wget rewriting all URLs to be relative (the -k / --convert-links option, IIRC).

The PROBLEM I'm having is that even if I try to download one comic at a time, wget finds the link back to the index (upper left corner when viewing a comic page) and starts downloading the rest of the site. Since I don't want the rest, that's a giant waste of bandwidth for the site owner (it doesn't matter to ME, as I don't have a GB/month limit, but I'm trying to be as nice as possible here).

A solution for either downloading them all via wget, or downloading one at a time (e.g. http://disneycomics.free.fr/Ducks/Ro...?loc=D2002-033 - I'll grab the URLs using regexes) would be very welcome.

Of course, if I have to download them "manually", that might cause problems with directory naming instead. Still, that too should be extractable with an ugly perl-regex hack.
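For the "grab the URLs" step, here is a minimal Python sketch of one way to do it without regexes. It assumes (based only on the example URL above) that every comic link on the index carries a `loc` query parameter and that nothing else on the page does; I haven't verified that against the real markup. The names `LinkCollector` and `comic_links` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, parse_qs


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def comic_links(index_html, index_url):
    """Return absolute URLs of links that carry a `loc` query parameter.

    ASSUMPTION: a `loc` parameter marks a comic link. Adjust the filter
    if the real index page uses a different scheme.
    """
    collector = LinkCollector()
    collector.feed(index_html)
    found = []
    for href in collector.hrefs:
        absolute = urljoin(index_url, href)
        has_loc = "loc" in parse_qs(urlparse(absolute).query)
        if has_loc and absolute not in found:  # keep order, drop duplicates
            found.append(absolute)
    return found
```

Feed it the saved HTML of the index page plus the index URL, and it returns the list of comic start pages. Links to parents or other authors are dropped as long as they have no `loc` parameter.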

David the H. · 08-04-2010, 11:31 AM

You might check out httrack, which is a proper website mirroring tool. You can set up filters so that it only downloads certain file types and follows certain link patterns. There's also a web interface you can use with it (webhttrack). It's a bit complex to figure out at first, but it'll give you a lot more fine-grained control than wget offers.

exscape · 08-04-2010, 12:03 PM

I tried mucking around with httrack earlier with little success: I only ever got it to download the first page (01.jpg). If only there were an index of all the pages of each comic, this would be easy. As it is now, though, the tool has to go:
Index -> Comic page 1, save image, follow link to page 2 -> at comic page 2, save image, follow link to page 3 -> ...
for every comic. I just can't figure out how to make it do that.
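The page-by-page walk described above could be sketched in Python roughly like this. It is only a sketch under assumptions: I don't know how the site marks its "next page" link or its scan image, so both are passed in as regexes (`image_re`, `next_re`) that you would have to write after looking at the real HTML. The function name `walk_comic` and the injected `fetch` callable are made up for this example.

```python
import re
from urllib.parse import urljoin


def walk_comic(first_url, fetch, image_re, next_re, max_pages=500):
    """Follow one comic page by page; return image URLs in reading order.

    fetch(url) must return that page's HTML as a string.
    image_re / next_re are regexes whose group 1 is the image src and the
    next-page href. ASSUMPTION: both are guesses to be adapted to the site.
    Only the "next" link is ever followed, so the link back to the index
    is never fetched.
    """
    images, seen, url = [], set(), first_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)                       # guards against link loops
        html = fetch(url)
        img = re.search(image_re, html)
        if img:
            images.append(urljoin(url, img.group(1)))
        nxt = re.search(next_re, html)
        url = urljoin(url, nxt.group(1)) if nxt else None
    return images
```

For real use, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read().decode("utf-8", "replace")`, ideally with a `time.sleep` between requests to stay polite to the server. Downloading the returned image URLs into one directory per comic is then a separate, simple loop.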