Linux - General
This Linux forum is for general Linux questions and discussion. If it is Linux related and doesn't seem to fit in any other forum, then this is the place.
I know wget can handle downloading entire websites, but how do you get a site that requires a login? Where in the syntax would I put my user name and password, or is there no way to do that?
I don't know the details of wget---have you checked the documentation, man page, etc.? (I skimmed and found several references to passwords, logins, etc.)
I have been using Twill---which I found easier to understand. It provides for entering passwords and other info, clicking buttons, etc.
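That said, from skimming the wget man page, something along these lines looks like it should work (untested; the URLs, the login path, and the form field names "user"/"pass" are just placeholders -- check the site's actual login form):

[code]
# Site protected by HTTP basic auth: credentials go on the command line
wget --user=myname --password=mypass -r -np http://example.com/private/

# Form-based login: post the credentials once, save the session cookie,
# then reuse that cookie for the recursive download
wget --save-cookies cookies.txt --keep-session-cookies \
     --post-data='user=myname&pass=mypass' \
     -O /dev/null http://example.com/login.php
wget --load-cookies cookies.txt -r -np http://example.com/members/
[/code]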
I won't be too specific because that would point to how to defeat the procedures. But, essentially, it is a combination of honeypots and pattern matching. My site has locations on it that a human will never, ever find but a site scraper will find. I link these locations using a one pixel gif which is physically hidden behind another gif on the page. Humans will never reach it, but a site scraper will find the link, follow it, and get immediately blacklisted. I keep search engines that I want visiting the site from falling into the honeypots with .htaccess entries.
Pattern matching looks for things that are commonly done by scrapers: too many pages in too short a time; downloading pages while not downloading images; downloading images while not downloading pages - things like that.
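Just to illustrate the pattern-matching side in general terms (this is not my actual setup), something like the following run against an Apache combined-format access log would flag clients that pull a lot of pages while never requesting images or other assets. The threshold, file extensions, and log path are only examples:

[code]
#!/bin/sh
# Flag IPs with many page requests but no image/asset requests
LOG=/var/log/apache2/access.log

awk '$7 ~ /\.(html|php)(\?|$)/ {pages[$1]++}
     $7 ~ /\.(gif|jpg|jpeg|png|css|js)(\?|$)/ {assets[$1]++}
     END {
       for (ip in pages)
         if (pages[ip] > 100 && assets[ip] == 0)
           print ip ": " pages[ip] " page requests, no images/assets"
     }' "$LOG"
[/code]

Run from cron, a report like that is enough to feed candidate addresses into whatever blacklisting mechanism you already use.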