This is my first post here, so I'll try to keep it simple.
A few months ago I developed a crawler in Python for a college project whose main focus was performance. Over time it got bigger and bigger, and I've run into the major crawler issue: I've been blocked on some websites.
First of all, I'd like to say that I do use a good revisitation policy, and yet it happened.
I've thought of some solutions to the problem, but I'm also open to suggestions (a rough sketch of what I mean follows the list):
Proxy - ( Problem: besides using the bandwidth of another server, it also conflicts with the performance goal that is the major priority in the project )
Hide IP - ( I don't know if it's possible. I thought of some program sniffing the packets and changing the IP )
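Just to illustrate the kind of thing I mean by "revisitation policy", and how proxy rotation would fit in, here's a bare-bones sketch. It's not my actual code; the delay value, user-agent string and proxy list are placeholders.
[CODE]
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests  # third-party: pip install requests

USER_AGENT = "my-crawler"        # placeholder
PER_DOMAIN_DELAY = 10.0          # seconds between requests to the same host (placeholder)
PROXIES = []                     # e.g. ["http://proxy1:3128"] if rotating; empty = direct

_last_hit = {}                   # host -> time of last request
_robots = {}                     # host -> cached RobotFileParser
_proxy_index = 0

def allowed(url):
    """Check robots.txt once per host and cache the parser."""
    host = urlparse(url).netloc
    if host not in _robots:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            pass  # network error: parser stays conservative, can_fetch() returns False
        _robots[host] = rp
    return _robots[host].can_fetch(USER_AGENT, url)

def polite_get(url):
    """Fetch a URL while honouring robots.txt and a per-host delay."""
    global _proxy_index
    if not allowed(url):
        return None              # robots.txt says stay out
    host = urlparse(url).netloc
    wait = PER_DOMAIN_DELAY - (time.time() - _last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)         # the "revisitation policy" part
    proxies = None
    if PROXIES:                  # optional: spread requests across several proxies
        proxy = PROXIES[_proxy_index % len(PROXIES)]
        _proxy_index += 1
        proxies = {"http": proxy, "https": proxy}
    _last_hit[host] = time.time()
    return requests.get(url, headers={"User-Agent": USER_AGENT},
                        proxies=proxies, timeout=30)
[/CODE]
The per-host delay is what kills throughput when one domain has most of the pages, which is exactly the conflict with the performance goal I mentioned above.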
So, what do you have to say? Have you had the same problem and found a solution?
EDIT: I'd just like to clarify that it's not a college project anymore. It started as one, but I already graduated.
The solution is to not crawl sites that don't want you there.
Every crawler my site detects gets its IP address permanently blocked. And my honeypot is pretty good at finding crawlers; it positively WILL detect any crawler that ignores robots.txt or that revisits too frequently. I worked on that for a while and I'm pretty proud of it.
The block doesn't actually cut off access to the site; it just redirects every attempt at access from that IP address to a special "go away, crawler" page, and waits several seconds before responding. So the crawler gets fed pages very slowly, and those pages are useless to it. Wastes a lot of the crawler's time. I like doing that.
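To show the mechanism (this is not my actual code, and I'm not posting that): a bare-bones sketch using Flask with an in-memory blocklist. A trap URL is disallowed in robots.txt and linked invisibly from the pages; anything that requests it is ignoring robots.txt, so its IP goes on the blocklist, and blocklisted IPs only ever get the slow "go away" page. The rate-based detection and the redirect are left out here to keep it short.
[CODE]
import time
from flask import Flask, request

app = Flask(__name__)

blocked_ips = set()  # in-memory for the sketch; a real setup would persist this

ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"

def tarpit_response():
    """Slow, useless answer served to anything on the blocklist."""
    time.sleep(5)  # make the crawler wait for every page
    return "Go away, crawler.", 429

@app.before_request
def check_blocklist():
    # Returning a response here short-circuits normal handling.
    if request.remote_addr in blocked_ips:
        return tarpit_response()

@app.route("/robots.txt")
def robots():
    # Well-behaved crawlers read this and never touch /trap/.
    return ROBOTS_TXT, 200, {"Content-Type": "text/plain"}

@app.route("/trap/<path:anything>")
def trap(anything):
    # Only a crawler that ignores robots.txt ends up here: blocklist it.
    blocked_ips.add(request.remote_addr)
    return tarpit_response()

@app.route("/")
def index():
    # Placeholder for the real site; the bait link is invisible to humans.
    return '<a href="/trap/honeypot.html"></a>Normal content'

if __name__ == "__main__":
    app.run()
[/CODE]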
Edit:
I wasn't going to mention this, but after thinking about it for a while I will.
I have run across a few crawlers that just won't take no for an answer. Usually, if you are checking your logs, you can eventually pick them out. I COULD just keep blocking them; the honeypot works. But people like that really piss me off.
So, on the few occasions where someone just won't quit, I have a special payload that I keep around just for them. It works best against Windows, of course, since Windows is just more vulnerable. But it will also do some damage on Linux, depending on how well the Linux system is secured and what permissions the crawler has.
So, you just go to work and try to find a way past my honeypot. Try to find a way to crawl my site when I've told you not to. If I identify your crawler in my logs, well...I have a surprise for you.
Thank you for your answer. It's good to know that there are people hostile to crawlers, and what their policies are.
But I intend to build a startup based on e-commerce data, and I can't afford not to crawl some sites. So, although I respect and understand a website's need to fend off web crawlers, I still need their information.
Then try to crawl my site after I have forbidden you to. I have a surprise package for you. We'll see how your e-commerce thing goes when you lose lots of data.