This is my first post here, so I'll try to keep it simple.
A few months ago I developed a crawler in Python for a college project whose main focus was performance. Over time it got bigger and bigger, and I've run into the major crawler issue: I've been blocked on some websites.
First of all, I'd like to say that I do use a good revisitation policy, and yet it happened.
I've thought of some solutions to the problem, but I'm also open to suggestions (a rough sketch of what I mean follows the list):
Proxy - ( Problem: besides using the bandwidth of another server, it also conflicts with the performance goal that is the major priority in the project )
Hide IP - ( I don't know if it's possible. I thought of some program sniffing the packets and changing the IP )
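Just to illustrate the kind of thing I mean by "revisitation policy", and how proxy rotation would fit in, here's a bare-bones sketch. It's not my actual code; the delay value, user-agent string and proxy list are placeholders.
[CODE]
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests  # third-party: pip install requests

USER_AGENT = "my-crawler"        # placeholder
PER_DOMAIN_DELAY = 10.0          # seconds between requests to the same host (placeholder)
PROXIES = []                     # e.g. ["http://proxy1:3128"] if rotating; empty = direct

_last_hit = {}                   # host -> time of last request
_robots = {}                     # host -> cached RobotFileParser
_proxy_index = 0

def allowed(url):
    """Check robots.txt once per host and cache the parser."""
    host = urlparse(url).netloc
    if host not in _robots:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            pass  # network error: parser stays conservative, can_fetch() returns False
        _robots[host] = rp
    return _robots[host].can_fetch(USER_AGENT, url)

def polite_get(url):
    """Fetch a URL while honouring robots.txt and a per-host delay."""
    global _proxy_index
    if not allowed(url):
        return None              # robots.txt says stay out
    host = urlparse(url).netloc
    wait = PER_DOMAIN_DELAY - (time.time() - _last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)         # the "revisitation policy" part
    proxies = None
    if PROXIES:                  # optional: spread requests across several proxies
        proxy = PROXIES[_proxy_index % len(PROXIES)]
        _proxy_index += 1
        proxies = {"http": proxy, "https": proxy}
    _last_hit[host] = time.time()
    return requests.get(url, headers={"User-Agent": USER_AGENT},
                        proxies=proxies, timeout=30)
[/CODE]
The per-host delay is what kills throughput when one domain has most of the pages, which is exactly the conflict with the performance goal I mentioned above.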
So, what do you have to say? Have you had the same problem and found a solution?
EDIT: I'd just like to clarify that it's not a college project anymore. It started as one, but I already graduated.
The solution is to not crawl sites that don't want you there.
Every crawler my site detects gets its IP address permanently blocked. And my honeypot is pretty good at finding crawlers; it positively WILL detect any crawler that ignores robots.txt or that revisits too frequently. I worked on that for a while and I'm pretty proud of it.
The block doesn't actually cut off access to the site; it just redirects every attempt at access from that IP address to a special "go away, crawler" page, and waits several seconds before responding. So the crawler gets fed pages very slowly, and those pages are useless to it. Wastes a lot of the crawler's time. I like doing that.
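To show the mechanism (this is not my actual code, and I'm not posting that): a bare-bones sketch using Flask with an in-memory blocklist. A trap URL is disallowed in robots.txt and linked invisibly from the pages; anything that requests it is ignoring robots.txt, so its IP goes on the blocklist, and blocklisted IPs only ever get the slow "go away" page. The rate-based detection and the redirect are left out here to keep it short.
[CODE]
import time
from flask import Flask, request

app = Flask(__name__)

blocked_ips = set()  # in-memory for the sketch; a real setup would persist this

ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"

def tarpit_response():
    """Slow, useless answer served to anything on the blocklist."""
    time.sleep(5)  # make the crawler wait for every page
    return "Go away, crawler.", 429

@app.before_request
def check_blocklist():
    # Returning a response here short-circuits normal handling.
    if request.remote_addr in blocked_ips:
        return tarpit_response()

@app.route("/robots.txt")
def robots():
    # Well-behaved crawlers read this and never touch /trap/.
    return ROBOTS_TXT, 200, {"Content-Type": "text/plain"}

@app.route("/trap/<path:anything>")
def trap(anything):
    # Only a crawler that ignores robots.txt ends up here: blocklist it.
    blocked_ips.add(request.remote_addr)
    return tarpit_response()

@app.route("/")
def index():
    # Placeholder for the real site; the bait link is invisible to humans.
    return '<a href="/trap/honeypot.html"></a>Normal content'

if __name__ == "__main__":
    app.run()
[/CODE]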
Edit:
I wasn't going to mention this, but after thinking about it for a while I will.
I have run across a few crawlers that just won't take no for an answer. Usually, if you are checking your logs, you can eventually pick them out. I COULD just keep blocking them; the honeypot works. But people like that really piss me off.
So, on the few occasions where someone just won't quit, I have a special payload that I keep around just for them. It works best against Windows, of course, since Windows is just more vulnerable. But it will also do some damage on Linux, depending on how well the Linux system is secured and what permissions the crawler has.
So, you just go to work and try to find a way past my honeypot. Try to find a way to crawl my site when I've told you not to. If I identify your crawler in my logs, well...I have a surprise for you.
Thank you for your answer. It's good to know that there are people hostile to crawlers, and what their policies are.
But I intend to build a startup based on e-commerce data, and I can't afford not to crawl some sites. So, although I respect and understand a website's need to fend off web crawlers, I still need their information.
Then try to crawl my site after I have forbidden you to. I have a surprise package for you. We'll see how your e-commerce thing goes when you lose lots of data.