Old 12-02-2010, 11:06 AM   #1
fjcaetano
LQ Newbie
 
Registered: Dec 2010
Posts: 3

Rep: Reputation: 0
Crawler Issue


Hi guys,

This is my first post here, so I'll try to keep it simple.

A few months ago I developed a crawler in Python for a college project that was focused on performance. Over time it grew bigger and bigger, and I've run into the classic crawler problem: I've been blocked by some websites.

First of all, I'd like to say that I do use a reasonable revisitation policy, and yet it still happened.
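To be concrete, by "revisitation policy" I mean logic along these lines (a simplified sketch, not my actual code; the user agent string and the delay value are just placeholders):

Code:
# Polite fetching: honor robots.txt and rate-limit requests per host.
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests  # third-party HTTP library

USER_AGENT = "MyCrawler/1.0 (contact@example.com)"  # placeholder identity
MIN_DELAY = 10.0   # seconds between hits on the same host (placeholder)

_robots = {}     # host -> parsed robots.txt
_last_hit = {}   # host -> timestamp of the last request

def allowed(url):
    host = urlparse(url).netloc
    rp = _robots.get(host)
    if rp is None:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("http://%s/robots.txt" % host)
        try:
            rp.read()
        except OSError:
            pass  # unreachable robots.txt: can_fetch() stays conservative
        _robots[host] = rp
    return rp.can_fetch(USER_AGENT, url)

def polite_get(url):
    host = urlparse(url).netloc
    wait = MIN_DELAY - (time.time() - _last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)   # enforce the per-host delay
    _last_hit[host] = time.time()
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)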

I've thought of some solutions to the problem, but I'm also open to suggestions:
  1. Proxy - ( Problem: besides using another server's bandwidth, it also conflicts with performance, which is the top priority of the project. A rough sketch of this option follows the list. )
  2. Hide my IP - ( I don't know if it's possible. I was thinking of a program that sniffs outgoing packets and rewrites the source IP. )
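For option 1, what I have in mind is roughly this (a sketch only; the proxy addresses are hypothetical, and the extra hop is exactly what hurts performance):

Code:
# Rotate requests through a pool of HTTP proxies.
import itertools

import requests  # third-party HTTP library

PROXIES = itertools.cycle([
    "http://proxy1.example.com:8080",  # hypothetical proxy servers
    "http://proxy2.example.com:8080",
])

def get_via_proxy(url):
    proxy = next(PROXIES)
    # The target site sees the proxy's IP instead of ours,
    # at the cost of an extra network hop per request.
    return requests.get(url, proxies={"http": proxy, "https": proxy},
                        timeout=30)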

So, what do you have to say? Have you had the same problem and found a solution?

EDIT: I'd just like to clarify that it's not a college project anymore. It started as one, but I have since graduated.

Last edited by fjcaetano; 12-02-2010 at 11:21 AM. Reason: more information
 
Old 12-02-2010, 02:11 PM   #2
jiml8
Senior Member
 
Registered: Sep 2003
Posts: 3,171

Rep: Reputation: 116
The solution is to not crawl sites that don't want you there.

Every crawler my site detects gets its IP address permanently blocked. And my honeypot is pretty good at finding crawlers; it positively WILL detect any crawler that ignores robots.txt, or that revisits too frequently. I worked on that for a while and I'm pretty proud of it.
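I won't post my actual code, but the generic shape of that kind of trap is no secret (illustrative Python only, not my implementation; the path and thresholds here are made up):

Code:
# Generic crawler trap, illustrative only.
# robots.txt contains "Disallow: /trap/", and pages link to /trap/
# invisibly, so no human and no polite crawler ever requests it.
import time
from collections import deque

banned = set()
recent = {}   # ip -> deque of recent request timestamps

def is_crawler(ip, path, max_hits=30, window=60.0):
    if path.startswith("/trap/"):
        banned.add(ip)                  # ignored robots.txt: busted
    hits = recent.setdefault(ip, deque())
    now = time.time()
    hits.append(now)
    while hits and now - hits[0] > window:
        hits.popleft()
    if len(hits) > max_hits:
        banned.add(ip)                  # revisiting far too frequently
    return ip in banned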

It doesn't actually block access; it just redirects every attempt from that IP address to a special "go away, crawler" page, and waits several seconds before responding. So the crawler gets fed pages very slowly, and those pages are useless to it. That wastes a lot of the crawler's time. I like doing that.
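Schematically, the tarpit side works like this (again a generic sketch, not my implementation; the banned IP shown is a placeholder):

Code:
# Tarpit sketch: flagged IPs get the same useless page, served slowly.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

BANNED = {"203.0.113.7"}   # placeholder; filled in by the detection logic
GO_AWAY = b"<html><body>Go away, crawler.</body></html>"
REAL = b"<html><body>Normal content here.</body></html>"

class Tarpit(BaseHTTPRequestHandler):
    def do_GET(self):
        flagged = self.client_address[0] in BANNED
        if flagged:
            time.sleep(10)             # stall before every response
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        # Every URL a flagged IP asks for yields the same junk page.
        self.wfile.write(GO_AWAY if flagged else REAL)

if __name__ == "__main__":
    HTTPServer(("", 8080), Tarpit).serve_forever()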

Edit:

I wasn't going to mention this, but after thinking about it for a while, I will.

I have run across a few crawlers that just won't take no for an answer. Usually, if you are checking your logs, you can eventually pick them out. I COULD just keep blocking them; the honeypot works. But people like that just really piss me off.

So, on those few occasions where someone just won't quit, I have a special payload that I keep around just for them. It works best on Windows, of course, since Windows is just more vulnerable. But it will also do some damage on Linux, depending on how well the Linux system is secured and what permissions the crawler has.

So, you just go to work and try to find a way past my honeypot. Try to find a way to crawl my site when I've told you not to. If I identify your crawler in my logs, well...I have a surprise for you.

Last edited by jiml8; 12-02-2010 at 10:36 PM.
 
Old 12-03-2010, 10:59 AM   #3
fjcaetano
LQ Newbie
 
Registered: Dec 2010
Posts: 3

Original Poster
Rep: Reputation: 0
Hi jiml8,

Thank you for your answer. It's good to know that there are people hostile to crawlers, and to learn what their policies are.

But I intend to build a startup based on e-commerce data, and I can't afford not to crawl some sites. So, although I respect and understand websites' need to keep crawlers out, I still need their information.

Thanks
 
Old 12-03-2010, 03:24 PM   #4
jiml8
Senior Member
 
Registered: Sep 2003
Posts: 3,171

Rep: Reputation: 116
Then try to crawl my site after I have forbidden you to. I have a surprise package for you. We'll see how your e-commerce thing goes when you lose lots of data.
 
Old 12-06-2010, 03:02 PM   #5
fjcaetano
LQ Newbie
 
Registered: Dec 2010
Posts: 3

Original Poster
Rep: Reputation: 0
So, does anyone else have anything to say about this?

jiml8, instead of being hostile, you could treat this topic as an opportunity to improve your anti-crawler policy.
 
Old 12-06-2010, 03:44 PM   #6
MTK358
LQ 5k Club
 
Registered: Sep 2009
Posts: 6,443
Blog Entries: 3

Rep: Reputation: 723
I hate targeted ads and data gathering.
 
Old 12-06-2010, 03:48 PM   #7
stress_junkie
Senior Member
 
Registered: Dec 2005
Location: Massachusetts, USA
Distribution: Ubuntu 10.04 and CentOS 5.5
Posts: 3,873

Rep: Reputation: 335
Quote:
Originally Posted by MTK358
I hate targeted ads and data gathering.
You must hate Google. Yet where would we be without it?
 
  

