LinuxQuestions.org
Old 05-30-2006, 08:24 AM   #1
chutsu
Member
 
Registered: Nov 2003
Location: UK
Distribution: Debian Lenny
Posts: 250

Rep: Reputation: 31
How to program Web Bots???


Hi
For those who know what web bots are: does anyone know where to find an up-to-date book on programming web spiders in Python or C? I stumbled on the O'Reilly book "Spidering Hacks", which gives you bits and pieces on how to program web bots in Perl, but I don't want to learn another computer language, and the book was written a few years ago. Have any newer books been published? Thanks.
Chris
 
Old 05-30-2006, 09:48 AM   #2
slzckboy
Member
 
Registered: May 2005
Location: uk - Reading
Distribution: slack 10.2 kde 3.4.2 kernel 2.6.15
Posts: 452

Rep: Reputation: 30
Like a web spider?

I just googled for my research, as I wanted to write one in C, more as a learning exercise than anything else.


If you go the C route, you may want to look at the libxml library at xmlsoft.org.

It has libraries for parsing HTML and handling URIs, and it also has an HTTP implementation (although I didn't like it, so I wrote my own).

You will want to research the robots exclusion policy and robot etiquette, and also consider how you are going to sort and manage all that data.
A robot soon amasses thousands of links, so you will want to be clear in your own mind how your robot will process, manage, and store the data of interest. That is something I am still trying to fine-tune in my project.
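
Since the original question mentioned Python, here is a rough sketch of those two points: checking the robots exclusion rules before queuing a URL, and keeping a "seen" set so the same link is never queued twice. It uses only the standard library. The robots.txt text, the bot name `example-bot`, and the `Frontier` class are made up for illustration; a real robot would download robots.txt from each host and would probably keep its queue on disk or in a database once the link count grows.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real robot would fetch this from
# http://<host>/robots.txt before requesting anything else on that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "example-bot"  # hypothetical bot name


def make_robots(text):
    """Build a robots.txt rule checker from the file's text."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


class Frontier:
    """Queue of URLs still to visit, plus a set to skip duplicates."""

    def __init__(self, robots):
        self.robots = robots
        self.queue = deque()
        self.seen = set()

    def add(self, base, link):
        """Queue a link found on page `base`; return True if it was queued."""
        # Resolve relative links and drop "#fragment" so that
        # page.html and page.html#top count as the same URL.
        url, _fragment = urldefrag(urljoin(base, link))
        if url in self.seen:
            return False
        self.seen.add(url)
        # Respect the site's exclusion rules.
        if not self.robots.can_fetch(USER_AGENT, url):
            return False
        self.queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

Usage would look something like `frontier = Frontier(make_robots(ROBOTS_TXT))`, then `frontier.add(page_url, href)` for every link found on a page and `frontier.next_url()` to pick the next page to fetch. The in-memory set is the simplest choice; it is the part most likely to need replacing once the robot has collected many thousands of links.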

Best of luck.
 
Old 05-30-2006, 12:42 PM   #3
chutsu
Member
 
Registered: Nov 2003
Location: UK
Distribution: Debian Lenny
Posts: 250

Original Poster
Rep: Reputation: 31
Do you have any suggestions on what book or tutorial I should read before I begin? I'm very new to the idea of web robots.
Thanks
 
Old 05-30-2006, 01:10 PM   #4
slzckboy
Member
 
Registered: May 2005
Location: uk - Reading
Distribution: slack 10.2 kde 3.4.2 kernel 2.6.15
Posts: 452

Rep: Reputation: 30
Sorry, I just googled, googled, then googled some more.

I found these helpful on my merry way.
http://en.wikipedia.org/wiki/Web_crawler
xmlsoft.org (start with the DOM parser, but then look at the SAX interface; much faster)
http://www.robotstxt.org/wc/robots.html
http://www.garshol.priv.no/download/text/http-tut.html
http://www.phy.duke.edu/~rgb/General...g_example.html
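
To give a feel for the event-driven (SAX-like) style of parsing mentioned above, here is a small Python sketch. It does not use libxml; it swaps in Python's standard `html.parser` module, which works in a similar callback fashion: the parser calls a method for each start tag as it reads the page, so no full document tree is built. The `LinkExtractor` class and `extract_links` function are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag as the parser streams the page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value).
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute URLs for all links in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `extract_links(page_text, page_url)` on a downloaded page returns the list of absolute link URLs, which is what a spider feeds back into its queue of pages to visit.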

Sorry I couldn't be of more help, but if you have any specific questions I will try my best to help.
 
  


All times are GMT -5.
