how hard would it be to make a 'socket based web page retriever'?
Hello.
I've used urllib2 for getting some pages,
but it's incredibly slow.
I've looked at some open source crawlers too,
but it would take too long to figure them out well enough to modify them.
Ultimately I want to have my own crawler.
Before that, if I can handle downloading just one page, it will be easier to turn it into a crawler:
send request
receive header
if it's text/html
    store data in a file
when the socket disconnects
    close the file
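The steps above could be sketched roughly like this in Python 3's `socket` module (just a happy-path sketch: the host, path, and output filename are placeholders, and it skips redirects, chunked encoding, and HTTPS entirely):

```python
import socket

def parse_response(raw):
    """Split a raw HTTP response into (header, body) and report
    whether the Content-Type header says text/html."""
    header, _, body = raw.partition(b"\r\n\r\n")
    is_html = b"content-type: text/html" in header.lower()
    return header, body, is_html

def fetch(host, path="/", port=80, out_file="page.html"):
    """Fetch one page over a raw socket and store it if it's text/html."""
    sock = socket.create_connection((host, port), timeout=10)
    try:
        # send request -- HTTP/1.0 with Connection: close, so the server
        # signals end-of-body by disconnecting
        request = ("GET %s HTTP/1.0\r\n"
                   "Host: %s\r\n"
                   "Connection: close\r\n\r\n" % (path, host))
        sock.sendall(request.encode("ascii"))

        # receive everything until the socket disconnects
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()

    header, body, is_html = parse_response(b"".join(chunks))

    # if it's text/html, store the body in a file
    if is_html:
        with open(out_file, "wb") as f:
            f.write(body)
    return is_html
```

Even this small version hides work: real servers send redirects, gzip/chunked bodies, and keep-alive connections, which is where the "tons of small things" show up.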
It looks simple, but I guess there are tons of small things to take care of.
So my question is: how hard would it be (how long would it take) to reasonably handle the HTTP protocol?
Thank you.