how hard would it be to make a 'socket based web page retriever'?
Hello.
I've used urllib2 for getting some pages,
but it's incredibly slow.
I've looked at some open source crawlers too,
but it would take too long to figure them out well enough to modify them.
Ultimately I want to have my own crawler.
Before that, if I can handle downloading just one page, it will be easier to turn it into a crawler:
send request
receive header
if it's text/html
    store data in a file
when the socket disconnects
    close the file
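The steps above could be sketched roughly like this in Python 3's `socket` module (just a happy-path sketch: the host, path, and output filename are placeholders, and it skips redirects, chunked encoding, and HTTPS entirely):

```python
import socket

def parse_response(raw):
    """Split a raw HTTP response into (header, body) and report
    whether the Content-Type header says text/html."""
    header, _, body = raw.partition(b"\r\n\r\n")
    is_html = b"content-type: text/html" in header.lower()
    return header, body, is_html

def fetch(host, path="/", port=80, out_file="page.html"):
    """Fetch one page over a raw socket and store it if it's text/html."""
    sock = socket.create_connection((host, port), timeout=10)
    try:
        # send request -- HTTP/1.0 with Connection: close, so the server
        # signals end-of-body by disconnecting
        request = ("GET %s HTTP/1.0\r\n"
                   "Host: %s\r\n"
                   "Connection: close\r\n\r\n" % (path, host))
        sock.sendall(request.encode("ascii"))

        # receive everything until the socket disconnects
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()

    header, body, is_html = parse_response(b"".join(chunks))

    # if it's text/html, store the body in a file
    if is_html:
        with open(out_file, "wb") as f:
            f.write(body)
    return is_html
```

Even this small version hides work: real servers send redirects, gzip/chunked bodies, and keep-alive connections, which is where the "tons of small things" show up.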
It looks simple, but I guess there are tons of small things to take care of.
So my question is: how hard would it be (how long would it take) to reasonably handle the HTTP protocol?
Thank you.