LinuxQuestions.org


bruno buys 10-25-2005 05:35 PM

Spider software or similar?
 
I'm working on a project where we need to collect news stories from some online newspapers. So far we have relied on manual work (me) to enter the data into a database, but since we also need to collect on weekends, holidays and so on, we have to settle on an automatic solution.
I tried
wget -r http://newspaper-website.../specific-section-we-need

but wget goes crazy on dynamic websites like newspapers, with ASP, PHP and friends. Recursive retrieval doesn't seem to work properly over HTTP on these sites, so I don't think wget is the one.
The engine must be able to do some smart things, like saving the pictures related to a story, saving the URL, figuring out which part of the HTML is the actual news story, etc. And it must build some sort of easy-to-query database.

It can be a program, a script or a mix of both.
Any ideas?

Tinkster 10-25-2005 07:09 PM

You could give pavuk a shot ...



Cheers,
Tink

bruno buys 10-26-2005 09:20 PM

I installed and tried pavuk. Seems very nice. I didn't try all of its massive list of features, but it does seem to be suited.
Newspapers create a huge load of material every day. Being able to mirror it locally is a big step, as it frees me from having to file it manually every day.
Now the issue boils down, I guess, to how to parse (?) arbitrary fields out of the downloaded shtml/html files and into some sort of database, which will most likely be SQL...
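
To make that last step concrete, here is a minimal sketch of one way it might go, using only Python's standard library (html.parser and sqlite3). Everything in it is an assumption for illustration, not something pavuk provides: the names StoryParser, parse_story and store_story, the "stories" table layout, and above all the guess that the story is simply the page's title plus the text of its p tags. Real newspaper pages would need site-specific rules to tell the story apart from menus and ads.

```python
import sqlite3
from html.parser import HTMLParser


class StoryParser(HTMLParser):
    """Collect the <title> text, <p> text and <img> sources of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self.images = []
        self._in_title = False
        self._in_p = False
        self._buf = []

    def _flush_paragraph(self):
        # Normalise whitespace and keep the paragraph only if it has text.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            if self._in_p:  # older pages often leave <p> unclosed
                self._flush_paragraph()
            self._in_p = True
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._flush_paragraph()

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buf.append(data)


def parse_story(html):
    """Return the title, body text and image sources found in an HTML string."""
    parser = StoryParser()
    parser.feed(html)
    parser.close()
    if parser._in_p:  # page ended inside an unclosed <p>
        parser._flush_paragraph()
    return {
        "title": parser.title.strip(),
        "body": "\n".join(parser.paragraphs),
        "images": parser.images,
    }


def store_story(conn, url, story):
    """Insert (or overwrite) one story row, keyed by its URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stories "
        "(url TEXT PRIMARY KEY, title TEXT, body TEXT, images TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO stories VALUES (?, ?, ?, ?)",
        (url, story["title"], story["body"], "\n".join(story["images"])),
    )
    conn.commit()


if __name__ == "__main__":
    # Hypothetical page and URL; a real run would read the mirrored files.
    sample = (
        "<html><head><title>Example headline</title></head><body>"
        "<p>First paragraph.</p><img src='photo.jpg'><p>Second one.</p>"
        "</body></html>"
    )
    conn = sqlite3.connect(":memory:")
    store_story(conn, "http://example.invalid/story1", parse_story(sample))
    print(conn.execute("SELECT url, title FROM stories").fetchall())
```

The idea would be to loop over the files in the local mirror, run parse_story on each, and call store_story with the original URL as the key, so re-mirroring a page overwrites its row instead of duplicating it. Image sources are stored as found in the page, so relative paths would still have to be resolved against the story URL.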


All times are GMT -5.