I'm working on a project where we need to collect news stories from some online newspapers. So far we have relied on manual work (me) to enter the data into a database, but since we also need to collect on weekends, holidays and so on, we have to move to an automatic solution.
I tried
wget -r http://newspaper-website.../specific-section-we-need
but wget goes crazy on dynamic websites like newspapers (ASP, PHP and friends): the recursive option ends up chasing endless generated URLs instead of the actual stories, so I don't think wget is the right tool for this.
The engine must be able to do some smart things, like saving the pictures related to a story, saving its URL, and figuring out which part of the HTML is the actual story text. And it must build some sort of easy-to-query database.
It can be a program, a script or a mix of both.
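To make the requirements concrete, here is a rough sketch of the kind of thing I imagine, assuming Python with requests, BeautifulSoup and SQLite; the URL, the CSS selectors and the table layout are just placeholders, not taken from any real site:

    # Rough sketch: fetch one section page, follow the story links,
    # and store title / body / image URLs in SQLite.
    # The URL and the CSS selectors are placeholders for whatever the real site uses.
    import sqlite3
    import urllib.parse

    import requests
    from bs4 import BeautifulSoup

    SECTION_URL = "http://newspaper-website.example/specific-section"  # placeholder

    def init_db(path="stories.db"):
        conn = sqlite3.connect(path)
        conn.execute("""CREATE TABLE IF NOT EXISTS stories (
                            url TEXT PRIMARY KEY,
                            title TEXT,
                            body TEXT,
                            image_urls TEXT)""")
        return conn

    def scrape_section(conn):
        html = requests.get(SECTION_URL, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        # Placeholder selector: each site needs its own rule for "links to stories".
        for link in soup.select("a.headline"):
            save_story(conn, urllib.parse.urljoin(SECTION_URL, link["href"]))

    def save_story(conn, story_url):
        html = requests.get(story_url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        title_node = soup.find("h1")
        title = title_node.get_text(strip=True) if title_node else ""
        # Placeholder: deciding which element holds the real story text is the hard part.
        body_node = soup.select_one("div.article-body")
        body = body_node.get_text(" ", strip=True) if body_node else ""
        # Only records the image URLs here; downloading them would be another requests.get.
        images = [urllib.parse.urljoin(story_url, img["src"])
                  for img in soup.select("div.article-body img") if img.get("src")]
        conn.execute("INSERT OR REPLACE INTO stories VALUES (?, ?, ?, ?)",
                     (story_url, title, body, ",".join(images)))
        conn.commit()

    if __name__ == "__main__":
        conn = init_db()
        scrape_section(conn)

The hard part is still the selectors: every newspaper would need its own rule for finding the story body, which is why I'm hoping there is an existing engine or framework that handles that part.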
Any ideas?