LinuxQuestions.org

Old 10-25-2005, 05:35 PM   #1
bruno buys
Senior Member
 
Registered: Sep 2003
Location: Rio
Distribution: Debian
Posts: 1,511

Rep: Reputation: 46
Spider software or similar?


I'm working on a project where we need to collect news stories from some online newspapers. So far a human (me) has been entering the data into a database, but since we also need to collect on weekends, holidays and so on, we have to settle on an automated solution.
I tried
wget -r http://newspaper-website.../specific-section-we-need

but wget goes crazy on dynamic websites like newspapers, with ASP, PHP and friends. Recursive retrieval doesn't seem to work properly over HTTP on these sites, so I don't think wget is the one.
The engine must be able to do some smart things, like saving pictures related to the story, saving the URL, figuring out which part of the HTML code is the actual news story, etc. It must also build some sort of easy-to-query database.
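
To make those requirements a bit more concrete, here is a rough Python sketch of the "figure out the story and its pictures" step for a single page, using only the standard library. The names StoryParser and extract_story are made up for illustration, and the idea that the story body lives in <p> tags is just an assumption; real newspaper markup will probably need site-specific rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StoryParser(HTMLParser):
    """Collects paragraph text and image URLs from one HTML page.

    Assumption: the story body lives in <p> tags. This is a heuristic,
    not something every newspaper site will honor.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.paragraphs = []
        self.images = []
        self._in_p = False
        self._buf = []

    def _flush(self):
        # Normalize whitespace and keep the paragraph if it is not empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_p:
                # Older pages often leave <p> unclosed; keep what we have.
                self._flush()
            self._in_p = True
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative image paths against the page URL.
                self.images.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def extract_story(html, url):
    """Return the page URL, the joined paragraph text and the image URLs."""
    parser = StoryParser(url)
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "text": "\n".join(parser.paragraphs),
        "images": parser.images,
    }
```

Calling extract_story(page_html, page_url) on each downloaded page would give a dict with the URL, the story text and the list of pictures to fetch; downloading the pictures themselves is left out here.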

It can be a program, a script or a mix of both.
Any ideas?
 
Old 10-25-2005, 07:09 PM   #2
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
You could give pavuk a shot ...



Cheers,
Tink
 
Old 10-26-2005, 09:20 PM   #3
bruno buys
Senior Member
 
Registered: Sep 2003
Location: Rio
Distribution: Debian
Posts: 1,511

Original Poster
Rep: Reputation: 46
I installed and tried pavuk. It seems very nice. I didn't try all of its massive list of features, but it does seem suited to the job.
Newspapers produce a huge load of material every day. Being able to mirror it locally is a big step, as it frees me from having to file it manually every day.
Now the issue boils down, I guess, to how to parse (?) arbitrary fields from the downloaded shtml/html files into some sort of database, which will most likely be SQL...
 
  

