LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 08-29-2005, 04:43 PM   #1
nodger
Member
 
Registered: Oct 2003
Location: Ireland
Distribution: Slackware 9.1, Ubuntu
Posts: 192

Rep: Reputation: 30
HTML parsing library


Here's what I'm looking for: I'm creating a web spidering app and I need to be able to extract the links and the images on each page. I need an industrial-strength library that can do all the hard work for me.

It should be something that can understand JavaScript so scripted links/pictures are recognised too. Of course, I don't want it getting fooled by infinite loops and exploits.

Would it be a good idea to download the Mozilla source code, or is there already a library that web browsers use for all this?

PS: I'm using cURL to handle all the HTTP (an excellent library) and ImageMagick to handle the images (excellent as well).
Help me out!
Thanks

Last edited by nodger; 08-29-2005 at 04:44 PM.
 
Old 09-01-2005, 01:42 AM   #2
lowpro2k3
Member
 
Registered: Oct 2003
Location: Canada
Distribution: Slackware
Posts: 340

Rep: Reputation: 30
Language? If you know Perl, check CPAN under the "World Wide Web/HTML" directory.
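
To illustrate the static part of the task (pulling links and images out of a page), here is a minimal sketch in Python using only the standard-library html.parser and urllib.parse modules. Python is chosen purely for illustration, since the thread does not name a language. The names LinkImageExtractor and extract are hypothetical, and the sketch does not execute JavaScript, so scripted links would not be found.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkImageExtractor(HTMLParser):
    """Collect absolute URLs from <a href> and <img src> in static HTML.

    Hypothetical helper for illustration; it does not run JavaScript.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL the page was fetched from
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def extract(html, base_url):
    """Return (links, images) found in the given HTML text."""
    parser = LinkImageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, parser.images
```

A call such as extract(page_text, page_url) returns two lists of absolute URLs that a spider could queue for fetching. Handling scripted links would need an actual JavaScript engine, which this sketch leaves out.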
 
  



All times are GMT -5.
