Here's what I'm looking for: I'm building a web spidering app, and I need to extract the links and the images on each page. I'm after an industrial-strength library that can do the hard work for me.
Something that understands JavaScript, so scripted links and images are recognised too; of course I don't want it getting fooled by infinite loops or exploits.
Would it be a good idea to download the Mozilla source code, or is there already a library that web browsers use for all this?
PS: I'm using cURL to handle all the HTTP (an excellent library) and ImageMagick to handle the images, which is excellent as well.
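To show what I mean by the basic case: for plain static markup, pulling out links and images is easy enough; here's a rough sketch using Python's standard-library `html.parser` (names are mine, not from any spidering library). The real problem is everything this can't see, i.e. links that only exist after JavaScript runs:

```python
from html.parser import HTMLParser

class LinkImageExtractor(HTMLParser):
    """Collect anchor hrefs and image srcs from static HTML.

    Only sees markup as delivered; links generated by JavaScript
    are invisible to it, which is exactly the gap I'm asking about.
    """

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])

html = '<a href="/next">next</a><img src="logo.png">'
parser = LinkImageExtractor()
parser.feed(html)
print(parser.links)   # ['/next']
print(parser.images)  # ['logo.png']
```

So the static half is covered; what I need is the browser-engine half that can execute scripts safely.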
Help me out!
Thanks