Here's what I'm looking for: I'm building a web spidering app, and I need to extract the links and the images on each page. I'm after an industrial-strength library that can do the hard work for me.
Something that understands JavaScript, so scripted links and images are recognised too; of course I don't want it getting fooled by infinite loops or exploits.
Would it be a good idea to download the Mozilla source code, or is there already a library that web browsers use for all this?
PS: I'm using cURL to handle all the HTTP (an excellent library) and ImageMagick to handle the images, which is excellent as well.
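To show what I mean by the basic case: for plain static markup, pulling out links and images is easy enough; here's a rough sketch using Python's standard-library `html.parser` (names are mine, not from any spidering library). The real problem is everything this can't see, i.e. links that only exist after JavaScript runs:

```python
from html.parser import HTMLParser

class LinkImageExtractor(HTMLParser):
    """Collect anchor hrefs and image srcs from static HTML.

    Only sees markup as delivered; links generated by JavaScript
    are invisible to it, which is exactly the gap I'm asking about.
    """

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])

html = '<a href="/next">next</a><img src="logo.png">'
parser = LinkImageExtractor()
parser.feed(html)
print(parser.links)   # ['/next']
print(parser.images)  # ['logo.png']
```

So the static half is covered; what I need is the browser-engine half that can execute scripts safely.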
Help me out!
Thanks