09-07-2012, 09:04 AM   #1
slinx
Member
 
Registered: Apr 2008
Location: Cleveland, Ohio
Distribution: SuSE, CentOS, Fedora, Ubuntu
Posts: 106

Rep: Reputation: 23
Downloading dynamically built web pages


Hello, I am trying to use wget to download a web page that is dynamically generated.

My goal is to search through the table of generated links to product items and examine each link to see whether it is correctly formed.

My problem is that wget does not download the actual HTML displayed for this table of items: when I download the page with wget, none of the links appear in the output. I'm not even sure where the item links come from, although they are supposed to be generated by something called Celebros (http://www.celebros.com/).
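
To make it concrete, here is roughly the check I am trying to run (the URL is a placeholder standing in for the real site):

Code:
# Sketch of the intended check: fetch the page, collect every <a href>,
# and flag links that are not well formed. The URL is a placeholder.
from urllib.request import urlopen
from urllib.parse import urlparse, urljoin
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

base = "http://example.com/products"
html = urlopen(base).read().decode("utf-8")
collector = LinkCollector()
collector.feed(html)
for link in collector.links:
    absolute = urljoin(base, link)      # resolve relative links first
    parts = urlparse(absolute)
    if not parts.scheme or not parts.netloc:
        print("malformed link:", link)

# The trouble: collector.links comes back empty, because the table of
# items is not present in the HTML that wget (or urlopen) receives.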

How can I "scrape" the page as it is rendered in a browser?

Thank you for your help.
 
09-07-2012, 09:26 AM   #2
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
If you open the page with a conventional browser like Mozilla or Chrome, you can use it to show you the page source. That should give you a view of how the links and other elements are rendered in the browser. It is hard to know why wget isn't working, but my first guess is that the 'links' are actually implemented in JavaScript plus something like AJAX. It is reasonable to imagine that the site was constructed this way to defeat scraping.
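
One quick way to test that guess (a rough sketch; the URL is a placeholder) is to fetch the raw HTML the same way wget does and count what is actually in it:

Code:
# Fetch the raw HTML as wget would and count what is in it. If anchors
# are scarce while <script> tags (e.g. the Celebros loader) are present,
# the links are built client-side and wget alone will never see them.
from urllib.request import urlopen

raw = urlopen("http://example.com/products").read().decode("utf-8", "replace")
print("anchor tags in raw HTML: ", raw.count("<a "))
print("script tags in raw HTML: ", raw.count("<script"))
print("mentions of 'celebros':  ", raw.lower().count("celebros"))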

--- rod.
 
09-07-2012, 01:26 PM   #3
slinx
Member
 
Registered: Apr 2008
Location: Cleveland, Ohio
Distribution: SuSE, CentOS, Fedora, Ubuntu
Posts: 106

Original Poster
Rep: Reputation: 23
Yes, thanks, I do know how to do that, but so far I have only been able to find where the script that loads the actual search content is placed. I am going to look into Scrapy or Watir to do what I need; I'll look at linklint too.
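
For anyone finding this later: a tool that drives a real browser is what makes this work, since the links only exist after the JavaScript has run. Here is a rough sketch of that approach in Python with Selenium (an alternative to Watir; the URL, driver choice, and wait time are all assumptions):

Code:
# Render the page in a real browser so the script-generated links
# exist, then collect and validate them. URL and wait are placeholders.
import time
from urllib.parse import urlparse
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()        # any JavaScript-capable driver works
try:
    driver.get("http://example.com/products")
    time.sleep(5)                   # crude wait for the search script to run
    for anchor in driver.find_elements(By.TAG_NAME, "a"):
        href = anchor.get_attribute("href")
        if not href:
            continue
        parts = urlparse(href)
        if not parts.scheme or not parts.netloc:
            print("malformed link:", href)
finally:
    driver.quit()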

Last edited by slinx; 09-07-2012 at 01:31 PM.
 
  



Tags
browser, html, wget


