I've heard anything is possible on Linux, how about this?

wh33t · 07-30-2015, 05:58 PM

Is there a way I can run a web browser, such as Firefox, from the command line, have it go to a specific web page, dump the HTML source that it has rendered to a text file, and then close the browser?

dugan · 07-30-2015, 06:03 PM

Don't use a graphical browser for that. Use a modern headless browser like SlimerJS or PhantomJS, or a text browser like links, elinks or lynx. If it's a static page, then just curl or wget the HTML file.

If you must use a graphical browser, then you can script that using Selenium.
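
A minimal sketch of the Selenium route in Python, assuming the selenium package (version 4 or later), Firefox and geckodriver are installed. The names dump_rendered_html, make_driver and wait_seconds are made up for illustration, and the fixed wait is only a crude stand-in for properly waiting until the page's scripts have finished.

```python
import time


def dump_rendered_html(url, out_path, make_driver=None, wait_seconds=2.0):
    """Load url in a browser, save the DOM as rendered after scripts ran.

    make_driver is a hypothetical hook: a zero-argument callable returning a
    Selenium-style driver. When omitted, a headless Firefox is started.
    """
    if make_driver is None:
        # Imported lazily so the module loads even without selenium installed.
        from selenium import webdriver
        from selenium.webdriver.firefox.options import Options

        def make_driver():
            options = Options()
            options.add_argument("-headless")  # no visible window
            return webdriver.Firefox(options=options)

    driver = make_driver()
    try:
        driver.get(url)
        # Assumption: a fixed pause is enough for the AJAX content to arrive.
        time.sleep(wait_seconds)
        html = driver.page_source  # serialized DOM, not the raw download
    finally:
        driver.quit()  # always close the browser

    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(html)
    return html
```

Calling dump_rendered_html("http://example.com/", "page.html") would then leave the rendered markup in page.html; the URL and file name are placeholders.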

rigor · 07-30-2015, 06:10 PM

hi Wh33t,

You can use commands such as wget or curl from an xterm or other so-called "terminal emulator" window to grab a web page's HTML source, etc.

Other than in some contexts, such as having a browser view "the source" of the web page, I would normally think of the HTML, CSS, etc. as the web page source, and of what a browser renders as a different thing: the rendering results from the browser's interpretation of the source.

So when you talk about "the html source that it has rendered" I'm not sure what you want to accomplish.

If you want a file of what was rendered, you can do a "screen grab" to an image file of what the browser displays after it renders the web page.

HTH.

wh33t · 07-30-2015, 06:23 PM

I need to scour various web pages of one of our suppliers so we can update 3000+ items in our database with new information, versus trying to update it manually. Unfortunately, our stupid suppliers won't give us access to their database and like to use fancy schmancy JavaScript to AJAX-load the piece of information we want on our products. When I do a curl, or a file_get_contents() from PHP, it doesn't process the JavaScript like a browser would. So I was thinking I'd script a command to launch a browser from the command line, dump the rendered source, cruise that file for the information I need, and then update our products.
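
The last step of that plan (going through the dumped file for the values) could look roughly like the sketch below, which uses only Python's standard library. The class name "price" and the function extract_by_class are hypothetical; the real markup of the supplier's pages is not known here.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested element inside a match
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:  # the matching element just closed
            self.results.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_by_class(html, css_class):
    """Return the text of each element whose class list contains css_class."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.results
```

Reading the dumped file and passing its contents to extract_by_class(contents, "price") would give a list of strings to write back to the database.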

mralk3 · 07-30-2015, 08:20 PM

It's called web scraping. Try searching that term. I have had a great experience scraping web sites with Python and the Scrapy library.

It's possible to do the same with bash, wget and grep from the command prompt as well. It's just not as pretty.

wh33t · 07-30-2015, 08:21 PM

I'm well aware of that, but I need to process JavaScript in the scraped content. I'll look into the Scrapy library.

Edit: Wow, that Scrapy library looks powerful. I'll dig into it more.

ardvark71 · 07-30-2015, 10:05 PM

Hi...

Just as a heads up, please be aware there are legal issues with doing this. Please see here and here.

Regards...

wh33t · 07-30-2015, 10:07 PM

I appreciate the post. What we are doing is perfectly legal. Our suppliers are OK with us doing it; they just ask us to wait 500 ms before each page hit so we don't overload their system. They also prefer we do it at night.
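
A rough sketch of that 500 ms pause in Python. The names fetch_all, fetch and sleep are made up for illustration; fetch stands for whatever actually downloads or renders one page, and this version pauses between hits rather than before the very first one.

```python
import time


def fetch_all(urls, fetch, delay_seconds=0.5, sleep=time.sleep):
    """Fetch each URL in turn, pausing delay_seconds between page hits.

    fetch is a caller-supplied function taking a URL and returning its
    content; sleep is injectable so the pause can be stubbed out in tests.
    """
    results = {}
    for index, url in enumerate(urls):
        if index:  # no pause needed before the first request
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

Running the script from a nightly cron job would cover the "at night" part without any extra code.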

ardvark71 · 07-30-2015, 10:34 PM

Cool, glad to hear.

Regards...

xuhdev · 07-30-2015, 11:40 PM

Is this what you want? http://doc.scrapy.org/en/latest/topics/firefox.html

wh33t · 07-30-2015, 11:44 PM

Something like that. But I want to automate the browser through a script. I don't want to actually have to physically see Firefox or any other web browser, for that matter. I think the text-based browsers are probably where it's at.
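
One way to script a browser without ever seeing a window, sketched in Python: Chromium-family browsers have a headless mode with a --dump-dom flag that prints the DOM after the page's scripts have run. This assumes such a browser is installed; the binary name "chromium" is a guess that differs between distributions, and build_dump_command and dump_dom are hypothetical helper names.

```python
import subprocess


def build_dump_command(url, browser="chromium"):
    """Build the argument list for a headless DOM dump of url."""
    return [browser, "--headless", "--disable-gpu", "--dump-dom", url]


def dump_dom(url, out_path, browser="chromium", timeout=60):
    """Run the headless browser and write the dumped DOM to out_path."""
    completed = subprocess.run(
        build_dump_command(url, browser),
        capture_output=True,  # the DOM is printed on standard output
        text=True,
        check=True,           # raise if the browser exits with an error
        timeout=timeout,
    )
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(completed.stdout)
    return completed.stdout
```

The browser exits on its own once the dump is printed, so nothing has to be closed by hand.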

mralk3 · 07-31-2015, 05:03 AM

@wh33t I realize you have permission to scrape this web site so please disregard what I am about to say. This is more for anyone else reading this thread.

Personally, I think it is silly that scraping has any legal constraints. I don't abuse it though, so I guess that is the difference. So please, do not abuse scrapy.

If a site you wish to scrape enforces limits and bans IPs for scraping, there are a number of code examples on the web for using Tor and distributed scraping with Python and Scrapy. An example of such a web site is google.com. Google does, however, have a search API that can be used for this too.

I guess though that it might be frowned upon to use Tor for web scraping, so I will not post my personal resources on the topic.

syg00 · 07-31-2015, 05:14 AM

The other side is that the OP's supplier shouldn't be so bloody anal about this; give them the data.
And if the supplier is happy with the trawling of the pages, what can be the (general) objection?

dugan · 07-31-2015, 09:19 AM

Did you look into SlimerJS and PhantomJS?

mralk3 · 07-31-2015, 09:29 AM

You might be able to find some examples of crawling sites with JavaScript content here. I haven't searched it, though, since I am not really sure what type of site you are trying to scrape.