LinuxQuestions.org
Linux - Software This forum is for Software issues.
Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.

Old 07-30-2015, 05:58 PM   #1
wh33t
Member
 
Registered: Oct 2003
Location: Canada
Posts: 922

Rep: Reputation: 61
I've heard anything is possible on Linux, how about this?


Is there a way I can run a web browser, such as Firefox, from the command line, have it go to a specific web page, dump the HTML source that it has rendered to a text file, and then close the browser?
 
Old 07-30-2015, 06:03 PM   #2
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,223

Rep: Reputation: 5320
Don't use a graphical browser for that. Use a modern headless browser like slimerjs or phantomjs, or a text browser like links, elinks or lynx. If it's a static page, then just curl or wget the HTML file.

If you must use a graphical browser, then you can script that using Selenium.
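A minimal sketch of the Selenium route, assuming the Python `selenium` package, a reasonably recent Firefox, and geckodriver are installed. The function names, the output path, and the fixed `settle_seconds` pause are illustrative assumptions, not part of any library. The pause is a crude way to give AJAX calls time to finish; Selenium's explicit waits are the more robust choice when you know which element to wait for.

```python
import time
from pathlib import Path


def make_headless_firefox():
    """Start Firefox without a visible window (needs selenium + geckodriver)."""
    # Imported lazily so the rest of the module loads without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


def dump_rendered_html(url, out_path, driver_factory=make_headless_firefox,
                       settle_seconds=2.0):
    """Load url, save the DOM as the browser currently sees it, close the browser."""
    driver = driver_factory()
    try:
        driver.get(url)
        # Assumption: a fixed pause is enough for the page's scripts to finish.
        time.sleep(settle_seconds)
        html = driver.page_source
    finally:
        # Always shut the browser down, even if loading the page failed.
        driver.quit()
    Path(out_path).write_text(html, encoding="utf-8")
    return html
```

Calling `dump_rendered_html("http://example.com/", "page.html")` would write the rendered markup to `page.html`. The `driver_factory` parameter is only there so a different browser (or a stub) can be swapped in.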

Last edited by dugan; 07-30-2015 at 06:05 PM.
 
1 members found this post helpful.
Old 07-30-2015, 06:10 PM   #3
rigor
Member
 
Registered: Sep 2003
Location: 19th moon ................. ................Planet Covid ................Another Galaxy;............. ................Not Yours
Posts: 705

Rep: Reputation: Disabled
hi Wh33t,


You can use commands such as wget or curl from an xterm or other so-called "terminal emulator" window to grab a web page's HTML source, etc.
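For comparison, here is roughly what wget or curl give you, sketched with only the Python standard library. The function name and the 30-second timeout are illustrative choices. Note that this returns the source exactly as the server sent it; no JavaScript is run.

```python
from urllib.request import urlopen


def fetch_source(url, timeout=30):
    """Return the raw page source as text, without running any JavaScript."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Writing the return value of `fetch_source(url)` to a file is about the same as `curl URL > page.html`.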

Other than in some contexts, such as having a browser view "the source" of the web page, I would normally think of the web page source (the HTML, CSS, etc.) and what a browser renders as two different things; the rendering results from the browser's interpretation of the source.

So when you talk about "the html source that it has rendered" I'm not sure what you want to accomplish.

If you want a file of what was rendered, you can do a "screen grab" to an image file of what the browser displays after it renders the web page.


HTH.
 
2 members found this post helpful.
Old 07-30-2015, 06:23 PM   #4
wh33t
Member
 
Registered: Oct 2003
Location: Canada
Posts: 922

Original Poster
Rep: Reputation: 61
Quote:
Originally Posted by rigor View Post
hi Wh33t,


You can use commands such as wget or curl from an xterm or other so-called "terminal emulator" window to grab a web page's HTML source, etc.

Other than in some contexts, such as having a browser view "the source" of the web page, I would normally think of the web page source (the HTML, CSS, etc.) and what a browser renders as two different things; the rendering results from the browser's interpretation of the source.

So when you talk about "the html source that it has rendered" I'm not sure what you want to accomplish.

If you want a file of what was rendered, you can do a "screen grab" to an image file of what the browser displays after it renders the web page.


HTH.
I need to scour various web pages of one of our suppliers so that we can update 3000+ items in our database with new information, rather than updating them manually. Unfortunately, our stupid suppliers won't give us access to their database, and they like to use fancy-schmancy JavaScript to AJAX-load the piece of information we want on our products. When I do a curl, or a file_get_contents() from PHP, it doesn't process the JavaScript like a browser would. So I was thinking I'd script a command to launch a browser from the command line, dump the rendered source, comb through that file for the information I need, and then update our products.
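The "comb through that file" step could look something like this sketch, which uses only the Python standard library. It assumes the value you want sits inside an element with a known id; the id `price` in the usage note is hypothetical, since the supplier's real markup isn't shown here.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collect the text inside the first element carrying a given id."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.depth = 0      # greater than 0 while inside the wanted element
        self.done = False   # set once the wanted element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.wanted_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(html, element_id):
    """Return the whitespace-normalised text of the element, or '' if absent."""
    parser = TextById(element_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

With the dumped page read into a string `html`, `extract_text(html, "price")` would return the text of that element, ready to be written to the database.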
 
Old 07-30-2015, 08:20 PM   #5
mralk3
Slackware Contributor
 
Registered: May 2015
Distribution: Slackware
Posts: 1,900

Rep: Reputation: 1050
It's called web scraping. Try searching that term. I have had a great experience scraping web sites with Python and the "scrapy" library.

It's possible to do the same with bash, wget and grep from the command prompt as well. It's just not as pretty.
 
1 members found this post helpful.
Old 07-30-2015, 08:21 PM   #6
wh33t
Member
 
Registered: Oct 2003
Location: Canada
Posts: 922

Original Poster
Rep: Reputation: 61
Quote:
Originally Posted by mralk3 View Post
It's called web scraping. Try searching that term. I have had a great experience scraping web sites with Python and the "scrapy" library.

It's possible to do the same with bash, wget and grep from the command prompt as well. It's just not as pretty.
I'm well aware of that, but I need to process the JavaScript in the scraped content. I'll look into the scrapy library.

Edit: Wow that scrapy library looks powerful. I'll dig into it more.

Last edited by wh33t; 07-30-2015 at 08:23 PM.
 
Old 07-30-2015, 10:05 PM   #7
ardvark71
LQ Veteran
 
Registered: Feb 2015
Location: USA
Distribution: Lubuntu 14.04, 22.04, Windows 8.1 and 10
Posts: 6,282
Blog Entries: 4

Rep: Reputation: 842
Hi...

Just as a heads up, please be aware there are legal issues with doing this. Please see here and here.

Regards...
 
1 members found this post helpful.
Old 07-30-2015, 10:07 PM   #8
wh33t
Member
 
Registered: Oct 2003
Location: Canada
Posts: 922

Original Poster
Rep: Reputation: 61
Quote:
Originally Posted by ardvark71 View Post
Hi...

Just as a heads up, please be aware there are legal issues with doing this. Please see here and here.

Regards...
I appreciate the post. What we are doing is perfectly legal. Our suppliers are OK with us doing it; they just ask that we wait 500 ms before each page hit so as not to overload their system, and they prefer that we do it at night.
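The 500 ms rule is easy to build into whatever does the fetching. A minimal sketch, assuming `fetch` is any function that takes a URL and returns the page; the names here are illustrative:

```python
import time


def crawl_politely(urls, fetch, delay_seconds=0.5, sleep=time.sleep):
    """Yield (url, result) pairs, pausing before every request after the first."""
    for index, url in enumerate(urls):
        if index:
            # Wait between page hits so the supplier's server is not overloaded.
            sleep(delay_seconds)
        yield url, fetch(url)
```

The `sleep` parameter exists only so the pause can be replaced in a dry run. Running at night is best left to the scheduler, for example a cron entry.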
 
Old 07-30-2015, 10:34 PM   #9
ardvark71
LQ Veteran
 
Registered: Feb 2015
Location: USA
Distribution: Lubuntu 14.04, 22.04, Windows 8.1 and 10
Posts: 6,282
Blog Entries: 4

Rep: Reputation: 842
Quote:
Originally Posted by wh33t View Post
I appreciate the post. What we are doing is perfectly legal. Our suppliers are OK with us doing it; they just ask that we wait 500 ms before each page hit so as not to overload their system, and they prefer that we do it at night.
Cool, glad to hear.

Regards...
 
1 members found this post helpful.
Old 07-30-2015, 11:40 PM   #10
xuhdev
LQ Newbie
 
Registered: Jun 2015
Distribution: CentOS,Debian,Ubuntu
Posts: 19

Rep: Reputation: Disabled
Is this what you want? http://doc.scrapy.org/en/latest/topics/firefox.html
 
1 members found this post helpful.
Old 07-30-2015, 11:44 PM   #11
wh33t
Member
 
Registered: Oct 2003
Location: Canada
Posts: 922

Original Poster
Rep: Reputation: 61
Quote:
Originally Posted by xuhdev View Post
Something like that. But I want to automate the browser through a script. I don't want to actually have to physically see Firefox or any other web browser for that matter. I think the text based browsers are probably where it's at.
 
Old 07-31-2015, 05:03 AM   #12
mralk3
Slackware Contributor
 
Registered: May 2015
Distribution: Slackware
Posts: 1,900

Rep: Reputation: 1050
@wh33t I realize you have permission to scrape this web site so please disregard what I am about to say. This is more for anyone else reading this thread.

Personally, I think it is silly that scraping has any legal constraints. I don't abuse it though, so I guess that is the difference. So please, do not abuse scrapy.

If a target you wish to scrape has limits and does IP bans against scraping, there are a number of code examples on the web of using Tor plus distributed scraping with Python and scrapy. An example of such a web site is google.com. Google does, however, have a search API that can be used for this too.

I guess though that it might be frowned upon to use Tor for web scraping, so I will not post my personal resources on the topic.
 
1 members found this post helpful.
Old 07-31-2015, 05:14 AM   #13
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
The other side is that the OP's supplier shouldn't be so bloody anal about this; give them the data.
And if the supplier is happy with the trawling of the pages, what can be the (general) objection?
 
1 members found this post helpful.
Old 07-31-2015, 09:19 AM   #14
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,223

Rep: Reputation: 5320
Did you look into slimerjs and phantomjs?
 
1 members found this post helpful.
Old 07-31-2015, 09:29 AM   #15
mralk3
Slackware Contributor
 
Registered: May 2015
Distribution: Slackware
Posts: 1,900

Rep: Reputation: 1050
You might be able to find some examples of crawling sites with JavaScript content here. I haven't searched it though, since I am not really sure what type of site you are trying to scrape.
 
1 members found this post helpful.
  


All times are GMT -5. The time now is 04:40 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration