I've heard anything is possible on Linux, how about this?
Is there a way I can run a web browser, such as Firefox, from the command line, have it go to a specific web page, dump the HTML source it has rendered to a text file, and then close the browser?
Don't use a graphical browser for that. Use a modern headless browser like SlimerJS or PhantomJS, or a text browser like links, elinks, or lynx. If it's a static page, then just curl or wget the HTML file.
If you must use a graphical browser, then you can script it using Selenium.
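For instance, here's a minimal sketch in Python using Selenium with headless Firefox. It assumes the selenium Python package and geckodriver are installed; the URL and output filename are just placeholders.
Code:
# Dump the rendered HTML of a page using headless Firefox via Selenium.
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument("--headless")  # run Firefox without opening a window

driver = webdriver.Firefox(options=options)
try:
    driver.get("https://example.com/some/page")  # placeholder URL
    # page_source is the DOM as currently rendered, including
    # anything JavaScript has inserted.
    with open("page.html", "w") as f:
        f.write(driver.page_source)
finally:
    driver.quit()  # close the browser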
You can use commands such as wget or curl from an xterm or other so-called "terminal emulator" window to grab a web page's HTML source, etc.
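If the page is static, a plain HTTP fetch gets you the same thing wget or curl would. Here's a minimal Python sketch of that idea (the URL and filename are placeholders):
Code:
# Fetch the raw, unrendered HTML of a page -- the same thing wget/curl return.
import urllib.request

with urllib.request.urlopen("https://example.com/some/page") as resp:
    html = resp.read().decode("utf-8", errors="replace")

with open("page.html", "w") as f:
    f.write(html)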
Other than in some context, such as having a browser view "the source" of a web page, I would normally think of the HTML, CSS, etc. (the web page source) and what a browser renders as two different things, the rendering being the result of the browser's interpretation of the source.
So when you talk about "the html source that it has rendered" I'm not sure what you want to accomplish.
If you want a file of what was rendered, you can do a "screen grab" to an image file of what the browser displays after it renders the web page.
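If an image of the rendered page is what you want, a headless browser can save one directly; for example, Selenium can take the screenshot for you (same assumptions as the sketch above, placeholder URL):
Code:
# Save a PNG of the page as rendered, using headless Firefox via Selenium.
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
try:
    driver.get("https://example.com/some/page")  # placeholder URL
    driver.save_screenshot("page.png")  # image of the page as rendered
finally:
    driver.quit()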
HTH.
I need to scour various web pages of one of our suppliers so we can update 3000+ items in our database with new information, instead of updating them manually. Unfortunately, our stupid suppliers won't give us access to their database and like to use fancy-schmancy JavaScript to AJAX-load the piece of information we want on our products. When I do a curl, or a file_get_contents() from PHP, it doesn't process the JavaScript like a browser would. So I was thinking I'd script a command to launch a browser from the command line, dump the rendered source, then cruise that file for the information I need, and then update our products.
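Something like the sketch below is what I'm imagining: load the page, wait for the AJAX-inserted element to appear, then read it out. The URL and the CSS selector here are just hypothetical placeholders; I'd substitute whatever marks the data on the supplier's pages.
Code:
# Wait for AJAX-loaded content, then pull one field out of the rendered DOM.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
try:
    driver.get("https://supplier.example.com/item/12345")  # placeholder URL
    # Block until JavaScript has inserted the element (10-second timeout).
    elem = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".product-spec"))
    )
    print(elem.text)  # the piece of information to write back to the database
finally:
    driver.quit()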
Just as a heads up, please be aware there are legal issues with doing this. Please see here and here.
Regards...
I appreciate the post. What we're doing is perfectly legal. Our suppliers are OK with us doing it; they just ask that we wait 500 ms between page hits so we don't overload their system, and they prefer we do it at night.
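In the script I have in mind, honoring that would just be a sleep in the crawl loop; here's a rough sketch, with a hypothetical item list and URL pattern:
Code:
# Polite crawl: 500 ms between page hits, run overnight via cron or similar.
import time
import urllib.request

item_ids = ["12345", "12346", "12347"]  # in practice, read from the database

for item_id in item_ids:
    url = "https://supplier.example.com/item/%s" % item_id  # placeholder pattern
    # For JS-rendered pages, swap this fetch for the Selenium approach above.
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # ... parse html and update the product record here ...
    time.sleep(0.5)  # wait 500 ms before the next hit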
Something like that. But I want to automate the browser through a script; I don't want to actually have to see Firefox, or any other web browser for that matter. I think the text-based browsers are probably where it's at.
@wh33t I realize you have permission to scrape this web site so please disregard what I am about to say. This is more for anyone else reading this thread.
Personally, I think it is silly that scraping has any legal constraints. I don't abuse it, though, so I guess that is the difference. So please, do not abuse Scrapy.
If a target you wish to scrape imposes limits and IP-bans scrapers, there are a number of code examples on the web using Tor plus distributed scraping with Python and Scrapy. An example of such a website is google.com; Google does, however, have a search API that can be used for this instead.
I guess, though, that it might be frowned upon to use Tor for web scraping, so I will not post my personal resources on the topic.
The other side of it is that the OP's supplier shouldn't be so bloody anal about this; they should just give them the data.
And if the supplier is happy with the trawling of the pages, what can the (general) objection be?
You might be able to find some examples of crawling sites with JavaScript content here. I haven't searched it, though, since I am not really sure what type of site you are trying to scrape.