[SOLVED] printing the text from a web page without printing the graphics
A "good" website to me will have some button or mode that allows one to print only the text if desired, not wasting ink/toner on printing the graphics. Unfortunately, any number of professionally designed sites (probably the majority, even) don't. I don't suppose Firefox (or Pale Moon, in my case) has its own option to ignore any graphics when printing? I'm looking but haven't seen it.
Of course I know the simplest way to print only the text: highlight it all, then copy and paste it into a word processor. I just had to do that for an article I wanted a hard copy of. And it's not as though toner is expensive anymore, I admit. This is one of those things I ask mainly on principle.
You know, I can guess the problem here: when these websites don't easily facilitate printing their text, it's most likely because the sites are optimized for being read on one's phone, where printing is impossible. (Doesn't that sound absurd to a non-millennial? Reading on a phone? Like listening to a piece of paper?)
I thought there was a built-in function in Firefox to have a user-specified stylesheet override. Maybe there's not, but I would look for that first.
Otherwise there are a number of plug-ins or add-ons or whatever they are called that allow you to do CSS overrides selectively.
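For what it's worth, Firefox (and Pale Moon, which kept the older mechanism) will read a user-supplied userContent.css from a chrome directory inside your profile; recent Firefox also requires flipping toolkit.legacyUserProfileCustomizations.stylesheets to true in about:config. A minimal print-only override that hides images might look like this (a sketch; the selectors may need tuning for sites that sneak graphics in other ways):

```css
/* ~/.mozilla/firefox/<profile>/chrome/userContent.css */
@media print {
    /* hide inline graphics when printing, leave screen display alone */
    img, svg, canvas, video {
        display: none !important;
    }
    /* also suppress CSS background images */
    * {
        background-image: none !important;
    }
}
```

Since the rules sit inside @media print, normal browsing is unaffected; only the print rendering drops the graphics.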
The heart of the problem is that competency in web design is becoming as rare as hen's teeth. There are some good people out there still, but fewer are working and even fewer are teaching. Even fewer are coming up through the ranks. So it's likely a terminal-stage situation we are seeing, because the very few who can actually do web design have moved on or even retired. Certainly they're no longer in positions to deal with the politics necessitated by the boss' cousin or college buddy's son's assertions of skill in web design. Look, even banks have third-party objects slowing down their pages, and that includes both CSS and JS.
<grumble> If you contact the web site in question, it would be interesting to know what they say, if they respond. However, most have a catalog of excuses handy. I just had to deal with yet another one that became glacially slow and bandwidth-hungry after a site 'upgrade'. Fine. However, what is not fine is that they recently inserted an Adobe Flash dependency between visitors and checkout/payment for services. I bet they're wondering why sales have all but stopped. </grumble>
A quick and easy way is to use a text-based browser such as links or w3m, which allow you to dump a web page as text.
Another way is to bring up the developer tools (F12 or Ctrl+Shift+I in Firefox) and then edit or delete some of the HTML, CSS, or JavaScript in there. I do it routinely.
If you want a lighter page that is designed for an iPhone, and the web server serves up pages that way, then report yourself as one, either via your browser's user agent or your script's user agent.
iPhone 10
Code:
agent="Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Mobile/14A403 Safari/602.1"
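If you fetch pages from a script instead of with curl, the same string can be attached to a stdlib urllib request. A minimal sketch (example.com is a placeholder URL, and the actual fetch is left commented out):

```python
from urllib.request import Request, urlopen

# The iPhone user-agent string from above
agent = ("Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like Mac OS X) "
         "AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 "
         "Mobile/14A403 Safari/602.1")

# example.com is a placeholder; substitute the real site
req = Request("https://example.com/", headers={"User-Agent": agent})
print(req.get_header("User-agent"))  # confirm the header is set

# html = urlopen(req).read().decode("utf-8", "replace")  # the actual fetch
```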
Quote:
not wasting ink/toner on printing the graphics.
Then get the page with images turned off, save it to file that way. Print the file.
Quote:
has its own option to ignore any graphics when printing?
Sorry, I haven't gone that route for years now. I did not like being browser-dependent for such tasks.
If all you want is the text, and you don't care about scripts being run:
Code:
curl -A "$agent" <url> -o MyFile.html
Then print MyFile.html. You may want to turn it into a .ps, .pdf, or .txt first.
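For the .txt conversion, if you'd rather not depend on html2text being installed, a rough version can be done with Python's stdlib HTML parser. A minimal sketch (it keeps visible text and drops script/style contents; real pages will need more polish):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    p = TextExtractor()
    p.feed(html)
    return "\n".join(p.chunks)

print(html_to_text("<html><style>p{}</style><p>Hello <b>world</b></p></html>"))
# prints:
# Hello
# world
```

Usage would be `html_to_text(open('MyFile.html').read())`, then write the result out and print it.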
If you want the page to look right, but without images, then either turn images off in the browser before you load the page, or use a browser engine in a script to get the page without images.
Example
Code:
#! /usr/bin/env python
# Get source with scripts run, using Python3/PyQt5/qt5-webengine
# Usage:
#   script.py <url> <local filename>
# or run script.py and answer the prompts
import sys
from PyQt5.QtWebEngineWidgets import (QWebEnginePage,
    QWebEngineProfile, QWebEngineSettings)
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl

# iPhone 6 Safari user agent
a = ('Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_4 like Mac OS X)'
    ' AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0'
    ' Mobile/10B350 Safari/8536.25')

class Source(QWebEnginePage):
    def __init__(self, url, _file):
        self.app = QApplication([])
        QWebEngineProfile.defaultProfile().setHttpUserAgent(a)  # set UA here
        QWebEnginePage.__init__(self)
        self.settings().setAttribute(
            QWebEngineSettings.AutoLoadImages, False)  # images off
        self._file = _file
        self.load(QUrl(url))
        self.loadFinished.connect(self.on_load_finished)
        self.app.exec_()

    def on_load_finished(self, ok):
        self.toHtml(self.write_it)  # toHtml is asynchronous; pass a callback

    def write_it(self, data):
        with open(self._file, 'w') as f:
            f.write(data)
        print('\nFinished\nFile saved to ' + self._file)
        self.app.quit()

def main():
    # Take arguments or prompt for input
    if len(sys.argv) > 2:
        url = sys.argv[1]
        _file = sys.argv[2]
    else:
        url = input('Enter/Paste url for source: ')
        _file = input('Enter output file name: ')
    Source(url, _file)

if __name__ == '__main__':
    main()
You could also use PhantomJS, Node.js, BeautifulSoup, etc. I'm using WebEngine for my scripts. It works just like a browser, of course.
Lynx will get a text format of a page
Code:
lynx -dump url > out.txt
curl will too, with something like
Code:
curl url | html2text > out.txt
In other words, two steps: get the page the way you want, save it to a file, then print the file. Not dependent on any browser.
Quote:
The heart of the problem is that competency in web design is becoming as rare as hen's teeth. There are some good people out there still, but fewer are working and even fewer are teaching. Even fewer are coming up through the ranks. So it's likely a terminal-stage situation we are seeing, because the very few who can actually do web design have moved on or even retired. Certainly they're no longer in positions to deal with the politics necessitated by the boss' cousin or college buddy's son's assertions of skill in web design.
I had no idea. Why is this? Sorry, I haven't figured it out from what you said--why don't people want to work in web design anymore?
Quote:
I had no idea. Why is this? Sorry, I haven't figured it out from what you said--why don't people want to work in web design anymore?
I'm not sure why that is. There are probably more people today claiming to work in web design than ever before but the end product shows very clearly that neither the knowledge nor the skill is there. It seems that every other week a commercial site or two that I need or someone I know needs has fallen to their inept fiddlings. One site in particular that caused trouble to several different people I know went through two very major web site redesigns very recently and each time the redesign made it impossible for more and more potential customers to buy their services. It's like they're trying to go out of business.
I doubt there is a single group that can be blamed specifically for the shift in the sites and the loss of knowledge from the general population. However, there are several groups which have shown great effort in diminishing and disparaging knowledge, especially when it comes to ICT. There is a strong current of argumentum ad novitatem pervading the computing industry, especially the web sector, rather than an emphasis on finding what works or even on usability design. A lot of governments do gain surveillance and control capabilities by further centralizing the WWW, as does Google, which would gain leverage to force people into its centralized AMP hosting as sites get even more bloated. However, bloat is only one factor, and I'm mainly railing against the complete lack of usability and even basic functionality such as ordering or payment.
Anyway, the statements of Vint Cerf may seem quaint to some, and aimed at the net overall and not necessarily at just the WWW by itself, but to put it another way all of the market is more money than some of the market:
Yeah. It's fairly new, 2002, but it was something that was talked about a lot online for ages and ages earlier and becoming more and more critical as time passed. I guess it could be seen coming to a head back then and needed to be brought up in an RFC. Again it's weird. Things used to be about growing a market or maximizing reach with the least effort and expense, but now it has turned on its head and is about using outrageous amounts of effort and costs to artificially narrow the potential market to a fixed subset of people. Like with many of the harmful fads on the net, doing it right would have been faster, cheaper, easier and would have produced higher return on investment through expanded market reach.
Thanks. That's a useful discussion of the problem. Bloat (and third-party objects) has only gotten worse since that post was written. It is right on target:
"Why not just serve regular HTML without stuffing it full of useless crap? The question is left unanswered."
I suppose that although Google could use its weight to reward lean pages and punish bloat, they gain if they can increase the bloat until everyone seeks refuge in AMP hosted/cached on Google's own servers.
This thread has jogged my memory: I recall one audit report from about 10 years ago where one country's audit office investigated where (IIRC) 25M EUR had gone. The money had been earmarked for "developing" a slew of web sites. The answer in the report was that it was all spent on web designers and produced no visible results. It was convenient for me at the time to follow up very superficially on the report and I found that, in the geographic area affected, it looked to me like there were few if any skilled web teams. And, I'll go way out on a limb, it sure looked like, at the time, there were not any in educational positions any more to train up skilled teams even if there might have once been some. Thus a downward cycle had started.
If someone deploys BS, it is because they have learned BS and learned that it is ok to deploy BS, and if they have learned to use and endorse BS, someone certainly was involved in teaching them that BS. Given what I've seen since, I'm fairly sure that was and remains the case -- in more than one country.
Some sites will provide the means to show a printable version of the page. Here at LQ you will find that under 'Thread Tools' as 'Show Printable Version'. You may need to search the other site(s) for this option, but I know some provide this service.
The conference summary of the talk has a link to the video which, on top of the good content, turned out to be surprisingly well delivered. Here is the link to that, too, if some would rather listen than read:
Some years ago the W3C used to have some pull with developers, but somehow it has given up. Google, Amazon, and Facebook seem to call the shots now. If any two of them were to decide on anything together, they'd basically cause the decision to become a de facto standard simply through their massiveness.
The web if done correctly is rather device independent.
It is? But besides computer technology being obsolete the day you buy it, due to the pace of innovation (according to non-expert popular wisdom), 2002 was the Web 1.0 era. How can you call it recent, then?
Quote:
Originally Posted by Turbocapitalist
Google, Amazon, and Facebook seem to call the shots now. If any two of them were to decide on anything together, they'd basically cause the decision to become a de facto standard simply through their massiveness.
Wouldn't that be because the foundation of Web 2.0 is the full monetization of the internet?
Last edited by newbiesforever; 03-13-2018 at 08:46 AM.
Quote:
Wouldn't that be because the foundation of Web 2.0 is the full monetization of the internet?
No. But that would be a different matter. Monetization works, but as the late Pieter Hintjens pointed out, happy customers are usually profitable customers. What's common is to squeeze too hard, and that loses money quickly after the first round. The same goes for building in inefficiencies like those shown in the presentation (text or video) above, in the comparison of Pinboard vs "ACME".
Google's moves make sense, especially in regard to AMP, if they plan to capture part of the net. We've been through that before. Closed nets just don't grow. See the history of CompuServe, Prodigy, Delphi, MSN (the original version) and others. The WWW grew, and thus the Internet beneath it, because it was open: just follow the RFCs and you are in. And what are RFCs but contracts?