LinuxQuestions.org
Old 06-28-2020, 06:36 PM   #1
Skaperen
Senior Member

Registered: May 2009
Location: WV, USA
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,106
Blog Entries: 20
getting a long text stream of a long web page


i have a web page that contains lots of HTML and binary bytes, but displays in Firefox as a plain old text file. if i access it with lynx it also comes out fine, but it's entirely paginated (i have to press space to get each page). curl gives me the raw HTML and binary characters.

what i really want is a continuous text stream of the whole thing. is there a tool that can do that? i already have the HTML downloaded, so something that only works from a local file would be fine.
 
Old 06-28-2020, 07:54 PM   #2
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 9,696

"lynx -dump"?

Some other options are listed here:

Convert from HTML to formatted plain text using Lynx

Last edited by dugan; 06-28-2020 at 08:00 PM.
 
1 member found this post helpful.
Old 06-29-2020, 05:23 AM   #3
Skaperen
Senior Member
Original Poster
that looks good. i'll have to try it later today.
 
Old 06-29-2020, 06:23 PM   #4
Skaperen
Senior Member
Original Poster
i'm getting some extra blank lines in some places, though not all. the processing i plan to do will ignore them anyway, so lynx -dump works fine for this.
 
Old 07-01-2020, 01:48 PM   #5
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 1,151

Quote:
Originally Posted by Skaperen View Post
i'm getting some extra blank lines in some places
The old html2text (still present on Debian-based distros) has an option --style compat. The new python3-html2text (known on Debian-based distros as html2markdown.py3) has an option --single-line-break.

Runs of blank lines can be squeezed with cat -s.
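A quick sketch of the squeeze, using only coreutils (with `awk 'NF'` shown as an alternative that drops blank lines entirely instead of squeezing them):

```shell
# cat -s collapses each run of blank lines down to a single blank line
printf 'one\n\n\n\ntwo\n' | cat -s

# awk 'NF' prints only lines with at least one field, i.e. it removes
# blank (and whitespace-only) lines altogether
printf 'one\n\n\n\ntwo\n' | awk 'NF'
```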

Last edited by shruggy; 07-01-2020 at 01:51 PM.
 
1 member found this post helpful.
Old 07-01-2020, 02:50 PM   #6
teckk
Senior Member
 
Registered: Oct 2004
Distribution: FreeBSD Arch
Posts: 2,943

Code:
url="https://www.linuxquestions.org/questions/linux-general-1/getting-a-long-text-stream-of-a-long-web-page-4175677801/"

curl "$url" -o LQTest.html

html2text --ignore-links LQTest.html
Or something like:

LQTest1.py
Code:
#!/usr/bin/python3

from urllib import request
from html2text import HTML2Text

#User agent for requests
agent = ('Mozilla/5.0 (Windows NT 10.1; x86_64; rv:76.0)'
        ' Gecko/20100101 Firefox/76.0')
        
#Make request header
user_agent = {'User-Agent': agent,
            'Accept': 'text/html,application/xhtml+xml,'
            'application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'}
            
url = 'file:///path/to/LQTest.html'
#url = 'http://somewhere.com'

#Load the page with urllib
req = request.Request(url, data=None, headers=user_agent)
page = request.urlopen(req)

#Read it with html2text
html = page.read()
noLinks = HTML2Text()
noLinks.ignore_links = True
txt = noLinks.handle(html.decode('utf-8'))

print(txt)
Or

LQTest2.py
Code:
#!/usr/bin/python3

from urllib import request
from bs4 import BeautifulSoup

#User agent for requests
agent = ('Mozilla/5.0 (Windows NT 10.1; x86_64; rv:76.0)'
        ' Gecko/20100101 Firefox/76.0')
        
#Make request header
user_agent = {'User-Agent': agent,
            'Accept': 'text/html,application/xhtml+xml,'
            'application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'}
            
url = 'file:///path/to/LQTest.html'
#url = 'http://somewhere.com'

#Load the page with urllib
req = request.Request(url, data=None, headers=user_agent)
page = request.urlopen(req)

#Get text of page with soup
soup = BeautifulSoup(page, features='lxml')
#Kill all script and style elements
for s in soup(["script", "style"]):
    s.extract()
txt = '\n'.join(soup.get_text().splitlines())

print(txt)
Quote:
i'm getting some extra blank lines in some places
You can remove them.
 
Old 07-02-2020, 10:15 AM   #7
Skaperen
Senior Member
Original Poster
right. i'm not worried about the 300,000 blank lines. i will be parsing this text in my own script which can ignore blank lines.
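The skip-the-blank-lines part of such a script can be sketched in shell (`page.txt` is a hypothetical stand-in for the dumped text):

```shell
# sample input standing in for the lynx -dump output
printf 'alpha\n\n\nbeta\n\ngamma\n' > page.txt

# read line by line, skipping blanks before any real processing
while IFS= read -r line; do
    [ -z "$line" ] && continue
    # ... real parsing would go here; just echo for the sketch ...
    printf '%s\n' "$line"
done < page.txt
```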
 
  

