LinuxQuestions.org - [SOLVED] getting a long text stream of a long web page

i have a web page with lots of HTML and binary bytes in it that displays in Firefox like a plain old text file. if i access it with lynx it also comes out fine, but it is entirely paginated (i have to press space to get each page). curl gives me the raw HTML and binary characters.

what i really want is to get a continuous text stream of the whole thing. is there a tool that can do that? i have the HTML downloaded, so something that only works from a local file could be fine.

"lynx -dump"?

Some other options are listed here:

Convert from HTML to formatted plain text using Lynx

that looks good. i'll have to try it later today.

i'm getting some extra blank lines in some places, though not all. the processing i planned is going to ignore them anyway, so lynx -dump works fine for this.
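For anyone who wants to drive lynx -dump from a script instead of the shell, here is a rough Python sketch. The helper names (lynx_dump_command, dump_text) are made up for illustration. It assumes lynx is on the PATH and uses only the -dump and -nolist options.

```python
import subprocess


def lynx_dump_command(source, hide_links=True):
    """Build the lynx command line for a local file or URL (hypothetical helper)."""
    cmd = ["lynx", "-dump"]
    if hide_links:
        # -nolist leaves out the list of link targets lynx otherwise appends
        cmd.append("-nolist")
    cmd.append(source)
    return cmd


def dump_text(source, hide_links=True):
    """Run lynx and return the rendered page as one string (needs lynx installed)."""
    result = subprocess.run(
        lynx_dump_command(source, hide_links),
        capture_output=True,
        check=True,
    )
    # The page may contain stray binary bytes, so decode leniently.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling dump_text("LQTest.html") on the downloaded file should give the whole page as one continuous string, with no paging.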

Quote:

Originally Posted by Skaperen (Post 6139454)

i'm getting some extra blank lines in some places

The old html2text (still present on Debian-based distros) has an option -style compact. The new python3-html2text (known on Debian-based distros as html2markdown.py3) has an option --single-line-break.

Runs of blank lines can be squeezed with cat -s.
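If the text is being post-processed in Python anyway, the same squeezing can be sketched in a few lines. The function name squeeze_blank_lines is just an illustration. Unlike cat -s, this version also treats whitespace-only lines as blank, which is an assumption about what is wanted here.

```python
def squeeze_blank_lines(lines):
    """Yield lines, collapsing each run of blank lines into a single blank line."""
    previous_blank = False
    for line in lines:
        blank = line.strip() == ""
        if blank and previous_blank:
            continue  # skip the second and later blanks in a run
        previous_blank = blank
        yield line
```

It takes any iterable of lines, so an open file or sys.stdin can be passed in directly and the result written out with sys.stdout.writelines().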

Code:

url="https://www.linuxquestions.org/questions/linux-general-1/getting-a-long-text-stream-of-a-long-web-page-4175677801/"

curl "$url" -o LQTest.html

html2text --ignore-links LQTest.html

Or something like:

LQTest1.py

Code:

#!/usr/bin/env python3

from urllib import request
from html2text import HTML2Text

# User agent for requests
agent = ('Mozilla/5.0 (Windows NT 10.1; x86_64; rv:76.0)'
         ' Gecko/20100101 Firefox/76.0')

# Make request header
user_agent = {'User-Agent': agent,
              'Accept': 'text/html,application/xhtml+xml,'
                        'application/xml;q=0.9,*/*;q=0.8',
              'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
              'Accept-Encoding': 'none',
              'Accept-Language': 'en-US,en;q=0.8',
              'Connection': 'keep-alive'}

url = 'file:///path/to/LQTest.html'
#url = 'http://somewhere.com'

# Load the page with urllib
req = request.Request(url, data=None, headers=user_agent)
page = request.urlopen(req)

# Read it with html2text, leaving out link targets
html = page.read()
noLinks = HTML2Text()
noLinks.ignore_links = True
# errors='replace' keeps stray binary bytes from aborting the decode
txt = noLinks.handle(html.decode('utf-8', errors='replace'))

print(txt)

Or

LQTest2.py

Code:

#!/usr/bin/env python3

from urllib import request
from bs4 import BeautifulSoup

# User agent for requests
agent = ('Mozilla/5.0 (Windows NT 10.1; x86_64; rv:76.0)'
         ' Gecko/20100101 Firefox/76.0')

# Make request header
user_agent = {'User-Agent': agent,
              'Accept': 'text/html,application/xhtml+xml,'
                        'application/xml;q=0.9,*/*;q=0.8',
              'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
              'Accept-Encoding': 'none',
              'Accept-Language': 'en-US,en;q=0.8',
              'Connection': 'keep-alive'}

url = 'file:///path/to/LQTest.html'
#url = 'http://somewhere.com'

# Load the page with urllib
req = request.Request(url, data=None, headers=user_agent)
page = request.urlopen(req)

# Get text of page with soup (the lxml parser must be installed)
soup = BeautifulSoup(page, features='lxml')

# Kill all script and style elements
for s in soup(["script", "style"]):
    s.extract()

# Normalize all line endings to '\n'
txt = '\n'.join(soup.get_text().splitlines())

print(txt)

Quote:

i'm getting some extra blank lines in some places

You can remove them.

right. i'm not worried about the 300,000 blank lines. i will be parsing this text in my own script which can ignore blank lines.