LinuxQuestions.org

Skaperen 06-28-2020 06:36 PM

getting a long text stream of a long web page
 
i have a web page that has lots of HTML and binary bytes in it, and Firefox displays it as a plain old text file. if i access it with lynx it also comes out fine, but it is entirely paginated (i have to press space to get each page). curl gives me the raw HTML and binary characters.

what i really want is to get a continuous text stream of the whole thing. is there a tool that can do that? i have the HTML downloaded, so something that only works from a local file could be fine.

dugan 06-28-2020 07:54 PM

"lynx -dump"?

Some other options are listed here:

Convert from HTML to formatted plain text using Lynx
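
If you end up calling it from a script rather than the shell, something like this might do. It is only a rough sketch: the function names and the default width are my own assumptions, and it relies on the lynx options -dump, -nolist (leave out the numbered link list) and -width.

```python
import shutil
import subprocess


def lynx_dump_command(source, width=80):
    """Build the argv list for a non-interactive lynx text dump.

    `source` can be a local HTML file or a URL. The function name and
    the default width are illustrative choices, not anything lynx requires.
    """
    # -dump writes the rendered page to stdout instead of paging it;
    # -nolist drops the numbered list of links lynx appends to a dump.
    return ["lynx", "-dump", "-nolist", f"-width={width}", source]


def dump_text(source, width=80):
    """Return the rendered text of `source` as one string, using lynx."""
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx is not installed or not on PATH")
    result = subprocess.run(lynx_dump_command(source, width),
                            capture_output=True, check=True)
    # Decode leniently, since the page in question contains stray binary bytes.
    return result.stdout.decode("utf-8", errors="replace")
```

dump_text("page.html") would then hand back the whole page as one continuous string with no paging.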

Skaperen 06-29-2020 05:23 AM

that looks good. i'll have to try it later today.

Skaperen 06-29-2020 06:23 PM

i'm getting some extra blank lines in some places, though not all. the processing i planned is going to ignore them anyway, so lynx -dump works fine for this.

shruggy 07-01-2020 01:48 PM

Quote:

Originally Posted by Skaperen (Post 6139454)
i'm getting some extra blank lines in some places

The old html2text (still present on Debian-based distros) has an option -style compact. The new python3-html2text (known on Debian-based distros as html2markdown.py3) has an option --single-line-break.

Runs of blank lines can be squeezed with cat -s.
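
If the text is already being handled in Python, the same squeezing can be sketched in a few lines. The function name squeeze_blank is made up for this example, and as far as I know it differs from cat -s in one respect: it also treats whitespace-only lines as blank.

```python
def squeeze_blank(lines):
    """Yield lines, collapsing each run of blank lines into a single one.

    Whitespace-only lines count as blank here, which is an assumption
    of this sketch rather than the exact behaviour of cat -s.
    """
    previous_blank = False
    for line in lines:
        blank = not line.strip()
        if blank and previous_blank:
            continue  # skip the 2nd, 3rd, ... blank line of a run
        previous_blank = blank
        yield line
```

It takes any iterable of lines, so sys.stdout.writelines(squeeze_blank(sys.stdin)) should work as a filter.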

teckk 07-01-2020 02:50 PM

Code:

url="https://www.linuxquestions.org/questions/linux-general-1/getting-a-long-text-stream-of-a-long-web-page-4175677801/"

curl "$url" -o LQTest.html

html2text --ignore-links LQTest.html

Or something like:

LQTest1.py
Code:

#!/usr/bin/env python3

from urllib import request
from html2text import HTML2Text

#User agent for requests
agent = ('Mozilla/5.0 (Windows NT 10.1; x86_64; rv:76.0)'
        ' Gecko/20100101 Firefox/76.0')
       
#Make request header
user_agent = {'User-Agent': agent,
            'Accept': 'text/html,application/xhtml+xml,'
            'application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'}
           
url = 'file:///path/to/LQTest.html'
#url = 'http://somewhere.com'

#Load the page with urllib
req = request.Request(url, data=None, headers=user_agent)
page = request.urlopen(req)

#Read it with html2text
html = page.read()
noLinks = HTML2Text()
noLinks.ignore_links = True
#Replace undecodable bytes instead of raising UnicodeDecodeError
txt = noLinks.handle(html.decode('utf-8', errors='replace'))

print(txt)

Or

LQTest2.py
Code:

#!/usr/bin/env python3

from urllib import request
from bs4 import BeautifulSoup

#User agent for requests
agent = ('Mozilla/5.0 (Windows NT 10.1; x86_64; rv:76.0)'
        ' Gecko/20100101 Firefox/76.0')
       
#Make request header
user_agent = {'User-Agent': agent,
            'Accept': 'text/html,application/xhtml+xml,'
            'application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'}
           
url = 'file:///path/to/LQTest.html'
#url = 'http://somewhere.com'

#Load the page with urllib
req = request.Request(url, data=None, headers=user_agent)
page = request.urlopen(req)

#Get text of page with soup
soup = BeautifulSoup(page, features='lxml')
#Kill all script and style elements
for s in soup(["script", "style"]):
    s.extract()
txt = '\n'.join(soup.get_text().splitlines())

print(txt)

Quote:

i'm getting some extra blank lines in some places

You can remove them.
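
For example, a small generator along these lines would drop them while the text is being read. Just a sketch: the name nonblank_lines is invented here, and stripping the line ending is a choice you may not want.

```python
def nonblank_lines(lines):
    """Yield each line that contains something other than whitespace.

    The trailing line ending is removed from the lines that are kept,
    which is a convenience assumed by this sketch.
    """
    for line in lines:
        text = line.rstrip("\r\n")
        if text.strip():
            yield text
```

Wrapping the text from either script as nonblank_lines(txt.splitlines()), or a file as nonblank_lines(open(path)), gives a loop that never sees an empty line.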

Skaperen 07-02-2020 10:15 AM

right. i'm not worried about the 300,000 blank lines. i will be parsing this text in my own script which can ignore blank lines.


All times are GMT -5.