getting a long text stream of a long web page
i have a web page with lots of HTML and binary bytes in it that displays in Firefox as a plain-ole text file. if i access it with lynx it also comes out fine, but it is entirely paginated (i have to press space to get each page). curl just gives me the raw HTML and binary characters.
what i really want is to get a continuous text stream of the whole thing. is there a tool that can do that? i have the HTML downloaded, so something that only works from a local file could be fine. |
"lynx -dump"?
Some other options are listed here: Convert from HTML to formatted plain text using Lynx |
that looks good. i'll have to try it later today.
|
i'm getting some extra blank lines in some places, though not all. the processing i planned to do is going to ignore them anyway, so lynx -dump works fine for this.
|
Quote:
Runs of blank lines can be squeezed with cat -s. |
Code:
url="https://www.linuxquestions.org/questions/linux-general-1/getting-a-long-text-stream-of-a-long-web-page-4175677801/"
LQTest1.py
Code:
#! /usr/bin/python
LQTest2.py
Code:
#! /usr/bin/python
|
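To illustrate the cat -s suggestion, here is a rough Python equivalent that collapses runs of blank lines into a single blank line (the file name "page.txt" is just an example for the saved lynx -dump output):
Code:
#! /usr/bin/python
# rough equivalent of "cat -s": collapse runs of blank lines in the
# lynx -dump output into a single blank line ("page.txt" is an example name)
import sys

last_was_blank = False
with open("page.txt") as infile:
    for line in infile:
        if line.strip():
            sys.stdout.write(line)
            last_was_blank = False
        elif not last_was_blank:
            sys.stdout.write("\n")
            last_was_blank = True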
right. i'm not worried about the 300,000 blank lines. i will be parsing this text in my own script which can ignore blank lines.
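for instance, a parsing loop along these lines can just skip them (a sketch only; the file name "page.txt" and the parsing step are placeholders):
Code:
#! /usr/bin/python
# sketch: skip blank lines while parsing the lynx -dump output
# ("page.txt" is a placeholder for the dumped text file)
with open("page.txt") as infile:
    for line in infile:
        if not line.strip():
            continue  # ignore blank lines
        # ... actual parsing of the line would go here ...
        print(line.rstrip())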
|