LinuxQuestions.org
Old 05-01-2019, 07:54 AM   #16
individual (Member; Registered: Jul 2018; Posts: 315)
Quote:
Originally Posted by gilesaj001 View Post
I never did this before. I use mainly gui except for commands that I use all the time on the server.
That would explain a lot. OK, go ahead and run rm -rf pup-master, as that directory contains the source code for the program, but not the compiled program itself.
Assuming you have a 64 bit system, download this version of it.
If you have a 32 bit system, download this one.
Unzip it and place it somewhere. I have a bin directory in my home directory that I place one-off programs in. If you decide to do that, add "$HOME/bin" to your PATH.
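A minimal sketch of that setup, assuming a bash login shell and that the unzipped binary ends up at ~/bin/pup (the file locations here are examples, not part of the original instructions):
Code:
# One-time setup: create ~/bin and make sure the binary is executable
mkdir -p "$HOME/bin"
chmod +x "$HOME/bin/pup"

# Make ~/bin searchable in future bash sessions
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Verify that the shell now finds it
command -v pup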
 
Old 05-01-2019, 07:58 AM   #17
teckk (LQ Guru; Registered: Oct 2004; Distribution: Arch; Posts: 5,140)
https://www.dropbox.com/s/gxy3vd7o3r...alert.txt?dl=0
I downloaded that source to test.html. You won't get the content you want from that page unless you run the scripts on it, so curl alone won't help; you'll need something that runs scripts. There is also a login that must be passed before you can get the text, so even if you filled out that form with curl, you still wouldn't get the content, because it is delivered by script.

I got the source with scripts run using Python/QtWebEngine. You could use BeautifulSoup, Selenium, Node.js, whatever you want.
Code:
#! /usr/bin/env python

#Get source with scripts run using Python3/PyQt5/qt5-webengine
#Usage:
#script.py <url> <local filename>
#or script.py and answer prompts

import sys
from PyQt5.QtWebEngineWidgets import (QWebEnginePage, 
                        QWebEngineProfile, QWebEngineView)
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl

agent = ('Mozilla/5.0 (Windows NT 10.0; WOW64; rv:65.0)'
        ' Gecko/20100101 Firefox/65.0')

class Source(QWebEnginePage):
    def __init__(self, url, _file):
        self.app = QApplication([])
        QWebEnginePage.__init__(self)
        
        self.agent = QWebEngineProfile(self)
        self.agent.defaultProfile().setHttpUserAgent(agent)
        
        self._file = _file
        self.load(QUrl(url))
        self.loadFinished.connect(self.on_load_finished)
        self.app.exec_()
        
    def on_load_finished(self):
        self.html = self.toHtml(self.write_it)

    def write_it(self, data):
        self.html = data
        with open (self._file, 'w') as f:
            f.write (self.html)
        print ('\nFinished\nFile saved to ' + (self._file))
        self.app.quit()

if __name__ == '__main__':
    #Open with arguments or prompt for input
    if len(sys.argv) > 2:
        url = (sys.argv[1])
        _file = (sys.argv[2])
    else:
        url = input('Enter/Paste url for source: ')
        _file = input('Enter output file name: ')
    Source(url, _file)
I opened that file with dillo, and could see all the info I wanted.
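A minimal usage sketch, assuming the script above is saved as get_source.py (the file names and URL are examples only):
Code:
# On a headless server, Qt may need: export QT_QPA_PLATFORM=offscreen

# Render the page with its scripts run and save the resulting HTML
python get_source.py "https://weather.gc.ca/city/pages/on-118_metric_e.html" test.html

# The saved file is plain HTML/text and can be inspected or parsed locally
less test.html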

You would do better to post your source to someplace like:
Code:
cat test.html | curl -F 'sprunge=<-' http://sprunge.us
http://sprunge.us/StUKF3
And that is what I did with that source.
http://sprunge.us/StUKF3

Then it's just a text file and you can parse it easily enough.

Parse that however you wish. Get it from there, save it to file.html, and parse the HTML file. An HTML file is just text. You can use awk to parse the tags in an HTML file, or a little Python. In other words, get that source into a file and practice parsing it.
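A rough illustration of that idea (a sketch only; the tag and attribute names depend on the page actually saved to file.html):
Code:
# Print the text inside every <title> tag: split records on "<" and fields on ">"
awk -v RS='<' -F'>' '/^title/ {print $2}' file.html

# List every href attribute in the file
grep -o 'href="[^"]*"' file.html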
 
Old 05-01-2019, 08:15 AM   #18
dugan (LQ Guru; Registered: Nov 2003; Location: Canada; Distribution: distro hopper; Posts: 11,235)
To download Pup, click on "Releases" and then click on "pup_v0.4.0_linux_amd64.zip". That zip file should have an executable binary in it.

Most distros have ~/.local/bin in the PATH. You put the pup binary there. You might need to run "rehash" afterwards.
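A rough sketch of those steps, assuming ~/.local/bin already exists and is on the PATH (the release file name below matches the one mentioned above; newer releases may differ):
Code:
# After downloading pup_v0.4.0_linux_amd64.zip from the project's Releases page:
unzip pup_v0.4.0_linux_amd64.zip -d "$HOME/.local/bin"
chmod +x "$HOME/.local/bin/pup"

# "rehash" is a csh/zsh builtin; in bash, "hash -r" clears the command lookup cache
hash -r
command -v pup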
 
Old 05-01-2019, 08:45 AM   #19
gilesaj001 (Original Poster; Member; Registered: Apr 2017; Location: Australia; Distribution: Ubuntu; Posts: 79)
Thanks, I have it working now, but the text is too much. I will have to look at it further and learn how pup works.

Thanks for all your help

EDIT: This is what I ended up doing on the web page:

Code:
<p>"If there is an Alert or Warning it will appear under this text"</p>

<?php
$message = 0;
$message = shell_exec("PATH to script/find-warnings.sh 2>&1");
if (empty($message)) {
    echo "No Alerts in Effect";
} else {
    print_r($message);
}
?>


This is what I used to get the name of the alert to put on the image:

Code:
WARN2=`grep "col-xs-10" $weatherFile | awk -F'>' '{print $2}' | sed 's|</div| |g'`
WARN3=`echo $WARN2 | cut -c1-26`
echo "Warning is this " $WARN3

http://dingo-den.com/index.php?nav=cam1

Last edited by gilesaj001; 05-02-2019 at 03:39 AM. Reason: added information
 
Old 09-19-2022, 04:05 AM   #20
gilesaj001 (Original Poster)
Quote:
Originally Posted by gilesaj001 View Post
The contents of the file is what you posted above: -rwxr-xr-x 1 root root 298 May 1 07:42 test.sh

#!/bin/bash

baseUrl="https://weather.gc.ca"

weatherData="$(curl -s $baseUrl/city/pages/on-118_metric_e.html)"
alertUrl="$(pup '.alert-item > a attr{href}' <<< $weatherData | head -1)"
[[ -n "$alertUrl" ]] || exit
alertData="$(curl -s $baseUrl$alertUrl | pup 'ul + p text{}')"

echo "$alertData"
This script has been working since I started using it in May last year. For some reason it no longer finds the text I am looking for.
The line that is not working, I think, is
Code:
 alertUrl="$(pup '.alert-item > a attr{href}' <<< $weatherData | head -1)"
I am not a programmer and tried to read up on pup, but could not make heads or tails of it.

As of the time of posting there is a weather statement on the website that should be picked up but isn't.

Any help appreciated.
 
Old 09-19-2022, 08:01 AM   #21
boughtonp (Senior Member; Registered: Feb 2007; Location: UK; Distribution: Debian; Posts: 3,610)
Quote:
Originally Posted by gilesaj001 View Post
For some reason it no longer finds the text I am looking for.
The line that is not working I think is
What PRECISELY is not working - i.e. what text is it finding instead? What leads you to believe it is that line that's failing?

Quote:
Code:
 alertUrl="$(pup '.alert-item > a attr{href}' <<< $weatherData | head -1)"
The pup command is simply two instructions - the first bit, ".alert-item > a", is standard CSS selector syntax that filters down to a specific "<a>" tag. (The "a" tag is used for hyperlinks.)

The second part "attr{href}" is Pup-specific, but it simply reads the value of the href attribute of the selected tag, which means it will output the URL.

It's very possible the HTML structure has changed slightly and caused it to fail - either by failing to select, or by selecting multiple matches (and then head outputting the wrong one; as an aside, that head syntax is obsolete and should be "head -n1" instead).

However, it could also easily be one of the other parts failing - maybe curl is not succeeding; having -s without -S means that any errors there would be suppressed. It's a good idea to use both ("-sS") so that a message is printed to stderr if something unexpected happens.
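Putting those two suggestions together, the relevant lines might look like this (a sketch only, not a tested fix):
Code:
# -sS: stay quiet on success but still print an error message if the fetch fails
weatherData="$(curl -sS "$baseUrl/city/pages/on-118_metric_e.html")"

# head -n1: the modern spelling of "first line only"
alertUrl="$(pup '.alert-item > a attr{href}' <<< "$weatherData" | head -n1)"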


Last edited by boughtonp; 09-19-2022 at 08:02 AM.
 
Old 09-19-2022, 08:33 AM   #22
gilesaj001 (Original Poster)
This is the data:

Code:
weatherData="$(curl -s $baseUrl/city/pages/on-118_metric_e.html)"
echo $weatherData
I had to put the output in a file on my server because it was too big: http://dingo-den.com/weather_text.txt

Code:
alertUrl="$(pup '.alert-item > a attr{href}' <<< $weatherData | head -1)"
echo $alertUrl
There is nothing in "alertUrl".

It is supposed to find
Code:
 href="/warnings/report_e.html?on41#1251147931110540001202209180503wz8889cwto"
Thanks for your help.

Last edited by gilesaj001; 09-19-2022 at 10:34 PM.
 
Old 09-22-2022, 10:49 PM   #23
gilesaj001 (Original Poster)
OK, I have been trying to get the data from the file but have been unsuccessful.

What I want to do is grab the first occurrence of an href in this file: http://dingo-den.com/weather_text.txt

That starts with
Code:
/warnings/report_e.html
and save the complete URL to a variable. The href in the file is
Code:
 href="/warnings/report_e.html?on41#1251147931110540001202209180503wz8889cwto"
and it changes when the warning changes. In this case I want to save
Code:
/warnings/report_e.html?on41#1251147931110540001202209180503wz8889cwto
to a variable called alertUrl

It does not have to use pup; I am fine using anything that will extract the URL.

As I said before, I am not a programmer, just a 73-year-old fart who has been playing around with computers since the '60s but never could code a damn.

Any help appreciated.

Last edited by gilesaj001; 09-22-2022 at 10:51 PM.
 
Old 09-23-2022, 10:14 AM   #24
teckk (LQ Guru)
A couple of quick examples. You are going to have to put your nose into the docs for whichever tool you want to use. You could use re to parse the output further.

Code:
from html.parser import HTMLParser
import urllib.request

url = "https://weather.gc.ca/city/pages/on-118_metric_e.html"

class LinkScrape(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    link = attr[1]
                    print('- ' + link)

if __name__ == '__main__':
    request_object = urllib.request.Request(url)
    page_object = urllib.request.urlopen(request_object)
    link_parser = LinkScrape()
    link_parser.feed(page_object.read().decode('utf-8'))
Code:
#!/usr/bin/python

from bs4 import BeautifulSoup
import requests
import re

url = "https://weather.gc.ca/city/pages/on-118_metric_e.html"

page = requests.get(url)    
data = page.text
soup = BeautifulSoup(data, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))
 
Old 09-23-2022, 10:53 AM   #25
teckk (LQ Guru)
Another example with simple tools.
Code:
url="https://weather.gc.ca/city/pages/on-118_metric_e.html"

weatherData=$(curl "$url")

echo "$weatherData"

echo "$weatherData" | grep -io '<a[^>]\+href[ ]*=[ \t]*"[^"]\+"'
You need all of the links, not just the ones that start with http.
 
Old 09-23-2022, 10:10 PM   #26
gilesaj001 (Original Poster)
Quote:
Originally Posted by teckk View Post
Another example with simple tools.
Code:
url="https://weather.gc.ca/city/pages/on-118_metric_e.html"

weatherData=$(curl "$url")

echo "$weatherData"

echo "$weatherData" | grep -io '<a[^>]\+href[ ]*=[ \t]*"[^"]\+"'
You need all of the links, not just the ones that start with http.
I tried your first two options and they had multiple errors. I am running on the command line and there is no GUI installed on the server.


This one ran and did show all the href links and there were a lot of them. Now all I need to find is the one that starts with
Code:
/warnings/report_e.html
and save the whole URL to a variable.
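One way to do that last step (a sketch only, assuming GNU grep/sed and the href format shown earlier in the thread, with $weatherData holding the curl output):
Code:
# Keep only the first href that starts with /warnings/report_e.html
# and strip the surrounding href="..." so just the URL remains
alertUrl="$(echo "$weatherData" \
    | grep -o 'href="/warnings/report_e\.html[^"]*"' \
    | head -n1 \
    | sed 's/^href="//; s/"$//')"

echo "$alertUrl"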
 
Old 09-30-2022, 10:03 PM   #27
gilesaj001 (Original Poster)
The script is working again. I found that when I updated pup to a new version it was installed in a different place than the original, and because there were then two versions of pup on the system it didn't work. I deleted the old one, set the PATH to the new one, and it works as it used to.
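For anyone who hits the same thing, one way to spot a duplicate binary is to ask the shell where it is looking (a general troubleshooting sketch, not specific to this setup):
Code:
# List every pup the shell can see, in PATH order; the first one listed wins
type -a pup

# Show which one a script will actually run
command -v pup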

Thank you to those who tried to help. It must be exhausting to try and help an old fart like me who does not know programming.
 
  

