08-14-2019, 11:47 AM   #1
teckk (Senior Member)
The power of Python: web scraping


First: check the laws where you live to see whether scraping a web page for some kind of info is legal. If not, don't scrape it.

There are always questions on LQ asking how to scrape something from somewhere, and they get answered over and over again. There are many tools for the job: Python, Beautiful Soup, Selenium, PhantomJS, Perl, even bash and friends can scrape pages sometimes.

Usually one needs to write a script for each particular site. But is there a one-size-fits-all way to scrape pages? Yes.

The most basic way is to open the web inspector in your web browser, load a page, and look through the output for what you want. That's a little slow and cumbersome, though.

If a web browser can parse all scripts and load all content...
Can I get the info from the web browser and dump it to the terminal or a file?
And parse what is being dumped to the terminal for <search term>?
Or use a simple text editor to search that log file?
Or get a cookie that I can use with curl to authenticate and download something? Yes.

I'll be using Python 3, QtWebEngine, and PyQt5 for these basic examples.
Why? Because that's what I'm using at this time.

Dump all requests a web page makes to the terminal and log them to a file.
Code:
#! /usr/bin/env python

#Print web page requests to terminal and file.

#Import only what you need
import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineCore import QWebEngineUrlRequestInterceptor
from PyQt5.QtWebEngineWidgets import (QWebEngineView, 
                            QWebEnginePage, QWebEngineProfile)
#Get web page url                            
URL = input('Enter/paste url to inspect: ')

#User agent for requests
agent = ('Mozilla/5.0 (Windows NT 10.0; x86_64; rv:67.0) '
            'Gecko/20100101 Firefox/67.0')

#Intercept requests, print to terminal and file
class UrlRequestInterceptor(QWebEngineUrlRequestInterceptor):
    def interceptRequest(self, info):
        req = info.requestUrl()
        req2str = req.toString()
        print('\n' + req2str)
        #Append info to log file
        with open('myinspector.log', 'a') as f:
            f.write(req2str + '\n\n')

#QWebEnginePage has no acceptRequest() method; override
#acceptNavigationRequest() here if you want to filter navigations.
class WebEnginePage(QWebEnginePage):
    pass

#Make a browser window, load page
if __name__ == "__main__":
    app = QApplication(sys.argv)
    
    interceptor = UrlRequestInterceptor()
    profile = QWebEngineProfile()
    profile.setHttpUserAgent(agent) #Set user agent
    #Qt/PyQt 5.13+ renames this to setUrlRequestInterceptor()
    profile.setRequestInterceptor(interceptor)
    
    browser = QWebEngineView()
    page = WebEnginePage(profile, browser)
    page.setUrl(QUrl(URL))
    page.setZoomFactor(1.2) #Zoom
    
    browser.setPage(page)
    browser.setMinimumSize(1000,800) #Browser size
    browser.show()
    sys.exit(app.exec_())
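If all you want is to grep that dump for one term (the <search term> idea above), a few lines of Python will do. A minimal sketch, assuming the myinspector.log written by the script above; the '.mp4' term is just an example:
Code:
#! /usr/bin/env python

#Scan the request log from the script above for a search term.
#'.mp4' is only an example; substitute whatever you are after.
term = '.mp4'

with open('myinspector.log') as f:
    for line in f:
        if term in line:
            print(line.strip())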
Get a dictionary of the cookies that a page sets.
Code:
#!/usr/bin/env python

import sys
#Import only what you need; QtWebKitWidgets is obsolete and was unused
from PyQt5.QtNetwork import QNetworkCookie
from PyQt5.QtCore import QUrl, Qt
from PyQt5.QtWidgets import QApplication, QMainWindow
from PyQt5.QtWebEngineWidgets import (QWebEngineView, 
                            QWebEngineProfile, QWebEnginePage)
        
url = input('Enter/Paste url for cookies: ')

class MainWindow(QMainWindow):
    def __init__(self, *args, **kwargs):
        QMainWindow.__init__(self, *args, **kwargs)
        self.webview = QWebEngineView()
        profile = QWebEngineProfile("storage", self.webview)
        cookie_store = profile.cookieStore()
        cookie_store.cookieAdded.connect(self.onCookieAdded)
        self.cookies = []
        webpage = QWebEnginePage(profile, self.webview)
        self.webview.setPage(webpage)
        self.webview.load(QUrl(url))
        self.setCentralWidget(self.webview)

    def onCookieAdded(self, cookie):
        for c in self.cookies:
            if c.hasSameIdentifier(cookie):
                return
        self.cookies.append(QNetworkCookie(cookie))
        self.toJson()

    def toJson(self):
        cookies_list_info = []
        for c in self.cookies:
            data = {
            "name": bytearray(c.name()).decode(), 
            "domain": c.domain(), 
            "value": bytearray(c.value()).decode(),
            "path": c.path(), 
            "expirationDate": c.expirationDate().toString(Qt.ISODate), 
            "secure": c.isSecure(),
            "httponly": c.isHttpOnly()
            }
            cookies_list_info.append(data)
        print('\n\n' 'Cookie dictionary:')
        print(cookies_list_info)

if __name__ == '__main__':
    app = QApplication(sys.argv)
    w = MainWindow()
    w.show()
    sys.exit(app.exec_())
So, if you run that last script on https://www.linuxquestions.org/quest...ux-software-2/ you'll get something like:
Code:
Cookie dictionary:
[{'name': '__cfduid', 'domain': '.linuxquestions.org', 'value': 'd72cdf693c618d9976e5539d20fe96bba1565798585', 'path': '/', 'expirationDate': '2020-08-13T11:04:20', 'secure': False, 'httponly': True}, {'name': 'bblastvisit', 'domain': 'www.linuxquestions.org', 'value': '1565798585', 'path': '/', 'expirationDate': '2020-08-13T11:04:20', 'secure': False, 'httponly': False}, {'name': 'bblastactivity', 'domain': 'www.linuxquestions.org', 'value': '0', 'path': '/', 'expirationDate': '2020-08-13T11:04:20', 'secure': False, 'httponly': False}, {'name': 'bb2_screener_', 'domain': 'www.linuxquestions.org', 'value':
......
That's a list of dictionaries, so you can pull whatever item you want out of it.
That's a very basic script; you'll need to clear <user>/.local/share/<scriptname> and <user>/.cache/<scriptname> manually.
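Those name/value pairs are all curl needs for authentication. A minimal sketch of building a Cookie header from that list; cookies_list_info here is retyped from the example output above:
Code:
#Build a 'Cookie:' header value from the list that toJson() prints.
#Example data retyped from the run above.
cookies_list_info = [
    {'name': 'bblastvisit', 'value': '1565798585'},
    {'name': 'bblastactivity', 'value': '0'},
]

header = '; '.join(c['name'] + '=' + c['value'] for c in cookies_list_info)
print('Cookie: ' + header)
#Then: curl -H 'Cookie: <that string>' <url>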

Get the source for a page after its scripts have run. Not very useful on its own, but it can be parsed later.
Code:
#! /usr/bin/env python

#Get source with scripts run

import sys
from PyQt5.QtWebEngineWidgets import (QWebEnginePage, 
                        QWebEngineProfile, QWebEngineView)
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl

agent = ('Mozilla/5.0 (Windows NT 10.0; x86_64; rv:67.0)'
        ' Gecko/20100101 Firefox/67.0')

class Source(QWebEnginePage):
    def __init__(self, url, _file):
        self.app = QApplication([])
        QWebEnginePage.__init__(self)
        
        #Set the user agent on the default profile, which this page uses
        QWebEngineProfile.defaultProfile().setHttpUserAgent(agent)
        
        self._file = _file
        self.load(QUrl(url))
        self.loadFinished.connect(self.on_load_finished)
        self.app.exec_()
        
    def on_load_finished(self):
        #toHtml() is asynchronous; the result arrives in write_it()
        self.toHtml(self.write_it)

    def write_it(self, data):
        self.html = data
        with open(self._file, 'w') as f:
            f.write(self.html)
        print('\nFinished\nFile saved to ' + self._file)
        self.app.quit()

if __name__ == '__main__':

    url = input('Enter/Paste url for source: ')
    _file = input('Enter output file name: ')
    Source(url, _file)
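Once the rendered source is on disk you can parse it with whatever you like. A minimal sketch using only the standard library; 'page.html' stands in for whatever output file name you entered:
Code:
#! /usr/bin/env python

#Pull the href links out of the saved page, stdlib only.
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print(value)

with open('page.html') as f:
    LinkParser().feed(f.read())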
If you are trying to do web scraping, that should be a shove in the right direction.
Happy scraping.

Anyone is welcome to contribute to the thread with Perl, PhantomJS, GObject, Selenium, Soup... whatever you use.

 
08-14-2019, 02:33 PM   #2
teckk (Senior Member, Original Poster)
PyQt5 tutorial

Code:
#! /usr/bin/env python

import urllib.request
from time import sleep

url_list = ('http://zetcode.com/gui/pyqt5/introduction/',
'http://zetcode.com/gui/pyqt5/firstprograms/',
'http://zetcode.com/gui/pyqt5/menustoolbars/',
'http://zetcode.com/gui/pyqt5/layout/',
'http://zetcode.com/gui/pyqt5/eventssignals/',
'http://zetcode.com/gui/pyqt5/dialogs/',
'http://zetcode.com/gui/pyqt5/widgets/',
'http://zetcode.com/gui/pyqt5/widgets2/',
'http://zetcode.com/gui/pyqt5/dragdrop/',
'http://zetcode.com/gui/pyqt5/painting/',
'http://zetcode.com/gui/pyqt5/customwidgets/',
'http://zetcode.com/gui/pyqt5/tetris/')

a = ('Mozilla/5.0 (Windows NT 10.0; x86_64; rv:67.0)' 
        ' Gecko/20100101 Firefox/67.0')
        
user_agent = {'User-Agent': a}

cnt = 1

for url in url_list:
    req = urllib.request.Request(url, data=None, headers=user_agent)
    with open('PyQt5_' + str(cnt) + '.html', 'wb') as f:
        f.write(urllib.request.urlopen(req).read())
    cnt += 1
    #Sleep at least 30 seconds between requests.
    sleep(30)
 
08-14-2019, 05:33 PM   #3
individual (Member)
PyQt5 seems like overkill for basic scraping, but it's still interesting to see. I don't use Python much anymore, but when I did I used the following (and they will take you very far):
requests
BeautifulSoup
lxml

For beginners, I would recommend studying up on CSS selectors and XPath. I prefer CSS selectors, but XPath can sometimes be easier.
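For example, here is a rough sketch of the same query done both ways with that stack. The URL, the div.post selector, and its XPath twin are placeholders (note the CSS class match is looser than the literal @class comparison):
Code:
import requests
from bs4 import BeautifulSoup
from lxml import html

text = requests.get('https://example.com/').text

#CSS selector with BeautifulSoup
soup = BeautifulSoup(text, 'lxml')
for a in soup.select('div.post a'):
    print(a.get('href'))

#Roughly the same query as XPath with lxml
doc = html.fromstring(text)
for href in doc.xpath('//div[@class="post"]//a/@href'):
    print(href)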
 
08-14-2019, 05:40 PM   #4
individual (Member)
Quote:
Originally Posted by teckk View Post
PyQt5 tutorial

You can use enumerate to get the current index when iterating over a list.
Code:
#start=1 matches the old cnt = 1 counter
for i, url in enumerate(url_list, start=1):
    ...

 
08-15-2019, 09:17 AM   #5
teckk (Senior Member, Original Poster)
Scraping XML pages example:
Such as an RSS/podcast feed. You don't need any scripts run here.

I'll make up a short example of an RSS feed page.

MyPage.xml, which I've already downloaded:
Code:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="common.css"?>

<rss xmlns:ibunes="http://www.ibunes.com/dtds/podcast-1.0.dtd" version="2.0">

    <channel>

        <title>The happy music group</title>

        <ibunes:author>John Smith Happy</ibunes:author>

        <description>We make all kinds of happy music.</description>

        <ibunes:image href="http://www.happy.com/happymusic.jpg" />

        <ibunes:category text="Muscic &amp; Happy">

        <ibunes:category text="Bands" />
            
        </ibunes:category>

        <language>en-us</language>

        <webMaster>John@happy.com</webMaster>

        <ibunes:keywords>Happy, Music, Bands, John Smith Happy</ibunes:keywords> 

        <item>

            <title>Song no 1 - Daily Radio Broadcast 05/09/19</title>

            <link>http://happy-apple-vod.adaptive.level3.net/Audio-Podcasts/rss/song1.mp3</link>

            <guid>http://happy-apple-vod.adaptive.level3.net/Audio-Podcasts/rss/song1.mp3</guid>

            <ibunes:author>John Smith Happy</ibunes:author>

            <ibunes:duration>3:32</ibunes:duration>

            <description>This is song 1 - Daily Radio Broadcast 05/09/19. Please visit www.happy.com for more information. Episode Length: 3:32</description>

            <pubDate>Tue, 15 May 2019 20:25:52 GMT</pubDate>

            <enclosure url="http://happy-apple-vod.adaptive.level3.net/Audio-Podcasts/rss/song1.mp3" length="554993" type="audio/mpeg" />

        </item>
        
        <item>

            <title>Song no 2 - Daily Radio Broadcast 05/10/19</title>

            <link>http://happy-apple-vod.adaptive.level3.net/Audio-Podcasts/rss/song2.mp3</link>

            <guid>http://happy-apple-vod.adaptive.level3.net/Audio-Podcasts/rss/song2.mp3</guid>

            <ibunes:author>John Smith Happy</ibunes:author>

            <ibunes:duration>4:25</ibunes:duration>

            <description>This is song 2 - Daily Radio Broadcast 05/09/19. Please visit happy.com for more information. Episode Length: 4:25</description>

            <pubDate>Wed, 03 Aug 2019 20:49:00 GMT</pubDate>

            <enclosure url="http://happy-apple-vod.adaptive.level3.net/Audio-Podcasts/rss/song2.mp3" length="554993" type="audio/mpeg" />

        </item>
        
    </channel>

</rss>
Parse that page for the mp3 links and the broadcast dates.
Code:
#! /usr/bin/python
     
import xml.etree.ElementTree
import urllib.request

#Make a user agent string for urllib to use
agent = ('Mozilla/5.0 (Windows NT 10.0; x86_64; rv:67.0)'
        ' Gecko/20100101 Firefox/67.0')
                
user_agent = {'User-Agent': agent}

#Made up example url for xml page
url1 = ('http://happy-apple-vod.adaptive.level3.net/'
        'Audio-Podcasts/rss/happy.xml')
        
#If you already have it      
url2 = ('file:///path/to/MyPage.xml')

#Get the xml tree to parse
req = urllib.request.Request(url2, data=None, headers=user_agent)
html = urllib.request.urlopen(req)
tree = xml.etree.ElementTree.parse(html)
root = tree.getroot()

#Make list of first tags
a = []
for i in root.iter('link'):
    a.append(i.text)
    
#Make list of second tags
b = []
for i in root.iter('pubDate'):   
    b.append(i.text)

#Combine alternate lines from both lists
c = [x for y in zip(a,b) for x in y]

#Write them to file, spaced
with open('happy.log', 'a') as f:
    for i in c:
        f.write(i+"\n\n")
Code:
cat happy.log
http://happy-apple-vod.adaptive.level3.net/Audio-Podcasts/rss/song1.mp3

Tue, 15 May 2019 20:25:52 GMT

http://happy-apple-vod.adaptive.level3.net/Audio-Podcasts/rss/song2.mp3

Wed, 03 Aug 2019 20:49:00 GMT
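The zip trick pairs the two lists by position, so it assumes every item has both tags. A minimal alternative sketch that walks each <item> instead, keeping link and date paired even when one is missing:
Code:
#! /usr/bin/python

#Walk each <item> so link and pubDate stay paired per episode.
import xml.etree.ElementTree

tree = xml.etree.ElementTree.parse('MyPage.xml')
for item in tree.getroot().iter('item'):
    link = item.findtext('link', default='')
    date = item.findtext('pubDate', default='')
    print(link, date)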
 
08-15-2019, 10:41 AM   #6
dugan (LQ Guru)
Quote:
Originally Posted by individual View Post
PyQt5 seems like overkill for basic scraping, but it's still interesting to see. I don't use Python much anymore, but when I did I used the following (and they will take you very far):
requests
BeautifulSoup
lxml
They take you exactly until you need to execute JavaScript in the page in order to scrape it. That's where teckk's example takes over.
 
08-15-2019, 11:54 AM   #7
teckk (Senior Member, Original Poster)
Real-world example of scraping a script-heavy page for an .mp4 that plays on it.

Code:
#! /usr/bin/env python

import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineCore import QWebEngineUrlRequestInterceptor
from PyQt5.QtWebEngineWidgets import (QWebEngineView, 
                            QWebEnginePage, QWebEngineProfile)
                           
URL = 'https://www2.solarmoviex.to/watch/wagon-train-1.nr01j/lvpj9z'

agent = ('Mozilla/5.0 (Windows NT 10.0; x86_64; rv:67.0) '
            'Gecko/20100101 Firefox/67.0')

class UrlRequestInterceptor(QWebEngineUrlRequestInterceptor):
    def interceptRequest(self, info):
        req = info.requestUrl()
        req2str = req.toString()
        print('\n' + req2str)
        #Append info to log file
        with open('myinsp.log', 'a') as f:
            f.write(req2str + '\n\n')

#QWebEnginePage has no acceptRequest() method; override
#acceptNavigationRequest() here if you want to filter navigations.
class WebEnginePage(QWebEnginePage):
    pass
        
if __name__ == "__main__":
    app = QApplication(sys.argv)
    
    interceptor = UrlRequestInterceptor()
    profile = QWebEngineProfile()
    profile.setHttpUserAgent(agent) #Set user agent
    profile.setRequestInterceptor(interceptor)
    
    browser = QWebEngineView()
    page = WebEnginePage(profile, browser)
    page.setUrl(QUrl(URL))
    page.setZoomFactor(1.2) #Zoom
    
    browser.setPage(page)
    browser.setMinimumSize(1000,800) #Browser size
    browser.show()
    sys.exit(app.exec_())
Run the script and wait for the page to load completely. You'll see an arrow on the video collage; click that two or three times, then watch the terminal for the info you seek, or search the log file.

Output in terminal and in the log:
Code:
...
https://openload.pw/stream/mBrPQfF5gbw~1565973427~<edit>~9GU-ep9z?mime=true

https://oqbkip.oloadcdn.net/dl/l/z8Ti-5wTbT39ndZp/mBrPQfF5gbw/01+The+Willy+Moran+Story.mp4?mime=true
...
Code:
wget --spider https://oqbkip.oloadcdn.net/dl/l/z8Ti-5wTbT39ndZp/mBrPQfF5gbw/01+The+Willy+Moran+Story.mp4
Spider mode enabled. Check if remote file exists.
--2019-08-15 11:43:25--  https://oqbkip.oloadcdn.net/dl/l/z8Ti-5wTbT39ndZp/mBrPQfF5gbw/01+The+Willy+Moran+Story.mp4
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving oqbkip.oloadcdn.net (oqbkip.oloadcdn.net)... 89.33.246.161, 2a04:9dc0:11:375::
Connecting to oqbkip.oloadcdn.net (oqbkip.oloadcdn.net)|89.33.246.161|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 351532257 (335M) [application/octet-stream]
Remote file exists.
If it'll load in a web browser, you can get it. One script fits all.
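And once you have the direct link, plain urllib can fetch it with the same user agent. A minimal sketch; the url below is a placeholder for whatever you pulled out of the log:
Code:
#! /usr/bin/env python

import urllib.request

agent = ('Mozilla/5.0 (Windows NT 10.0; x86_64; rv:67.0) '
            'Gecko/20100101 Firefox/67.0')

#Placeholder; paste the link captured in the log here
url = 'https://example.com/video.mp4'

req = urllib.request.Request(url, headers={'User-Agent': agent})
with open('video.mp4', 'wb') as f:
    f.write(urllib.request.urlopen(req).read())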

The scripts that use PyQt5/QtWebEngine can be run with --disable-gpu if you are having problems with WebGL or a video device that can't multi-thread.

I've got an ancient machine that will crash with QtWebEngine. That's OK; add a little bash:

Code:
#! /usr/bin/bash

read -p "Enter/Paste url to inspect: " url

/home/<user>/script.py \
--disable-gpu \
--disable-checker-imaging \
--disable-flash-stage3d \
--num-raster-threads=1 \
--video-threads=1 \
--disable-webgl \
--disable-es3-apis \
--disable-accelerated-2d-canvas \
--disable-accelerated-video-decode \
--disable-surface-synchronization \
"$url"
