LinuxQuestions.org
08-16-2019, 04:07 PM   #1
teckk
Senior Member
 
Registered: Oct 2004
Distribution: FreeBSD Arch
Posts: 2,263

Rep: Reputation: 499
The power of bash: Web scraping


Only some web pages can be scraped with bash and friends, and you will need to look at the source of each page to work out what the page is doing. Even pages that deliver content from scripts running on the web server, content that is not visible to the end user, can sometimes be scraped with bash and curl/grep/awk/sed. It just takes a little effort, and that effort is worth it, especially for something you are going to use every week.
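A quick way to start reading a page's source is to list every URL it mentions and look for anything that smells like a player or a stream. This is only a rough sketch: list_urls is a made-up helper name, and the character class is the same loose one used for stream URLs later in this post, so it will miss URLs with unusual characters.

```shell
#!/usr/bin/env bash

# Print every distinct http(s) URL found in the HTML read from stdin.
# list_urls is a hypothetical helper name, not part of any tool.
list_urls() {
    grep -Eo 'https?://[a-zA-Z0-9./?=_-]+' | sort -u
}

# Typical use (needs network access, so it is left commented out):
#   curl -sL "$url" | list_urls | grep -i player
```

Once something interesting shows up in that list, a tighter grep pattern for just that URL is usually all the script needs.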

When it works, scraping with bash is a whole lot faster than scraping with Python. It uses less bandwidth, because you fetch the page without running all those scripts, and it's enjoyable because you have to do the work yourself instead of some Python module doing it for you.

Let's use an example:

Blue Sky Metropolis series.
https://www.pbs.org/video/wings-mvizww/
https://www.pbs.org/video/the-big-chill-ih9lhs/
https://www.pbs.org/video/a-space-odyssey-fxc0bj/
https://www.pbs.org/video/back-to-the-future-a7jt15/

You could scrape all of those with:
https://www.linuxquestions.org/quest...ng-4175659141/

And that will get you the info that you want. But it's kind of slow: you have to wait for Python to load and run all those scripts, and you look at gkrellm while it's loading and think "Golly, I only want the page, not the whole web server!"

Is there a way that I can scrape that with bash and friends, download just a few kb of html, and be done? Yes. Sometimes.

Example for the links above.
Code:
#!/usr/bin/env bash

clear

# User agent for requests
agent="Mozilla/5.0 (Windows NT 10.0; x86_64; rv:67.0) Gecko/20100101 Firefox/67.0"

# Get web page url
# Example: http://www.pbs.org/video/good-show-shn41u/
read -rp "Enter/paste PBS video web page URL: " url

# Get PBS video url from page source (first match only, dots escaped)
vid_url=$(curl -sLA "$agent" "$url" |
    grep -oP 'https://player\.pbs\.org/viralplayer/[0-9]{1,20}/' | head -n1)

# Get redirects for videos into array, one url per element
mapfile -t redirect < <(curl -sA "$agent" "$vid_url" |
    grep -oP 'https://urs\.pbs\.org/redirect/[a-z0-9]{1,40}/')

# Parse redirects in array for .mp4 and .m3u8 streams
# (no -L here: the stream url is read from the response body)
mp4_url=$(curl -sA "$agent" "${redirect[1]}" |
    grep -Eo 'https?://[a-zA-Z0-9./?=_-]*' | head -n1)
m3u8_url=$(curl -sA "$agent" "${redirect[0]}" |
    grep -Eo 'https?://[a-zA-Z0-9./?=_-]*' | head -n1)

# Get streams in .m3u8 playlist
m3u8_streams=$(curl -sA "$agent" "$m3u8_url")

# Print output urls
printf '\nmp4 redirect:\n%s\n\nmp4 url:\n%s\n\n' "${redirect[1]}" "$mp4_url"
printf '\nm3u8 redirect:\n%s\n\nm3u8 url:\n%s\n\n' "${redirect[0]}" "$m3u8_url"

# Put some spaces in the .m3u8 streams output
printf '\nm3u8 streams:\n'
for i in $m3u8_streams; do
    printf '%s\n\n' "$i"
done
Once you have the URLs scraped, you already have curl, youtube-dl, or wget to download them with.
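For the curl route, a download step could look something like the sketch below. The helper names name_from_url and fetch are made up for illustration, and it assumes the scraped URL ends in a usable file name once the query string is dropped. That may not hold for every stream URL.

```shell
#!/usr/bin/env bash

# Same user agent string as the scraping script.
agent="Mozilla/5.0 (Windows NT 10.0; x86_64; rv:67.0) Gecko/20100101 Firefox/67.0"

# Derive a local file name from a URL: drop the query string,
# then keep everything after the last slash.
# (hypothetical helper, assumes the URL path ends in a file name)
name_from_url() {
    local url=${1%%\?*}
    printf '%s\n' "${url##*/}"
}

# Download one URL with curl: follow redirects (-L) and resume a
# partial download if the output file already exists (-C -).
fetch() {
    local url=$1
    curl -L -C - -A "$agent" -o "$(name_from_url "$url")" "$url"
}

# Typical use (needs network access, so it is left commented out):
#   fetch "$mp4_url"
```

The .m3u8 URL is a playlist rather than a single file, so that one is better handed to youtube-dl or a similar tool than to a plain curl download.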

Will the bash script that you make last forever? Nope. As soon as something on the page changes, the script breaks. You'll need to go back to the page source, see what has changed, and alter the script.
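One cheap way to notice that kind of breakage is to check each scraped value before using it, so the script stops with a message instead of passing an empty string to the next curl. A minimal sketch, with require being a made-up helper name:

```shell
#!/usr/bin/env bash

# Return non-zero and complain on stderr when a scraped value is empty.
# require is a hypothetical helper: first argument is a label for the
# message, second argument is the value to check.
require() {
    local label=$1 value=$2
    if [[ -z $value ]]; then
        printf 'No %s found; the page layout has probably changed.\n' "$label" >&2
        return 1
    fi
}

# Typical use inside a scraping script (commented out here):
#   require "player URL" "$vid_url" || exit 1
```

It won't tell you what changed on the page, only which step came back empty, which is usually enough to know where to start looking in the source.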

But it'll get you the info quicker than python can get itself loaded into RAM.

Happy scraping.
 
08-17-2019, 10:19 AM   #2
Contrapak
Member
 
Registered: May 2019
Location: /home/
Distribution: Arch Linux
Posts: 131

Rep: Reputation: 51
jq and pup are also your best friends.
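A rough idea of how those two fit in, assuming both are installed: jq handles JSON (some pages and endpoints hand their stream info back as JSON), and pup picks elements out of HTML with CSS selectors. The function names and the sample JSON shape in the comments are made up for illustration.

```shell
#!/usr/bin/env bash

# Print every string value in a JSON document (read from stdin)
# that looks like an http(s) URL. Hypothetical helper name.
json_urls() {
    jq -r '.. | strings | select(test("^https?://"))'
}

# Print the href of every <a> element in HTML read from stdin.
# Hypothetical helper name; relies on pup's attr{} display function.
html_links() {
    pup 'a attr{href}'
}

# Typical use (needs network access, so it is left commented out):
#   curl -s "$some_json_endpoint" | json_urls
#   curl -sL "$url" | html_links
```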
 
08-18-2019, 03:56 AM   #3
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 12,307
Blog Entries: 9

Rep: Reputation: 3309
or xmllint - part of libxml2.
That's what I use.
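For anyone who hasn't tried it, an xmllint version of "find the player URL" might look roughly like this. It is only a sketch: first_iframe_src is a made-up name, and it assumes the URL you want sits in the src attribute of an iframe, which has to be checked against the actual page source.

```shell
#!/usr/bin/env bash

# Print the src attribute of the first <iframe> in HTML read from stdin.
# --html uses the forgiving HTML parser, --xpath evaluates the expression,
# and parser warnings about sloppy markup are discarded.
# first_iframe_src is a hypothetical helper name.
first_iframe_src() {
    xmllint --html --xpath 'string(//iframe/@src)' - 2>/dev/null
}

# Typical use (needs network access, so it is left commented out):
#   curl -sL "$url" | first_iframe_src
```

Compared with grep, the XPath expression follows the document structure, so it keeps working when attribute order or whitespace in the tag changes.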
 
  


All times are GMT -5.
