The power of bash: Web scraping (https://www.linuxquestions.org/questions/programming-9/the-power-of-bash-web-scraping-4175659262/)

teckk 08-16-2019 04:07 PM

The power of bash: Web scraping
 
Only some web pages can be scraped with bash and friends, and it requires looking at the source code of each page to determine what the page is doing. Even pages that deliver content from scripts that run on the web servers, content that is not visible to the end user, can sometimes be scraped with bash and curl/grep/awk/sed. It just takes a little effort, and that effort is worth it, especially if it's something that you are going to use every week.

When it is possible, it's a whole lot faster to scrape with bash than with python, it uses less bandwidth if you can grab a page without all those scripts running, and it's enjoyable because you do it yourself instead of some python module doing it for you.
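
The general pattern is always the same: fetch the raw HTML with curl, then pick out the pieces you want with grep/awk/sed. A minimal sketch of that pattern, with a placeholder URL (not the PBS example below):
Code:

#! /usr/bin/env bash

#User agent so the server sees a normal browser
agent="Mozilla/5.0 (Windows NT 10.0; x86_64; rv:67.0) Gecko/20100101 Firefox/67.0"

#Fetch the page source and look it over
curl -sLA "$agent" "https://example.com/page.html" | less

#Or pull every link out of it in one shot
curl -sLA "$agent" "https://example.com/page.html" |
    grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*"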

Let's use a real example:

Blue Sky Metropolis series.
https://www.pbs.org/video/wings-mvizww/
https://www.pbs.org/video/the-big-chill-ih9lhs/
https://www.pbs.org/video/a-space-odyssey-fxc0bj/
https://www.pbs.org/video/back-to-the-future-a7jt15/

You could scrape all of those with:
https://www.linuxquestions.org/quest...ng-4175659141/

And that will get you the info that you want. But it's kind of slow, you have to wait for python to load and run all those scripts, and you look at gkrellm while it's loading and think "Golly, I only want the page, not the whole web server!"

Is there a way that I can scrape that with bash and friends, download just a few kb of html, and be done? Yes. Sometimes.

Example for the links above.
Code:

#! /usr/bin/env bash

clear

#User agent for requests
agent="Mozilla/5.0 (Windows NT 10.0; x86_64; rv:67.0) Gecko/20100101 Firefox/67.0"

#Get web page url
#Example: http://www.pbs.org/video/good-show-shn41u/
read -p "Enter/paste PBS video web page URL: " url

#Get PBS video url from page source
vid_url=$(curl -sLA "$agent" "$url" |
    grep -oP "https://player.pbs.org/viralplayer/[0-9]{1,20}/")

#Get redirects for videos into array
redirect=($(curl -sA "$agent" "$vid_url" |
    grep -oP "https://urs.pbs.org/redirect/[a-z0-9]{1,40}/"))

#Parse redirects in array for .mp4 and .m3u8 streams
mp4_url=$(curl -sA "$agent" "${redirect[1]}" |
    grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | head -n1)
m3u8_url=$(curl -sA "$agent" "${redirect[0]}" |
    grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | head -n1)

#Get streams in .m3u8 playlist
m3u8_streams=$(curl -sA "$agent" "$m3u8_url")

#Print output url's
echo -e "\nmp4 redirect:\n"${redirect[1]}"\n\nmp4 url:\n"$mp4_url"\n"
echo -e "\nm3u8 redirect:\n"${redirect[0]}"\n\nm3u8 url:\n"$m3u8_url"\n"

#Put some spaces in the .m3u8 streams output
echo -e "\nm3u8 streams:"
for i in $m3u8_streams; do
    echo -e ""$i"\n"
done

Once you have the info scraped, you already have curl, youtube-dl, and wget to use for the download.
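
For instance, using the variables from the script above (the output file names are just examples):
Code:

#Download the direct .mp4 stream with curl
curl -LA "$agent" -o video.mp4 "$mp4_url"

#Or hand the .m3u8 playlist to youtube-dl
youtube-dl -o video.mp4 "$m3u8_url"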

Will the bash script that you make last forever? Nope. As soon as something on the page changes, the script breaks, and you'll need to go back to the page source, see what has changed, and alter the script.

But it'll get you the info quicker than python can get itself loaded into RAM.

Happy scraping.

Contrapak 08-17-2019 10:19 AM

jq and pup are also your best friends.
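
For example, something along these lines (the URL, selector, and JSON field are just placeholders):
Code:

#pup: pull hrefs out of HTML with a CSS selector
curl -s "https://example.com/page.html" | pup 'a attr{href}'

#jq: pull a field out of a JSON response
curl -s "https://example.com/api/videos.json" | jq -r '.[].url'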

ondoho 08-18-2019 03:56 AM

or xmllint - part of libxml2.
That's what I use.
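
For example (placeholder URL and XPath):
Code:

#xmllint: extract hrefs with an XPath expression
#2>/dev/null hides the HTML parser warnings
curl -s "https://example.com/page.html" |
    xmllint --html --xpath '//a/@href' - 2>/dev/null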

