LinuxQuestions.org

jamtat 07-20-2019 04:03 PM

Help with a script to parse youtube search output, present data and links
 
I'd like to put together a script that parses youtube search output and presents the user with a menu of results that includes things like video duration, title, and direct links (such as could be fed to youtube-dl). I'm partway there using cURL and some grep-fu: I can construct a search URL and feed it to cURL to get the search results page. The grep-fu I've been experimenting with gets rid of a lot of the gobbledygook from the results page: it selects the single very long line between html tags that contains each of the needed elements, and an additional grep excludes the erroneous results the search page presents. But even though I've narrowed down the content by about 90%, a lot of extraneous material still needs to be stripped out.

So what I need most help with is excising just the relevant information from the results page: the direct link (between quotation marks after <a href=), the title (between quotation marks after title=), and the duration (Duration: plus the following 5 spaces/digits). After that, I need help presenting/formatting it in a sort of menu. Something like sed or awk should be a lot more powerful and flexible than grep in identifying and excising target text, but my abilities in those areas are pretty limited, which is why I started out by using grep. In any case, input on accomplishing the task will be appreciated.

The search results page can be cURL'd to a local file using a (fictitious) URL like the following:
Code:

curl -o YTsearchresults https:// www dot youtube dot com/ results?search_query=mycool+video+7%2F18%2F15
(assume new results show up daily--thus the search by date)

Sample of partly parsed output I get by using grep commands with which I've experimented:

Code:

<h3 class="yt-lockup-title "><a href="/watch?v=axGC_56LMdc" class="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink      spf-link " data-sessionlink="zfit=CDIQ3DAYEyIKCMyLro-UwONCFUL3AwudgeYNKPj0JDIGc2VycmNoUhZ0dQNrZXIsY2FybKNvbiA3LzT4LzE4"  title="My Cool video" aria-describedby="description-id-889672" rel="spf-prefetch" dir="ltr"><span aria-label="My Cool by Someone 3 months ago 27
minutes, 16 seconds 91,148 views">My Cool video</span></a><span class="accessible-description" id="description-id-889672"> - Duration: 27:16.</span></h3><div class="yt-lockup-byline ">

The watch?v=etc part can be tacked onto the end of the youtube URL to get a direct link to the desired video. The title would probably be the easiest part to excise, since it follows title= and is surrounded by quotation marks. Duration: 27:16 should be straightforward to grab as well. Finally, the youtube search results page always seems to contain some irrelevant results that need to be filtered out by title. I used a second grep in the pipe to remove those, by doing something like | grep Cool
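
For what it's worth, the three pieces described above could probably be pulled out in a single sed pass. What follows is only a sketch under some assumptions: extract_fields is a made-up helper name, the sample line is a shortened copy of the one above, each result is assumed to sit on one line (wrapped lines would need joining first), and GNU sed is assumed for \t in the replacement.

```shell
# Sketch only: print "title<TAB>duration<TAB>link" for each result line on stdin.
# extract_fields is a hypothetical helper name, not an existing tool.
extract_fields() {
  # -n with the trailing p prints only lines where the substitution matched.
  # Group 1 = /watch?v=..., group 2 = title text, group 3 = duration digits.
  sed -nE 's#.*<a href="(/watch\?v=[^"]+)".* title="([^"]*)".*Duration: ([0-9:]+).*#\2\t\3\thttps://www.youtube.com\1#p'
}

# Shortened copy of the sample line from the post (assumed to be one line).
sample='<h3 class="yt-lockup-title "><a href="/watch?v=axGC_56LMdc" class="yt-uix-tile-link" title="My Cool video"><span>My Cool video</span></a><span class="accessible-description"> - Duration: 27:16.</span></h3>'

printf '%s\n' "$sample" | extract_fields
```

Tab-separated output was picked here because titles can contain commas and spaces, which makes later splitting with cut or awk less error-prone.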

A sample of the full grep command I've experimented with, which removes the bulk of the extraneous gobbledygook, looks something like:
Code:

grep -o '<h3 class="yt-lockup-title ">.*</span></h3><div class="yt-lockup-byline ">' YTsearchresults | grep Cool
I haven't yet gotten to the point of making the data presentable to the user as something like:

1. My Cool video, Duration 15:45, https:// www dot youtube dot com forward slash watch?v=saGM_5OZMqi
2. My Cool video, Duration 10:04, https:// www dot youtube dot com forward slash watch?v=xgJb_2OHRdy
3. My Cool video, Duration 22:13, https:// www dot youtube dot com forward slash watch?v=wlJM_54ESpc

(I'm trying to avoid posting invalid URLs: obviously https:// www dot youtube dot com forward slash watch?v= should actually be a valid link to a youtube video)
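
A numbered list like the one above could be produced with awk, assuming the title, duration and link have already been extracted into tab-separated lines. That input layout and the name number_menu are assumptions made for this sketch, and the two records fed to it are the fictitious ones from the list.

```shell
# Sketch only: turn "title<TAB>duration<TAB>link" lines into a numbered menu.
# number_menu is a hypothetical helper name.
number_menu() {
  # NR is awk's running record number, so it doubles as the menu index.
  awk -F'\t' '{ printf "%d. %s, Duration %s, %s\n", NR, $1, $2, $3 }'
}

# Two fictitious records, just to show the layout.
printf '%s\t%s\t%s\n' \
  'My Cool video' '15:45' 'https://www.youtube.com/watch?v=saGM_5OZMqi' \
  'My Cool video' '10:04' 'https://www.youtube.com/watch?v=xgJb_2OHRdy' |
  number_menu
```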

PS I'm aware that it's considered bad form or whatever to use text querying tools and regex to parse html and I more or less understand why (the target web coding is sure to change sooner or later and so break things). I've fiddled a little bit with xmllint in pursuing my aims but the fact of the matter is that I at least have some basic experience with and knowledge about tools like grep, sed, and awk, while I am starting from 0 when it comes to using xml/html parsing tools. So I'm not trying to rule out using such tools for this task, just beginning where I feel a bit more comfortable/competent.

Here's how someone did something like this in python: https://stackoverflow.com/questions/...e-search-query

Somewhat similar to what I'm aiming for: https://www.commandlinefu.com/comman...utube-playlist

Another, more outdated resource: https://evilshit.wordpress.com/2013/...-from-youtube/

individual 07-20-2019 04:24 PM

There is already a program written in Python called mpsyt. It acts as the front-end to Youtube, and relies on Pafy to communicate with the Youtube API. If I'm not mistaken it uses youtube-dl to download/stream videos.

If you don't want to go that route, though, you might want to check out pup, which is a command-line HTML parser with CSS selectors. There are several examples of how to use it on the Github page.

EDIT: Or if you really want to use only standard command-line tools such as grep and cut, here's something to get you started.
Code:

f='data.html'

# find and store the URL/title line.
match="$(grep -F 'watch?v=' "$f")"

# extract the URL from the line.
url="$(grep -oE '/watch\?v=[^"]+' <<< "$match")"

# extract the title from the line.
title="$(grep -oE 'title="[^"]+"' <<< "$match")"

# the duration is on another line. try to match it based on the text "Duration."
# this will fail if you aren't using an English locale.
# EDIT2: PCRE is better than extended regex here.
duration="$(grep -oP 'Duration: [0-9]+(?::[0-9]+)+' "$f" | cut -d' ' -f2)"

# print the three values, one per line.
printf '%s\n' "$url" "$title" "$duration"

Output
Code:

/watch?v=axGC_56LMdc
title="My Cool video"
27:16
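
When the file holds several results, each of the greps above prints one match per line, so the three lists can be stitched together side by side with paste. This is a rough sketch rather than a tested solution: zip_fields is a made-up name, mktemp is assumed to be available, and it only lines up correctly if every result yields exactly one link, one title and one duration.

```shell
# Sketch only: pair up the per-field grep results from a saved results file.
# zip_fields is a hypothetical helper name; usage: zip_fields data.html
zip_fields() {
  f="$1"
  tmp="$(mktemp -d)" || return 1
  grep -oE '/watch\?v=[^"]+' "$f" > "$tmp/urls"
  # cut -d'"' -f2 keeps only the text between the quotation marks.
  grep -oE ' title="[^"]+"' "$f" | cut -d'"' -f2 > "$tmp/titles"
  # Extended regex is enough for this particular pattern.
  grep -oE 'Duration: [0-9]+(:[0-9]+)+' "$f" | cut -d' ' -f2 > "$tmp/durations"
  # paste joins line N of each file with tabs: title, duration, link.
  paste "$tmp/titles" "$tmp/durations" "$tmp/urls"
  rm -rf "$tmp"
}
```

If the three counts differ (for example, a result without a duration), the columns drift out of step, so a per-line approach is safer for real pages.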


jamtat 07-20-2019 04:43 PM

Thanks for pointing that out, individual. I actually think I ran across that at some point as I was researching this, but I obviously didn't look into it very carefully. It does look like it should do what I need, so I'll start experimenting with it soon. The project I began researching and implementing is still an interesting one that I'd like to resolve. Even though there may already be a wheel, it can still be a valuable learning experience to try and invent a new one, so I still welcome input on the project I posted about here. Bearing in mind, of course, that as things stand the mps solution is likely to be a lot less fragile/prone to breakage, since it apparently uses python libraries designed from the ground up to interface with web pages.

LATER EDIT: thanks for later posting tips on how to accomplish my aim using alternate tools.

jamtat 07-22-2019 03:43 PM

Continuing to experiment with resolving issues, though it does look as though mpsyt does pretty much what I want. Probably the most important thing in adapting individual's suggestions to my situation is the fact that the input file consists of many lines.

What I've come up with is a two-step process that first grabs the youtube search results page, does an initial filtering, and saves the results to a local file. As follows:
Code:

curl https:// www dot youtube dot com/ results?search_query=mycool+video+7%2F18%2F15 | grep Cool >YTsearchresults
I've adapted individual's suggested code to iterate through all lines in the resulting file and to show sufficient information in the terminal, once the script is run on the resulting file. I found that, since I do an initial filtering when the search results page is cURL'd, the match variable used in his sample is unneeded. The content of the modified script is as follows:
Code:

#!/usr/bin/env bash
f='YTsearchresults'
while IFS= read -r line; do

# extract the title from the line. cut -d'"' -f2 keeps only the text between the quotes;
# the second cut keeps the first three words (titles can get way too long).
title="$(grep -oE 'title="[^"]+"' <<< "$line" | cut -d'"' -f2 | cut -d' ' -f1-3)"
echo "Title: $title--"

# extract the URL from the line.
url="$(grep -oE '/watch\?v=[^"]+' <<< "$line")"
echo "https://www.youtube.com$url"

# match the duration based on the text "Duration:".
# this will fail if you aren't using an English locale.
duration="$(grep -oP 'Duration: [0-9]+(?::[0-9]+)+' <<< "$line" | cut -d' ' -f2)"
echo "Duration=$duration"
printf "\n" # add a newline between results for easier reading of output

done < "$f"

It's pretty kludgy with the multiple echo and grep commands but it does, at least under current youtube search page conventions, get the job done. The output of the script is kind of clunky, with title, duration, and URL ending up on different lines, so it doesn't look nearly as nice as mpsyt output. Plus, viewing the relevant URLs is a manual process that requires copying and pasting the URL into some downloader/viewer. All that is handled far more smoothly by mpsyt. And, as mentioned, mpsyt is bound to be more robust over the long run, since it would likely adapt much more readily to updates in web technology.
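
One way the two rough edges mentioned above (fields on separate lines, manual copy and paste) might be smoothed out is sketched below. It is only an illustration: print_results and nth_url are made-up helper names, the input is assumed to be a file like YTsearchresults with one result per line, and only the first match of each field on a line is used.

```shell
# Sketch only: print one numbered line per result found in FILE.
# print_results is a hypothetical helper name; usage: print_results FILE
print_results() {
  n=0
  while IFS= read -r line; do
    url=$(printf '%s\n' "$line" | grep -oE '/watch\?v=[^"]+' | head -n 1)
    [ -n "$url" ] || continue   # skip lines that carry no video link
    title=$(printf '%s\n' "$line" | grep -oE ' title="[^"]+"' | head -n 1 | cut -d'"' -f2)
    duration=$(printf '%s\n' "$line" | grep -oE 'Duration: [0-9]+(:[0-9]+)+' | head -n 1 | cut -d' ' -f2)
    n=$((n + 1))
    printf '%d. %s, Duration %s, https://www.youtube.com%s\n' "$n" "$title" "$duration" "$url"
  done < "$1"
}

# Sketch only: print just the link of result N (hypothetical helper name).
# The link is the last whitespace-separated field of each menu line.
nth_url() {
  print_results "$1" | awk -v n="$2" 'NR == n { print $NF }'
}

# Possible use, not run here:
#   print_results YTsearchresults
#   youtube-dl "$(nth_url YTsearchresults 2)"
```

A single printf per result keeps each entry on one line, and picking the link by number avoids the copy-and-paste step, though it still breaks as soon as the page markup changes.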

In any case, if someone with fairly modest technical acumen becomes interested in doing something along these lines, these rudiments could serve as a good starting point. It should be pretty easy to enhance and extend them on the basis laid out.

