Script

orangepeel190 · 07-10-2020, 01:48 AM

Hi!

I am trying to draft up a script to search a website for a file name that changes week to week.

The file name's extension is .mp3 (simple!)
One part of the file name does not change (again, simple).

The tricky part is that the file is uploaded with a differently formatted name each week. Sometimes it is:

#FILE="`date -dsunday +'%d-%m-%Y'`-xxx_news.mp3"
#FILE="`date -dsunday +'%d-%m-%Y'`_xxx_news_64.mp3"
#FILE="`date -dsunday +'%d%m%Y'`_xxx_news_64.mp3"
#FILE="`date -dsunday +'%d%m%Y'`-xxx_news_64.mp3"
#FILE="xxx_news_64-`date -dsunday +'%d-%m-%Y'`.mp3"
#FILE="xxx_news_32-`date -dsunday +'%d-%m-%Y'`.mp3"
#FILE="xxx_news_64_`date -dsunday +'%d-%m-%Y'`.mp3"
#FILE="xxx_news_32_`date -dsunday +'%d-%m-%Y'`.mp3"
#FILE="xxx_news_64-`date -dsunday +'%d%m%Y'`.mp3"
#FILE="xxx_news_64_`date -dsunday +'%d%m%Y'`.mp3"

As you can see, some parts of the file name on the website remain constant, yet others (including the date string) change.

I am seeking assistance in drafting a line for the download script that can find the *news*.mp3 file on the website and download it to my system, where it will cp/mv the file to a dedicated location and save it under a "known" filename.

Thoughts/comments/suggestions?

Thank you

individual · 07-10-2020, 03:12 AM

What format are the file links being served in? Is it HTML, JSON, plain text? What language did you want/need to write your script in?

Is the line '#FILE="`date -dsunday +'%d-%m-%Y'`-xxx_news.mp3"' supposed to have a '_32' in it? The other lines had either _32 or _64.

orangepeel190 · 07-10-2020, 03:17 AM

Thanks for your reply.

The file is located on a Webpage - so we will have to navigate HTML.

As for the file name - yes, sometimes it is 32bit (hence the 32) or sometimes 64bit (hence the 64). I have seen some files that have not had either 32/64 contained ... that is why I am trying to narrow down and locate the file (current Sunday date) and search by *news* and *.mp3.

I am struggling how to get it to work out .... Especially when the file name format changes (depending on the uploader - which we cannot control).

Thanks

individual · 07-10-2020, 03:20 AM

Could you provide the actual HTML from the website, or a close mock-up if that's not possible? What programming language are you using?

Turbocapitalist · 07-10-2020, 03:25 AM

Do you have access to the file system there or do you have to use HTTP / HTTPS?

I'm not sure of another way than to just guess at the names that might be there and try them all each time using --input-file with wget or --files-from with rsync.

Code:

. . .

d=$(date -d 'last sunday' +'%Y%m%d')

echo D=$d

tmp=$(mktemp --suffix="-$d")
dir=$(mktemp --directory --suffix="-$d")

# clean up temp file and directory upon any type of EXIT
trap 'rm -f "$tmp"; rm -rf "$dir";' 0

# make an exhaustive list of possible file names
cat << EOF > "$tmp"
$(date -d $d +"%d-%m-%Y-xxx_news.mp3")
$(date -d $d +"%d-%m-%Y_xxx_news_64.mp3")
$(date -d $d +"%d%m%Y_xxx_news_64.mp3")
$(date -d $d +"%d%m%Y-xxx_news_64.mp3")
$(date -d $d +"xxx_news_64-%d-%m-%Y.mp3")
$(date -d $d +"xxx_news_32-%d-%m-%Y.mp3")
$(date -d $d +"xxx_news_64_%d-%m-%Y.mp3")
$(date -d $d +"xxx_news_32_%d-%m-%Y.mp3")
$(date -d $d +"xxx_news_64-%d%m%Y.mp3")
$(date -d $d +"xxx_news_64_%d%m%Y.mp3")
EOF

. . .

Then a mv or cp from the temporary directory using wildcards can convert the file (hopefully there is just one) to a standardized name.
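
A minimal sketch of that last step, assuming the download directory ends up holding exactly one file whose name contains "news" and ends in .mp3. The function name standardize_name and the target file name are made up for illustration; it refuses to guess when it finds no match or more than one.

```shell
# Hypothetical helper: move the single *news*.mp3 found in a directory
# to a fixed, known file name. Fails if there are zero or several matches.
standardize_name() {
    dir=$1
    dest=$2
    count=0
    found=
    for f in "$dir"/*news*.mp3; do
        # An unmatched glob stays literal, so check the file really exists
        [ -e "$f" ] || continue
        count=$((count + 1))
        found=$f
    done
    if [ "$count" -ne 1 ]; then
        echo "expected exactly one match, found $count" >&2
        return 1
    fi
    mv -- "$found" "$dest"
}
```

It would be called as standardize_name "$dir" /some/known/place/news.mp3 before the trap removes the temporary directory.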

shruggy · 07-10-2020, 03:51 AM

Debian has an awesome tool, uscan (part of the package devscripts). It can probably be repurposed for this. See the options --watchfile and --package.

I would think of a watch file similar to this:

Code:

version=4
opts="uversionmangle=s/^(\d\d)-?(\d\d)-?(\d{4})$/$3$2$1/" \
https://example.com/path/to/foo.html \
files/(\d\d-?\d\d-?\d{4})?[-_]?\w+_news[-_]?(?:32|64)?[-_]?(\d\d-?\d\d-?\d{4})?\.mp3 \
20200101

orangepeel190 · 07-10-2020, 07:25 AM

Thanks for the update.....

It would be simpler if the file naming was standardised, but that is the dilemma: the only parts of the file name that stay the same are the .mp3 extension and the word “news”.

The placement of the date string and the “-“ or “_” has been proving problematic. The date string (if kept in the same position) is fine but there seems to be no consistency in the file naming process.

I am just mindful of not being too selective (grep), while there could be many other files with the mp3 extension that I don’t want to pick up. I am aiming to grab the latest file (by date) and download that to my system.
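
One way to stay selective without listing every variant is a single extended regular expression that requires "news", the .mp3 extension, and a date in either position. This is only a sketch: the function name filter_news is made up, and the pattern is inferred from the variants listed in the first post, so it may need widening if the uploader invents a new layout.

```shell
# Hypothetical filter: read candidate file names on stdin and keep only
# those that end in .mp3, contain "news", and carry a dd-mm-yyyy or
# ddmmyyyy date at the start or the end of the name.
filter_news() {
    date_re='[0-9]{2}-?[0-9]{2}-?[0-9]{4}'
    grep -E "^(${date_re}[-_].*news.*|.*news.*[-_]${date_re})\.mp3\$"
}
```

Other .mp3 files on the page, or a news file with no date in its name, do not match and are dropped.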

shruggy · 07-10-2020, 08:06 AM

As individual said in #4 above, if you could provide (the relevant part of) the HTML code of the webpage, that would be helpful. Or maybe the link.

I'm thinking of something similar to this naming pattern. Am I right?

orangepeel190 · 07-10-2020, 06:22 PM

Thanks for the feedback.

It would be far simpler if the file name was standardised, but having to list every potential filename will make it hard. Knowing my luck, the filename will change again and I will miss it because it's not covered in the variable filename list.

Originally the file could be found on an HTML page but I have done further digging.
Finding the original file location has been quite the task. It is attached to an RSS feed, and I believe I have found the original feed.

https://wmrct5.podcaster.de/qnews.rss

I have been looking at the "enclosure type" segment of the feed to extract the audio file - hence this is where the filename changes from week to week. The only part that seems the same, and which I would like to search by, would be ".mp3" and "news" or "qnews". I am thinking as long as "news" and ".mp3" are in the same line, it should be able to download the file from the storage location (which does not appear to change) and then save it locally on my system and cp/mv to a location/filename that suits.

Appreciate your assistance and working with me to overcome this conundrum in searching for a file by two variables

Cheers

individual · 07-10-2020, 08:02 PM

Thanks for providing sample data. A better 'key' to search for would be type="audio/mpeg".

Code:

<enclosure type="audio/mpeg" length="5761233" url="https://wmrct5.podcaster.de/qnews/media/05072020-vk4_qnews_64.mp3"/>

You can use AWK, Perl, or Shell operators to isolate those URLs. Here is an example using Perl.

Code:

perl -aE 'm!audio/mpeg! && m!url="([^"]+)"! && say $1' qnews.rss

Which returns:

Code:

https://wmrct5.podcaster.de/qnews/media/05072020-vk4_qnews_64.mp3
https://wmrct5.podcaster.de/qnews/media/28062020-vk4_qnews_64.mp3
https://wmrct5.podcaster.de/qnews/media/vk4_qnews_32-21-06-2020.mp3
https://wmrct5.podcaster.de/qnews/media/14062020-vk4_qnews_64.mp3
https://wmrct5.podcaster.de/qnews/media/07062020-vk4_qnews_64.mp3

EDIT:
I think it's safe to assume the first matched URL is the latest episode. With that in mind, just exit the script after printing the first match.

Code:

perl -aE 'm!audio/mpeg! && m!url="([^"]+)"! && say $1 and exit' qnews.rss
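
If relying on feed order feels too fragile, the date embedded in each file name can be used to pick the newest URL instead. The following is a sketch only, assuming every name carries a dd-mm-yyyy or ddmmyyyy date as in the sample output above; the function name latest_by_date is made up.

```shell
# Hypothetical helper: read enclosure URLs on stdin and print the one whose
# file name carries the most recent dd-mm-yyyy or ddmmyyyy date.
latest_by_date() {
    # Prefix each URL with its date rewritten as yyyymmdd so that a plain
    # reverse sort puts the newest first. URLs without a recognisable date
    # in the file name are dropped (sed -n ... p).
    sed -n -E 's#^(.*/[^/]*)([0-9]{2})-?([0-9]{2})-?([0-9]{4})([^/]*)$#\4\3\2 &#p' |
        sort -r | head -n 1 | cut -d' ' -f2-
}
```

It can be fed from the first Perl one-liner, for example: perl -aE 'm!audio/mpeg! && m!url="([^"]+)"! && say $1' qnews.rss | latest_by_date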

orangepeel190 · 07-10-2020, 10:03 PM

I am assuming that the latest would (should) be named correctly with the date string. That is highly dependent on the person doing the uploading and file naming.
I see your point in searching by "audio/mpeg" - with the string and location.

I will try and get something working in a bash environment.... I think that I am going to have to use AWK to get it working in a bash script?
I will probably exit the script after the first download (assuming that it is the most recent as loaded to the RSS feed).

Could I use curl or wget?

curl 'https://wmrct5.podcaster.de/qnews.rss' | awk '/audio/mpeg/{system("wget -nc "$2);exit}' FS="

Then output the file to /save/file/here/news.mp3

Just having trouble working out where I am going wrong here ..... Trying to adapt an existing script to this purpose (clearly not working, but I'm giving it a go...)
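
For reference, three things stop the one-liner above from working: the slash inside /audio/mpeg/ has to be escaped, the field separator needs to be a quoted double quote, and with that separator the URL is the sixth field rather than the second. A minimal corrected sketch follows, assuming the enclosure tag sits on one line with the attribute order seen in this feed; the function name first_enclosure_url is made up.

```shell
# Hypothetical helper: print the URL of the first audio/mpeg enclosure
# in an RSS feed read from stdin.
first_enclosure_url() {
    # Split each line on double quotes. In
    #   <enclosure type="audio/mpeg" length="..." url="..."/>
    # the URL is then field 6. This depends on the attribute order.
    awk -F'"' '/audio\/mpeg/ { print $6; exit }'
}

# Usage (needs network access; the output path is the placeholder from the post):
# url=$(curl -sSL 'https://wmrct5.podcaster.de/qnews.rss' | first_enclosure_url)
# [ -n "$url" ] && curl -sSLo /save/file/here/news.mp3 "$url"
```

Printing the URL instead of calling wget through system() keeps the download step separate, so the target file name can be chosen with curl -o or wget -O.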

orangepeel190 · 07-10-2020, 11:45 PM

Can someone check this for any tips/suggestions ... it seems to work ok...although a little clunky

/usr/bin/lynx -source https://wmrct5.podcaster.de/qnews.rss > news.rss

news_work=`grep -i mp3 news.rss | cut -d'"' -f6 | head -n1`
process="Get NEWS PODCAST File - $news_work"
echo "Fetching $news_work from Server "

/usr/bin/curl -SkLo news.mp3 "$news_work"

Turbocapitalist · 07-11-2020, 12:11 AM

The middle part there with AWK is scraping the feed and thus brittle. It will break when the spacing or other layout changes. You might consider a simple perl script in that section to properly parse the feed instead:

Code:

#!/usr/bin/perl -T
use strict;
use warnings;
use XML::Feed;

# Read the feed from the file named on the command line, else from stdin
my $file = shift || '/dev/stdin';

my $feed = XML::Feed->parse($file)
    or die(XML::Feed->errstr);

foreach my $entry ($feed->entries) {
    # Skip entries that have no enclosure (no attached audio file)
    my $enclosure = $entry->enclosure or next;
    print $enclosure->url, qq(\n);
}

exit(0);

From there you can send the output to curl or wget, or call one of them from within perl after additional parsing or pattern matching.

orangepeel190 · 07-11-2020, 12:45 AM

I’m sorry, you’ve lost me..... I'm fairly new to this and thought I'd give it a go.
I’ve really only had limited experience with bash scripts..... didn’t think a Perl script could be run inside a bash script (bin/bash)

Turbocapitalist · 07-11-2020, 03:45 AM

The shell script can call the perl script, just like it could call any other program or script. So if you had the above perl script in /usr/local/bin/newsfeed.pl then you could call it like this:

Code:

#!/bin/sh

PATH=/bin:/usr/bin:/usr/local/bin

set -e

lynx -source https://wmrct5.podcaster.de/qnews.rss > news.rss

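# newsfeed.pl prints one URL per entry; keep only the first (assumed to be the latest)
news_work=$(newsfeed.pl news.rss | head -n 1)
process="Get NEWS PODCAST File - $news_work"
echo "Fetching $news_work from Server "

curl -SkLo news.mp3 "$news_work"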

exit 0

Of course, since perl grew up around this kind of thing, you could do it all (including the rename) within perl with not too many lines extra.