recursively download podcasts
A lot of podcast sites do not put a full list of their episodes in the show's RSS feed, meaning that an ordinary podcatcher won't do the job. Is there a way to scan an entire site (or part of a site) and pick out only the mp3 files to download?
I've tried combinations of `wget -r $site` with no luck. All I can get wget to do is either download the entire site or print a single page to standard output. Any help I could get would be great. Thanks in advance. |
I haven't tried, but I'd suggest adding something like '-A mp3'.
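If it helps to see what `-A` is doing: wget keeps an "accept list" and checks each candidate file name against it before saving. Here is a rough Python sketch of that check (my own approximation of wget's matching, not its actual code; the function name `accepted` is made up):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def accepted(url, patterns):
    """Approximate wget's -A accept list: compare the last path
    component of the URL against each pattern; a plain suffix
    like 'mp3' is treated as the glob '*.mp3'."""
    name = urlparse(url).path.rsplit("/", 1)[-1]
    for pat in patterns:
        if "*" not in pat and "?" not in pat:
            pat = "*." + pat
        if fnmatch(name, pat):
            return True
    return False

urls = [
    "http://example.com/show/episode1.mp3",
    "http://example.com/show/index.html",
]
print([u for u in urls if accepted(u, ["mp3"])])
# ['http://example.com/show/episode1.mp3']
```

So with `-A mp3`, wget still crawls the HTML pages to find links, but only the files matching the accept list end up on disk.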
|
Okay, -A worked, but I've hit another snag. Sometimes a podcast's mp3 files are stored on a separate domain (libsyn.com, in the case of the test podcast). I tried `wget -nd -rc -H -D libsyn.com -A mp3 $site`, but all that gets downloaded is index.html.
Any suggestions? |
Try without the space between -D and libsyn.com:
wget -nd -rcH -Dlibsyn.com -A mp3 $site |
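For reference, `-H` lets wget span to other hosts and `-D` restricts which domains it will follow into. A rough Python sketch of that domain check (my own approximation; the helper name `domain_allowed` is made up):

```python
from urllib.parse import urlparse

def domain_allowed(url, domains):
    """Approximate wget's -D domain list: the URL's host must
    equal one of the listed domains, or be a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in domains)

print(domain_allowed("http://media.libsyn.com/ep1.mp3", ["libsyn.com"]))  # True
print(domain_allowed("http://example.com/ep1.mp3", ["libsyn.com"]))       # False
```

Note that subdomains like media.libsyn.com pass the check, which matters for where the mp3 files actually live.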
Now all I'm getting is a robots.txt file. Contents of robots.txt:
User-agent: *
Disallow: /
I'm not trying to access / though. The actual site I'm using is http://rootsmart.com/category/podcasts |
The robots.txt file comes from media.libsyn.com and bans all bots from the entire site. Maybe you could contact the webmaster(s) of libsyn.com and ask if there might be a way to allow certain well-behaved bots that do not put too much load on their servers.
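You can confirm what that robots.txt means by feeding the two quoted lines to Python's standard-library parser (no network access needed; the URL below is just an example path on that host):

```python
import urllib.robotparser

# The robots.txt quoted above, parsed directly instead of fetched.
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# "Disallow: /" under "User-agent: *" bans every path for every bot.
print(rp.can_fetch("wget", "http://media.libsyn.com/anything.mp3"))  # False
```

Because `/` is a prefix of every path, "Disallow: /" blocks the whole site, not just the root page, which is why wget gives up even though you never asked for /.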
|
Thanks. I will go ahead and do that, and let you know how things turn out.
|
Libsyn still has not contacted me yet, but I think I know what's causing the hang-up. Most of the sites I've tried use WordPress, which is written in PHP. What I think this means is that all the pages outside of index.html (which itself is just an auto-generated file) need to be generated dynamically by PHP, so they don't really exist (which would explain why I'm only getting that one file).
So how can I get the site to create the files? |
From the client's point of view, it makes absolutely no difference whether a file physically exists on the server or whether it is generated on the fly the moment it is requested. If wget finds a robots.txt like the one you posted above, it will stop. PHP has nothing to do with this.
|
Sorry, the PHP post came from trying `wget -r $site -O -` again. So libsyn should not have been involved this time.
|
No need to feel sorry, I was just trying to explain how things work. PHP cannot make a difference, because the client (be it wget or a browser) cannot "see" the difference. If it's not the robots.txt, it might be some client-side script like JavaScript.
|
Okay, guys, maybe it's too late to post, but this forum helped me a lot with wget, and now I know how to download all the mp3s from a directory :), though I had some difficulties with robots.txt. So I issued a command like this: `wget $site` to get the index file, then: `wget -nd -rcH -A mp3 -B $site -F -i index.html` (`-F` forces `-i`'s input file to be treated as HTML, and `-B` supplies the base URL for its relative links). This let me get around the robots.txt restriction!
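The trick above boils down to: fetch the index page once, then pull the mp3 links out of the saved HTML yourself. A minimal Python sketch of that extraction step (the class name and sample HTML are my own, and real pages may link to mp3s in other ways):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class Mp3LinkParser(HTMLParser):
    """Collect href targets ending in .mp3, resolved against a base URL."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.endswith(".mp3"):
                    self.links.append(urljoin(self.base, value))

# A stand-in for a saved index.html:
html = '<a href="/ep/001.mp3">ep 1</a> <a href="about.html">about</a>'
p = Mp3LinkParser("http://example.com/feed/")
p.feed(html)
print(p.links)  # ['http://example.com/ep/001.mp3']
```

Each collected URL could then be handed to wget (or `urllib.request.urlretrieve`) one at a time, which is essentially what the `-F -i index.html` invocation does in a single command.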
|