LinuxQuestions.org


dosnlinux 10-15-2006 08:00 PM

recursively download podcasts
 
A lot of podcast sites do not put a list of all their episodes in the show's RSS feed, meaning that an ordinary podcatcher won't do the job. Is there a way to scan an entire site (or part of a site) and pick out only the MP3 files to download?

I've tried combinations of 'wget -r $site' with no luck. All I can get wget to do is either download the entire site or print a single page to standard output.

Any help I could get would be great.

Thanks in advance.

hobey 10-16-2006 05:51 AM

I haven't tried, but I'd suggest adding something like '-A mp3'.
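
Untested, but something along these lines might do it (with example.com standing in for the actual podcast site):

wget -r -l 2 -nd -A mp3 http://example.com/podcasts/

-r turns on recursion, -l limits the depth, -nd keeps everything in one directory, and -A mp3 tells wget to keep only files ending in mp3.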

dosnlinux 10-16-2006 04:57 PM

Okay, -A worked, but I've hit another snag. Sometimes a podcast's MP3 files are stored on a separate domain (libsyn.com for the test podcast). I tried using `wget -nd -rc -H -D libsyn.com -A mp3 $site`, but all that gets downloaded is index.html.

Any suggestions?

hobey 10-19-2006 05:43 AM

Try without the space between -D and libsyn.com:

wget -nd -rcH -Dlibsyn.com -A mp3 $site
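
For what it's worth, -H (--span-hosts) is what allows wget to leave the starting site at all, and -D (--domains) limits which other hosts it may wander into. Spelled out with long options, the same command would be roughly:

wget --no-directories --recursive --continue --span-hosts --domains=libsyn.com --accept=mp3 $site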

dosnlinux 10-19-2006 06:23 PM

Now all I'm getting is a robots.txt file.

Contents of robots.txt:
User-agent: *
Disallow: /

I'm not trying to access / though. The actual site I'm using is http://rootsmart.com/category/podcasts

hobey 10-20-2006 05:07 AM

The robots.txt file comes from media.libsyn.com and bans all bots from the entire site. Maybe you could contact the webmaster(s) of libsyn.com and ask if there might be a way to allow certain well-behaved bots that do not put too much load on their servers.
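
You can check what a site's robots.txt says by fetching it directly, for example:

wget -q -O - http://media.libsyn.com/robots.txt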

dosnlinux 10-20-2006 02:06 PM

Thanks. I will go ahead and do that, and let you know how things turn out.

dosnlinux 11-23-2006 07:38 PM

Libsyn still has not contacted me, but I think I know what's causing the hang-up. Most of the sites I've tried use WordPress, which is written in PHP. What I think this means is that all the pages outside of index.html (which itself is just an auto-generated file) need to be generated dynamically by PHP, so they don't really exist (which would explain why I'm only getting that one file).

So how can I get the site to create the files?

hobey 11-24-2006 07:10 AM

From a client's point of view, it makes absolutely no difference whether a file physically exists on the server or whether it is generated on the fly the moment it is requested. If wget finds a robots.txt like the one you posted above, it will stop. PHP has nothing to do with this.

dosnlinux 11-24-2006 08:08 AM

Sorry, the PHP post came from trying `wget -r $site -O -` again, so libsyn should not have been involved this time.

hobey 11-25-2006 04:01 AM

No need to feel sorry, I was just trying to explain how things work. PHP cannot make a difference, because the client (be it wget or a browser) cannot "see" the difference. If it's not the robots.txt, it might be some client-side script like JavaScript.

nikonaum 04-16-2007 05:34 AM

Okay guys, it may be too late to post, but this forum helped me a lot with wget, and now I know how to download all the MP3s from a directory :) I also had some difficulties with robots.txt. So I issue a command like this: `wget $site` to get the index file, then `wget -nd -rcH -A mp3 $site -F index.html`. This helped me get around the robots.txt restriction!
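
To spell that out: -F (--force-html) only takes effect together with -i, so the second step is probably better written with the saved page as an input file and -B supplying the base URL for any relative links, something like:

wget $site
wget -nd -rcH -A mp3 -F -i index.html -B $site

As far as I can tell, wget only honours robots.txt for links it discovers while recursing, not for URLs it is handed directly, which is why feeding it the saved index.html gets around the restriction.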

