A lot of podcast sites do not put a list of all their episodes in the show's RSS feed, which means an ordinary podcatcher won't do the job. Is there a way to scan an entire site (or part of a site) and pick out only the mp3 files to download?
I've tried combinations of 'wget -r $site' with no luck. All I can get wget to do is either download the entire site, or print a single page to standard output.
Okay, -A worked, but I've hit another snag. Sometimes a podcast's mp3 files are stored on a separate domain (libsyn.com in the test podcast). I tried using `wget -nd -rc -H -D libsyn.com -A mp3 $site`, but all that gets downloaded is index.html.
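For reference, the accept-list approach with host spanning can be wrapped in a small shell function. This is only a sketch: `fetch_mp3s` and the `WGET` override variable are hypothetical names, not part of wget, and passing the podcast's own domain alongside libsyn.com in `-D` is an assumption (it may help wget follow links on the starting site as well as on the media host).

```shell
# Hypothetical helper: recursively fetch only .mp3 files, following links
# onto the listed domains. Set WGET=echo to print the arguments instead of
# running wget (a dry run).
fetch_mp3s() {
    site=$1       # start URL, e.g. the podcast's front page
    domains=$2    # comma-separated domains wget may follow
    # -r   recurse through links
    # -nd  do not recreate the remote directory tree locally
    # -c   resume partially downloaded files
    # -H   allow spanning to other hosts
    # -D   restrict which domains may be followed
    # -A   accept (keep) only files ending in mp3
    "${WGET:-wget}" -r -nd -c -H -D "$domains" -A mp3 "$site"
}
```

A call might look like `fetch_mp3s "$site" example.com,libsyn.com`, where example.com stands in for the podcast's own domain.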
The robots.txt file comes from media.libsyn.com and bans all bots from the entire site. Maybe you could contact the webmaster(s) of libsyn.com and ask if there might be a way to allow certain well-behaved bots that do not put too much load on their servers.
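To check whether a host's robots.txt really shuts out every bot, a rough test like the one below can help. It is a sketch under assumptions: `bans_all_bots` is a made-up name, and it only recognises the usual form of a blanket ban (a `User-agent: *` group containing a bare `Disallow: /`), not every variant a robots.txt parser would accept.

```shell
# Hypothetical helper: read robots.txt text on stdin and succeed (exit 0)
# if a "User-agent: *" group contains a bare "Disallow: /" rule.
bans_all_bots() {
    awk '
        # Normalise each line: drop a trailing CR and ignore case.
        { sub(/\r$/, ""); line = tolower($0) }
        line ~ /^user-agent:/ {
            if (!in_ua) star = 0                       # a new group begins
            if (line ~ /^user-agent:[ \t]*\*[ \t]*$/) star = 1
            in_ua = 1
            next
        }
        { in_ua = 0 }
        # Inside a group that applies to "*", look for a ban on the whole site.
        star && line ~ /^disallow:[ \t]*\/[ \t]*$/ { found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

With a copy of the file saved locally, `bans_all_bots < robots.txt && echo "all bots are banned"` reports a blanket ban.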
Libsyn still has not contacted me, but I think I know what's causing the hang-up. Most of the sites I've tried use WordPress, which is written in PHP. What I think this means is that all the pages other than index.html (which itself is just an auto-generated file) need to be generated dynamically by PHP, so they don't really exist, which would explain why I'm only getting that one file.
From a client's point of view, it makes absolutely no difference whether a file physically exists on the server or whether it is generated on the fly the moment it is requested. If wget finds a robots.txt like the one you posted above, it will stop. PHP has nothing to do with this.
Okay, guys, it may be too late to post, but this forum helped me a lot with wget, and now I know how to download all the mp3 files from a directory. I had some difficulties with robots.txt, so I issued a command like this: "wget $site", to get the index file, and then: "wget -nd -rcH -A mp3 $site -F index.html". This helped me get around the robots.txt restriction!
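Another route that may be worth trying is wget's own `robots` setting, passed with `-e robots=off`, which tells wget not to fetch or obey robots.txt. The function below is a sketch only: `fetch_mp3s_ignoring_robots` and the `WGET` override are hypothetical names, and the two-second `--wait` is an arbitrary choice meant to keep the load on the server low. Whether ignoring a site's robots.txt is acceptable is something to settle with the site's owner.

```shell
# Hypothetical helper: same mp3-only recursive fetch, but with robots.txt
# handling switched off and a pause between requests.
# Set WGET=echo to print the arguments instead of running wget.
fetch_mp3s_ignoring_robots() {
    site=$1       # start URL
    domains=$2    # comma-separated domains wget may follow
    # -e robots=off  run the wgetrc command "robots = off"
    # --wait=2       wait two seconds between retrievals
    "${WGET:-wget}" -r -nd -c -H -D "$domains" -A mp3 \
        -e robots=off --wait=2 "$site"
}
```

Usage would be along the lines of `fetch_mp3s_ignoring_robots "$site" example.com,libsyn.com`, with example.com again standing in for the podcast's own domain.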