Many podcast sites do not list all of their episodes in the show's RSS feed, which means an ordinary podcatcher won't do the job. Is there a way to scan an entire site (or part of a site) and pick out only the mp3 files to download?
I've tried combinations of 'wget -r $site' with no luck. All I can get wget to do is either download the entire site, or print a single page to standard output.
Okay, -A worked, but I've hit another snag. Sometimes a podcast's mp3 files are stored on a separate domain (libsyn.com in the test podcast). I tried using `wget -nd -rc -H -D libsyn.com -A mp3 $site`, but all that gets downloaded is index.html.
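Separately from anything the server may forbid, one thing worth checking in that command: as I understand wget's documentation, -D limits recursion to the listed domains, so with only libsyn.com listed, pages on the podcast site itself (other than the starting page) would not be followed. A minimal sketch that lists both hosts is below. The function name `fetch_mp3s`, the `DRY_RUN` switch, and the host-extraction step are my own illustration, not something from the thread.

```shell
#!/bin/sh
# fetch_mp3s SITE_URL MEDIA_DOMAIN
# Hypothetical helper: recursively fetch only .mp3 files linked from SITE_URL,
# following links on the site's own host and on MEDIA_DOMAIN.
fetch_mp3s() {
    site=$1
    media_domain=$2
    # Strip the scheme, then everything from the first "/" or ":" onward,
    # leaving just the site's host name for the -D list.
    site_host=$(printf '%s\n' "$site" | sed -E 's#^[a-z]+://##; s#[/:].*$##')
    # With DRY_RUN set, print the command instead of running it.
    ${DRY_RUN:+echo} wget -r -nd -c -H \
        -D "$site_host,$media_domain" -A mp3 "$site"
}

# Usage (not run here; $site is the placeholder used in the thread):
#   DRY_RUN=1 fetch_mp3s "$site" libsyn.com    # show the command only
#   fetch_mp3s "$site" libsyn.com              # actually download
```

Running it with DRY_RUN set first is a cheap way to confirm the -D list looks right before starting a recursive download.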
The robots.txt file comes from media.libsyn.com and bans all bots from the entire site. Maybe you could contact the webmaster(s) of libsyn.com and ask if there might be a way to allow certain well-behaved bots that do not put too much load on their servers.
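For anyone who wants to check this for another host, here is a rough sketch that reads a robots.txt on standard input and reports whether it contains a blanket "Disallow: /" under "User-agent: *". The function name `blocks_all_bots` is made up for illustration, and the parsing is deliberately simplistic (it ignores Allow lines and wildcard paths), so treat it as a quick check rather than a full robots.txt parser.

```shell
#!/bin/sh
# blocks_all_bots: read a robots.txt on stdin.
# Exit status 0 if a "User-agent: *" group contains "Disallow: /", else 1.
blocks_all_bots() {
    awk '
        {
            sub(/#.*/, "")                  # drop comments
            gsub(/^[ \t]+|[ \t\r]+$/, "")   # trim surrounding whitespace
        }
        tolower($0) ~ /^user-agent:/ {
            if (!in_ua) star = 0            # first User-agent line of a new group
            in_ua = 1
            if ($0 ~ /:[ \t]*\*$/) star = 1 # this group applies to every bot
            next
        }
        NF { in_ua = 0 }                    # any other non-empty line ends the User-agent run
        star && tolower($0) ~ /^disallow:[ \t]*\/$/ { blocked = 1 }
        END { exit !blocked }
    '
}

# Usage (not run here):
#   wget -q -O - http://media.libsyn.com/robots.txt | blocks_all_bots \
#       && echo "all bots are banned from the whole site"
```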
Libsyn has not contacted me yet, but I think I know what's causing the hang-up. Most of the sites I've tried use WordPress, which is written in PHP. What I think this means is that all the pages other than index.html (which itself is just an auto-generated file) need to be generated dynamically by PHP, so they don't really exist (which would explain why I'm only getting that one file).
From a client's point of view, it makes absolutely no difference whether a file physically exists on the server or whether it is generated on the fly the moment it is requested. If wget finds a robots.txt like the one you posted above, it will stop. PHP has nothing to do with this.
No need to feel sorry, I was just trying to explain how things work. PHP cannot make a difference, because the client (be it wget or a browser) cannot "see" the difference. If it's not the robots.txt, it might be some client-side script like JavaScript.
Okay, guys, it may be too late to post, but this forum helped me a lot with wget, and now I know how to download all the mp3 files from a directory. I had some difficulties with robots.txt, so I issue a command like this: "wget $site", to get the index file, then: "wget -nd -rcH -A mp3 $site -F index.html". This helped me get around the robots.txt restriction!
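A variation on the same two-step idea, sketched under assumptions and not the exact command quoted above: save the index page, pull the absolute mp3 links out of it with grep, and hand that list to wget with -i. As far as I know, wget only consults robots.txt during recursive retrieval, so fetching an explicit list of URLs does not trigger that check; whether that is appropriate for a given host is still up to its operator. The function name `extract_mp3_links` and the file name episodes.txt are made up for illustration, and the pattern only catches absolute http(s) links, not relative ones.

```shell
#!/bin/sh
# extract_mp3_links: read HTML on stdin and print each absolute .mp3 URL once.
# The pattern stops at quotes, spaces, and angle brackets, so it picks the URL
# out of attributes such as href="..." without parsing the HTML properly.
extract_mp3_links() {
    grep -oE "https?://[^\"' <>]+\.mp3" | sort -u
}

# Usage (not run here; $site is the placeholder used in the thread):
#   wget -q -O index.html "$site"
#   extract_mp3_links < index.html > episodes.txt
#   wget -nd -c -w 2 -i episodes.txt    # -w 2 pauses two seconds between files
```

Because the list is an ordinary text file, it can be inspected or trimmed by hand before anything is downloaded.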