LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
LinuxQuestions.org > Forums > Linux Forums > Linux - Software
Old 10-15-2006, 09:00 PM   #1
dosnlinux
Member
 
Registered: Mar 2005
Distribution: slackware 11, arch 2007.08
Posts: 154

Rep: Reputation: 30
recursively download podcasts


A lot of podcast sites don't put a full list of episodes in the show's RSS feed, which means an ordinary podcatcher won't do the job. Is there a way to scan an entire site (or part of one) and pick out only the mp3 files to download?

I've tried combinations of 'wget -r $site' with no luck. All I can get wget to do is either download the entire site, or print a single page to standard output.

Any help I could get would be great.

Thanks in advance.
 
Old 10-16-2006, 06:51 AM   #2
hobey
LQ Newbie
 
Registered: Feb 2006
Distribution: SuSE 10.0
Posts: 17

Rep: Reputation: 0
I haven't tried, but I'd suggest adding something like '-A mp3'.
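A minimal sketch of how that suggestion combines with the -r from the first post (the URL and depth limit are placeholders, and the script only prints the wget command rather than running it, so nothing is downloaded):

```shell
#!/bin/sh
# Sketch: recurse into a podcast page but keep only mp3 files.
# The URL is a placeholder; -l 2 limits recursion depth as an example.
site="${1:-http://example.com/podcasts}"
# -r recurse, -l 2 depth limit, -nd no directory tree, -A mp3 accept *.mp3
set -- wget -r -l 2 -nd -A mp3 "$site"
# Print the command instead of executing it, so the sketch needs no network.
echo "$@"
```

Drop the `echo` (i.e. run `"$@"` directly) to actually fetch; -A also accepts comma-separated suffix lists like `-A mp3,ogg`.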
 
Old 10-16-2006, 05:57 PM   #3
dosnlinux
Member
 
Registered: Mar 2005
Distribution: slackware 11, arch 2007.08
Posts: 154

Original Poster
Rep: Reputation: 30
Okay, -A worked, but I've hit another snag. Sometimes a podcast's mp3 files are stored on a separate domain (libsyn.com for the test podcast). I tried `wget -nd -rc -H -D libsyn.com -A mp3 $site`, but all that gets downloaded is index.html.

Any suggestions?
 
Old 10-19-2006, 06:43 AM   #4
hobey
LQ Newbie
 
Registered: Feb 2006
Distribution: SuSE 10.0
Posts: 17

Rep: Reputation: 0
Try without the space between -D and libsyn.com:

wget -nd -rcH -Dlibsyn.com -A mp3 $site
 
Old 10-19-2006, 07:23 PM   #5
dosnlinux
Member
 
Registered: Mar 2005
Distribution: slackware 11, arch 2007.08
Posts: 154

Original Poster
Rep: Reputation: 30
Now all I'm getting is a robots.txt file.

Contents of robots.txt:
User-agent: *
Disallow: /

I'm not trying to access / though. The actual site I'm using is http://rootsmart.com/category/podcasts
 
Old 10-20-2006, 06:07 AM   #6
hobey
LQ Newbie
 
Registered: Feb 2006
Distribution: SuSE 10.0
Posts: 17

Rep: Reputation: 0
The robots.txt file comes from media.libsyn.com and bans all bots from the entire site. Maybe you could contact the webmaster(s) of libsyn.com and ask if there might be a way to allow certain well-behaved bots that do not put too much load on their servers.
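To illustrate why wget stops: a "ban everything" robots.txt pairs the wildcard user agent with `Disallow: /`. This sketch hard-codes the robots.txt quoted above (rather than fetching it from media.libsyn.com) and checks for exactly that pattern:

```shell
#!/bin/sh
# The robots.txt quoted earlier in the thread, hard-coded for illustration.
robots='User-agent: *
Disallow: /'
# "User-agent: *" applies the rules to every bot; "Disallow: /" blocks
# the entire site. Together they ban all well-behaved crawlers, wget included.
if printf '%s\n' "$robots" | grep -q '^User-agent: \*$' &&
   printf '%s\n' "$robots" | grep -q '^Disallow: /$'; then
  verdict="all bots banned from the whole site"
else
  verdict="some access allowed"
fi
echo "$verdict"
```

In practice you can inspect a host's rules first with `wget -q -O - http://media.libsyn.com/robots.txt`; wget reads the same file during recursion and stops when it sees this pattern.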
 
Old 10-20-2006, 03:06 PM   #7
dosnlinux
Member
 
Registered: Mar 2005
Distribution: slackware 11, arch 2007.08
Posts: 154

Original Poster
Rep: Reputation: 30
Thanks. I will go ahead and do that, and let you know how things turn out.
 
Old 11-23-2006, 08:38 PM   #8
dosnlinux
Member
 
Registered: Mar 2005
Distribution: slackware 11, arch 2007.08
Posts: 154

Original Poster
Rep: Reputation: 30
Libsyn still has not contacted me, but I think I know what's causing the hang-up. Most of the sites I've tried use WordPress, which is written in PHP. What I think this means is that all the pages outside of index.html (which itself is just an auto-generated file) need to be generated dynamically by PHP, so they don't really exist as files (which would explain why I'm only getting that one file).

So how can I get the site to create the files?
 
Old 11-24-2006, 08:10 AM   #9
hobey
LQ Newbie
 
Registered: Feb 2006
Distribution: SuSE 10.0
Posts: 17

Rep: Reputation: 0
From a client's point of view, it makes absolutely no difference whether a file physically exists on the server or is generated on the fly the moment it is requested. If wget finds a robots.txt like the one you posted above, it will stop. PHP has nothing to do with this.
 
Old 11-24-2006, 09:08 AM   #10
dosnlinux
Member
 
Registered: Mar 2005
Distribution: slackware 11, arch 2007.08
Posts: 154

Original Poster
Rep: Reputation: 30
Sorry, the PHP post came from trying `wget -r $site -O -` again, so libsyn should not have been involved this time.
 
Old 11-25-2006, 05:01 AM   #11
hobey
LQ Newbie
 
Registered: Feb 2006
Distribution: SuSE 10.0
Posts: 17

Rep: Reputation: 0
No need to feel sorry, I was just trying to explain how things work. PHP cannot make a difference, because the client (be it wget or a browser) cannot "see" the difference. If it's not the robots.txt, it might be some client-side script like JavaScript.
 
Old 04-16-2007, 06:34 AM   #12
nikonaum
LQ Newbie
 
Registered: Jun 2006
Posts: 3

Rep: Reputation: 0
Okay, guys, it may be too late to post, but this forum helped me a lot with wget, and now I know how to download all the mp3s from a directory. I also had some difficulties with robots.txt. So I issue a command like this: `wget $site` to get the index file, then `wget -nd -rcH -A mp3 -F -i index.html -B $site`. This helped me get around the robots.txt restriction!
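A sketch of that two-step workaround as a script. The URL is a placeholder, and DRY_RUN=1 makes it print each wget command instead of running it, so the shape of the commands can be checked without network access:

```shell
#!/bin/sh
# Sketch of the two-step workaround: fetch the page once, then feed it
# back to wget as a link list, which sidesteps the recursive robots check.
DRY_RUN=1
site="http://example.com/podcasts"   # placeholder URL
run() { [ "$DRY_RUN" = 1 ] && echo "$@" || "$@"; }
# Step 1: fetch the podcast page itself into index.html.
step1=$(run wget -q -O index.html "$site")
# Step 2: read links from index.html as HTML (-F -i), resolve relative
# links against the site (-B), span hosts (-H), accept only mp3 (-A mp3).
step2=$(run wget -nd -r -c -H -A mp3 -F -B "$site" -i index.html)
printf '%s\n%s\n' "$step1" "$step2"
```

Note that -F only has an effect together with -i, and -B supplies the base URL for any relative links in the saved page; whether bypassing a site's robots.txt is acceptable is, as discussed above, another question.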
 
  


Tags
download, mp3, podcasts, rss, wget




