LinuxQuestions.org
Old 06-12-2012, 11:24 AM   #1
mahkoe
LQ Newbie
 
Registered: Oct 2011
Posts: 26

Rep: Reputation: Disabled
wget -r doesn't download anything except index.html


I'm trying to download all the PDFs from a site. After a quick Google search, I came up with this command:

wget -r -np -A.pdf http://allpianoscores.com

When I run this command, all it does is download index.html and then promptly delete it. Is there something I'm missing in the command, or does the site have some kind of anti-robot protection? If so, how do I get around it?
 
Old 06-12-2012, 03:36 PM   #2
ruario
Senior Member
 
Registered: Jan 2011
Location: Oslo, Norway
Distribution: Slackware
Posts: 2,557

Rep: Reputation: 1762
I suspect it is because wget doesn't follow iframes during a recursive download. Looking at the page, it seems that all of the PDFs are referenced via iframes.

You can work around this as follows. First, create a list of links to all the PDFs:

Code:
seq 1 1111 | sed 's,^,http://www.allpianoscores.com/free_scores.php?id=,' | wget -qO- -i- | sed -n 's,.*iframe src="\(scre/.*\.pdf\)" width.*,http://www.allpianoscores.com/\1,p' > pdflist.txt
(Note: I worked the range out by looking at the first link [free_scores.php?id=1] at the top of the "Bach , Johann Sebastian" page and the last link [free_scores.php?id=1111] at the bottom of the "Weber , Carl Maria von" page).
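
To make the two sed stages of that pipeline easier to follow, here is a small offline sketch that runs each one on its own. The function names make_page_urls and extract_pdf_urls and the sample iframe line are made up for illustration; the sed expressions are the same as in the command above.

```shell
# Stage 1: turn a range of ids into page URLs, one per line.
make_page_urls() {
  seq "$1" "$2" | sed 's,^,http://www.allpianoscores.com/free_scores.php?id=,'
}

# Stage 2: read HTML on stdin, keep only lines with a matching iframe,
# and rewrite each one to the absolute URL of the PDF it embeds.
extract_pdf_urls() {
  sed -n 's,.*iframe src="\(scre/.*\.pdf\)" width.*,http://www.allpianoscores.com/\1,p'
}

# Try stage 1 with a short range instead of 1 to 1111.
make_page_urls 1 3

# Try stage 2 on a hypothetical line of HTML (not real site output).
sample_html='<iframe src="scre/example.pdf" width="600" height="800"></iframe>'
printf '%s\n' "$sample_html" | extract_pdf_urls
```

In the real command, wget -qO- -i- sits between the two stages: it reads the page URLs from stdin and writes every fetched page to stdout.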

The above command takes a while to complete, since it requests all 1111 pages. Once it is done, you can fetch all the PDFs as follows:

Code:
wget -i pdflist.txt
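
If you would rather go easier on the server, a variation along these lines might help. It is only a sketch: the helper names clean_url_list and fetch_pdfs, the 2-second delay and the target directory are my own assumptions, while --wait, --no-clobber, --directory-prefix and -i - are standard wget options.

```shell
# Drop blank lines and duplicate URLs from a list file.
clean_url_list() {
  grep -v '^[[:space:]]*$' "$1" | sort -u
}

# Download every URL in the list ($1) into a directory ($2),
# pausing between requests and skipping files that already exist.
fetch_pdfs() {
  clean_url_list "$1" |
    wget --wait=2 --no-clobber --directory-prefix="$2" -i -
}

# Example call (commented out because it needs network access):
# fetch_pdfs pdflist.txt scores
```

With --no-clobber, re-running the function after an interruption skips files that are already on disk.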

Last edited by ruario; 06-12-2012 at 10:47 PM. Reason: added links; stopped using extended regex so it more closely matches other examples
 
Old 06-12-2012, 04:29 PM   #3
mahkoe
LQ Newbie
 
Registered: Oct 2011
Posts: 26

Original Poster
Rep: Reputation: Disabled
Thanks a bunch, this worked! I'm now checking out the man pages for sed so I can replicate this. Big thanks.
 
Old 06-12-2012, 04:38 PM   #4
ruario
Senior Member
 
Registered: Jan 2011
Location: Oslo, Norway
Distribution: Slackware
Posts: 2,557

Rep: Reputation: 1762
This might be of interest:

http://www.gnu.org/software/sed/manu...pressions.html
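
As a small illustration of sed's regular expression syntax, the sketch below captures the same group once with basic regular expressions (sed's default) and once with extended ones. The sample line is hypothetical, and -E assumes GNU sed or a reasonably recent BSD sed (GNU sed also accepts -r).

```shell
# Hypothetical sample line; not taken from the real site.
line='<iframe src="scre/example.pdf" width="600"></iframe>'

# Basic regular expressions (default): groups are written \( ... \).
bre_result=$(printf '%s\n' "$line" | sed -n 's,.*src="\(.*\.pdf\)".*,\1,p')

# Extended regular expressions (-E): groups are written ( ... ).
ere_result=$(printf '%s\n' "$line" | sed -n -E 's,.*src="(.*\.pdf)".*,\1,p')

printf 'BRE: %s\nERE: %s\n' "$bre_result" "$ere_result"
```

Both forms capture the same text; only the escaping of the parentheses differs.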
 
Old 06-12-2012, 04:40 PM   #5
mahkoe
LQ Newbie
 
Registered: Oct 2011
Posts: 26

Original Poster
Rep: Reputation: Disabled
Awesome, thanks! Now I'm all set.
 
  


Tags
indexhtml, wget




All times are GMT -5.
