LinuxQuestions.org


powah 05-04-2006 02:19 PM

wget fails to download PDF files
 
I want to download all the PDF files from the web site http://www.advancedlinuxprogramming.com/alp-folder

There are about 20 PDF files, so I want to use wget to download them.
However, I cannot figure out the correct way to do that.
I tried the following commands, but they all failed:
$ wget -r -l1 --no-parent -A.pdf http://www.advancedlinuxprogramming.com/alp-folder

$ wget -r --no-parent -A.pdf http://www.advancedlinuxprogramming.com/alp-folder

$ wget --convert-links -r -A pdf http://www.advancedlinuxprogramming.com/alp-folder/

$ wget --convert-links -r -A "*.pdf" http://www.advancedlinuxprogramming.com/alp-folder/

$ wget --version
GNU Wget 1.9+cvs-stable (Red Hat modified)

Copyright (C) 2003 Free Software Foundation, Inc.

I am using FC3 Linux.

jschiwal 05-04-2006 03:03 PM

The robots.txt file doesn't allow it.

You could save that web page in your browser and extract the location of each listed PDF file from the .html file you saved (sed works well for this). Then you could use curl -O in a "for" loop to download each file in your list.
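
Something along these lines might do it (an untested sketch; it assumes you saved the listing page as alp-folder.html and that the href values in it are file names relative to the alp-folder directory):

grep -o 'href="[^"]*\.pdf"' alp-folder.html \
  | sed 's/^href="//; s/"$//' \
  | while read -r file; do
      # fetch each PDF into the current directory, keeping its original name
      curl -O "http://www.advancedlinuxprogramming.com/alp-folder/$file"
    done

If the hrefs turn out to be full URLs instead of bare file names, drop the base URL and pass them straight to curl -O.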

powah 05-04-2006 03:38 PM

I discovered that "wget -erobots=off" makes wget ignore the robots.txt file,
i.e. this downloads all the PDF files:
wget --convert-links -r -A "*.pdf" -erobots=off http://www.advancedlinuxprogramming.com/alp-folder/

Problem is solved.
Thanks!
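
For anyone who finds this thread later, a fuller form of that command (untested as a combined whole, but it just adds -l1 and --no-parent from my earlier attempts so wget only goes one level deep and stays inside the folder) would look something like:

wget -r -l1 --no-parent --convert-links -A "*.pdf" -e robots=off http://www.advancedlinuxprogramming.com/alp-folder/

The -e option simply executes a .wgetrc-style command, so putting "robots = off" in your ~/.wgetrc has the same effect.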

Quote:

Originally Posted by jschiwal
The robots.txt file doesn't allow it.

You could save that web page in your browser and extract the location of each listed PDF file from the .html file you saved (sed works well for this). Then you could use curl -O in a "for" loop to download each file in your list.


