
agtzim 10-31-2012 01:27 PM

wget multiple downloads problem
 
Hello (first post here :o)

So I was trying to download all the lectures from my concurrent programming class (mpla-server.mpla.com/courses/CE123) with wget, and it failed: it only downloaded an index.html.

It's weird because I managed to do it on another class page, e.g. mpla-server.mpla.com/CE124/lectures. I should point out that the latter has both /lectures.php and lectures/, which gives a directory listing with all the PDF files. The first page has hrefs to the PDFs, but when I run wget recursively it doesn't find any PDFs.

Thanks in advance.
Sorry if this has already been answered.

DutchGeek 10-31-2012 05:29 PM

Hi,

Are you saying that the page you are interested in has links to pdf files, but you cannot download them with wget?

try this:
Code:

lynx -dump <website> | awk '/http/ {print $2}' | grep '\.pdf$' > output.txt
it will fetch all the links on the page that end in a .pdf extension and put them in a text file (lynx -dump lists every link in a numbered References section at the end of its output, which is the column the awk picks up)

then you can try:
Code:

for i in $(cat output.txt); do wget "$i"; done
this will loop through the text file and download each link; quoting "$i" keeps odd characters in a URL from being mangled by the shell. I hope it works for you
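
As a side note (not something the loop above needs, just a standard wget feature): wget can also read a list of URLs straight from a file with -i, so the whole loop can be replaced by a single call, assuming output.txt holds one URL per line:
Code:

wget -i output.txt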

agtzim 11-01-2012 06:44 AM

Quote:

Originally Posted by DutchGeek (Post 4819259)

Thanks, man. The first page only has hrefs to the PDFs. The second page has both lectures.php and lectures/; the lectures/ URL gives an "Index of /CE124/lectures" page with the usual Apache footer at the bottom.

The question is why wget can't get the PDFs from the first page.

DutchGeek 11-01-2012 04:14 PM

Not sure why,
but maybe because the first page only has links to the PDFs (not the actual files), and wget is not configured to follow them. The second page has lectures.php, which is probably what you get when you hit it with a browser, but the lectures/ directory listing also exposes the actual PDF files.
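
If that's the case, it's worth telling wget explicitly to recurse one level and to accept only PDFs. A minimal sketch, assuming the PDFs are linked directly from that page and served from the same host (the URL is the one from your first post):
Code:

wget -r -l 1 -nd -A pdf http://mpla-server.mpla.com/courses/CE123
Here -r recurses, -l 1 limits the recursion to links on that page, -nd stops wget from recreating the directory tree locally, and -A pdf keeps only files with a .pdf extension.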

David the H. 11-03-2012 12:09 PM

Many sites check your browser's user-agent string and/or use cookies in order to block mass-downloading programs, and in such cases often return a simple index.html instead of the desired file. It's possible to spoof these things, but it can be more complex and site-specific.

You can start by using the -U option to make wget appear to be another browser, at least.
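
For example, a sketch (the user-agent string here is only an illustration; copy the exact string from your own browser if the site is picky):
Code:

wget -U "Mozilla/5.0 (X11; Linux x86_64)" -r -l 1 -nd -A pdf http://mpla-server.mpla.com/courses/CE123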

