Linux - Newbie
This Linux forum is for members that are new to Linux. Just starting out and have a question? If it is not in the man pages or the how-tos, this is the place!
So I was trying to download all the lectures from my concurrent programming class (mpla-server.mpla.com/courses/CE123) with wget, and I failed (it only downloaded an index.html).
It's weird, because I managed to do it on another class page, e.g. (mpla-server.mpla.com/CE124/lectures). I should point out that the latter had both /lectures.php and lectures/, which gives a directory listing with all the PDF files. The first page has hrefs to the PDFs, but when I try wget recursively it doesn't find any PDFs.
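You can grab the links first with something like this (just a sketch, assuming lynx is installed; any link extractor would do, and the grep pattern is a guess at what your page's hrefs look like):
Code:
lynx -dump -listonly 'mpla-server.mpla.com/courses/CE123' | grep -io 'http[^ ]*\.pdf' > output.txt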
That will fetch all the links on the page that have a .pdf extension and put them in a text file.
then you can try:
Code:
for i in $(cat output.txt); do wget "$i"; done
This will loop through the text file and download each link. I hope it works for you.
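For what it's worth, wget can also read a list of URLs from a file directly, which avoids the loop entirely:
Code:
wget -i output.txt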
Thanks man. The first page has hrefs for the PDFs. The second page has both lectures.php and lectures/. The lectures/ gives a page "index of CE124/lectures", and at the bottom it says something about Apache.
The question is why wget can't get the PDFs from the first page.
Not sure why, but maybe because the first page only has links to the PDFs (not the actual files in a listable directory), and wget is not configured to follow them. The second page has lectures.php, which is probably what you get when you hit it with a browser, but it also has the actual PDF files sitting in that directory.
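Something like this might get wget to follow the hrefs and keep only the PDFs (the recursion depth here is a guess; adjust it to however deep the links sit):
Code:
wget -r -l 1 --no-parent -A pdf 'mpla-server.mpla.com/courses/CE123'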
Many sites check your browser's user-agent string and/or use cookies in order to block mass-downloading programs, and in such cases they often return a simple index.html instead of the desired file. It's possible to spoof these things, but it can be complex and site-specific.
You can start by using the -U option to make wget appear to be another browser, at least.
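For example (the user-agent string below is just one common value, nothing specific to this site; the -e robots=off part only matters if a robots.txt is what's stopping the recursion):
Code:
wget -r -l 1 --no-parent -A pdf -e robots=off \
     -U 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0' \
     'mpla-server.mpla.com/courses/CE123'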