Old 04-06-2013, 03:10 AM   #1
Si14
LQ Newbie
 
Registered: Mar 2013
Posts: 14

Rep: Reputation: Disabled
wget: download a certain link


I want to download a PDF file. The page URL is:
Code:
www.amazon.com/product1/pdf
In the HTML source of that page there are several links (one PDF, plus JPG files, JS files, and so on).

1- I want to download only the PDF file. Its link, extracted from the HTML source above, is:

Code:
http://book.amazon.com/still/10.1202/shelf.201205216/seed/1811_ftp.pdf?v=1&t=hf6hlzrm&fb62f105
It seems that "ftp.pdf" can be used to filter the above PDF link for wget.

2- I want to save the output file as 1811_ftp.pdf, i.e. whatever comes after "seed/" and before "?v=" in the URL.

Thank you for your help.
 
Old 04-06-2013, 03:43 AM   #2
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
In theory something along the lines of a
Code:
curl -s "http://book.amazon.com/still/10.1202/shelf.201205216/seed/1811_ftp.pdf?v=1&t=hf6hlzrm&fb62f105" > ~/1811_ftp.pdf
should work, except that 0) the host name AFAIK is "books" and not "book", and 1) unless you somehow make the download command part of an existing session, or supply the right credentials to log in first (if applicable), it may redirect to another page denying you the download.
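If a login is required, one way to make the download part of a session is curl's cookie jar: -c saves the cookies a login hands out and -b sends them back on later requests. Just a sketch; the login URL and form field names below are made up for illustration:
Code:
# Hypothetical login: the URL and form fields are placeholders, not real ones.
curl -s -c cookies.txt -d 'user=me&pass=secret' 'http://books.amazon.com/login'
# Reuse the saved session cookies for the actual download.
curl -s -b cookies.txt -o ~/1811_ftp.pdf \
  'http://books.amazon.com/still/10.1202/shelf.201205216/seed/1811_ftp.pdf?v=1&t=hf6hlzrm&fb62f105'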
 
Old 04-07-2013, 09:15 PM   #3
Si14
LQ Newbie
 
Registered: Mar 2013
Posts: 14

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by unSpawn View Post
In theory something along the lines of a
Code:
curl -s "http://book.amazon.com/still/10.1202/shelf.201205216/seed/1811_ftp.pdf?v=1&t=hf6hlzrm&fb62f105" > ~/1811_ftp.pdf
should work, except that 0) the host name AFAIK is "books" and not "book", and 1) unless you somehow make the download command part of an existing session, or supply the right credentials to log in first (if applicable), it may redirect to another page denying you the download.
I cannot use the code you mentioned, because:
1- I need to use the following link in the curl command:
Code:
www.amazon.com/product1/pdf
and
2- I need to tell curl to download only the file whose link contains the string "ftp.pdf".
and
3- then save the output file under the name that comes after "seed/" and before "?v=".

I have the following links saved in list.txt; each of them needs to go through the steps above, so I need curl to perform those actions for every entry.

Code:
list.txt:
www.amazon.com/product1/pdf
www.amazon.com/product2/pdf
www.amazon.com/product3/pdf
www.amazon.com/product4/pdf
...
Thank you for your help.
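For a single page, I imagine something along these lines (just a sketch; it assumes the full PDF URL appears verbatim, in double quotes, somewhere in the page's HTML source):
Code:
# Fetch the product page and pull out the first embedded link containing "ftp.pdf".
pdf_url=$(curl -s 'http://www.amazon.com/product1/pdf' \
          | grep -o 'http://[^"]*ftp\.pdf[^"]*' | head -n 1)
echo "$pdf_url"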

Last edited by Si14; 04-07-2013 at 09:17 PM.
 
Old 04-08-2013, 04:48 PM   #4
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
*shrug* When I try your URI I get "We're sorry. The Web address you entered is not a functioning page on our site..."
 
Old 04-09-2013, 04:22 PM   #5
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Assuming the URL is stored in a variable, all it takes is a simple parameter substitution or similar string manipulation.

Code:
url='http://book.amazon.com/still/10.1202/shelf.201205216/seed/1811_ftp.pdf?v=1&t=hf6hlzrm&fb62f105'

fname=${url##*/}        # remove everything up to and including the last "/"
fname=${fname%%[?]*}    # remove the "?" and everything after it

wget -O "$fname" "$url"
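To cover the list.txt part as well, the same substitutions can go in a loop. A sketch, assuming each page embeds exactly one absolute, double-quoted link containing "ftp.pdf" (the grep pattern is a guess based on the example URL above):
Code:
#!/bin/bash
# For each page URL in list.txt: find the embedded "ftp.pdf" link,
# derive the file name from the part between "seed/" and "?v=",
# and download it under that name.
while IFS= read -r page; do
    url=$(curl -s "$page" | grep -o 'http://[^"]*ftp\.pdf[^"]*' | head -n 1)
    [ -z "$url" ] && { echo "no ftp.pdf link found on $page" >&2; continue; }
    fname=${url##*/}
    fname=${fname%%[?]*}
    wget -O "$fname" "$url"
done < list.txt
If the pages require a session, combine this with the cookie handling mentioned earlier (wget has --load-cookies for that).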
 
  

