wget - using --user-agent option still results in 403/forbidden error
I'm hoping someone can tell me what I might be doing wrong.
A number of websites have directories or graphics I want to download. I can almost always access them through a browser, or by downloading a file directly with wget: wget http://www.somesite.com/dir1/dir2/pic1.jpg. But if I try to use -r or -m, I get the dreaded "403/Forbidden" error, despite being able to open the same files in my browser. I've tried many combinations of the -U option: -U firefox, -U Mozilla, -U "Mozilla, platform, blah blah", and they NEVER work. Is there something else I can do? Most of the time when I Google this issue, the solutions stop at forging a user agent, and that never seems to work for me. What am I doing wrong?
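To be concrete, here is the shape of what I've been running (the URL is the same placeholder as above, and the -U string is only one of the many I've tried):
Code:
# Recursive fetch with a forged user agent -- still returns 403 for me:
wget -r -U "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0" \
     http://www.somesite.com/dir1/dir2/
|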
What's the site in question?
|
Can you browse to http://www.somesite.com/dir1/dir2/pic1.jpg traditionally?
|
I can see the page/graphics through a browser.
This is the company I work for, and like I said in my original post, I can get the link through a browser:
http://www.barclayvouchers.co.jp/images/index/mainvisual.jpg Naturally, if I wget the graphic directly, it also works, but that defeats the purpose of using wget (having to address each filename specifically). But that isn't the question; this happens to me on a LOT of websites. Not just my own company's website: MOST of the time I get a 403/Forbidden error. So the original question remains: am I doing something wrong with wget? I've tried using -m and -U with all types of descriptions after -U, and they never work; I always end up at 403/Forbidden. |
You should be able to find 4 events in the logs: 2 for the browser (1 each, success and fail) and 2 for wget (1 each, success and fail). Again, all 4 details should be in the logs. I'm assuming the 403 error is coming from Apache, so:
- What are the owner:group permissions on /path/to/dir1/dir2/pic1.jpg?
- What .htaccess files (or comparable httpd.conf inclusions), if any, are in place?
- Have you tried the dreaded -r or -m options from another host?
- What wget version? In a terminal:
Code:
wget --version | head -1
Please let us know.
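To be specific, I mean along these lines (paths are the placeholders from above; the access-log location varies by distro):
Code:
# Ownership and permissions on the file wget is being denied:
ls -l /path/to/dir1/dir2/pic1.jpg
# Any .htaccess files between the docroot and the image:
find /path/to -name .htaccess
# Watch the browser and wget requests arrive in real time:
tail -f /var/log/httpd/access_log
|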
It is likely the .htaccess file is set up to prevent document/image "leeching" by direct download.
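Typically something along these lines (a sketch only: mod_rewrite has to be enabled, and the domain and extensions here are made up):
Code:
# Forbid image requests whose Referer is set but isn't this site:
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !somesite\.com [NC]
RewriteRule \.(jpe?g|gif|png)$ - [F]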
|
.htaccess file
@NyteOwl:
I guess that's what I'm asking: will an .htaccess file block wget even if the user agent is forged, and even if the files are accessible through a web browser? Every time I get a 403 error, both of those conditions are met.
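For what it's worth, here's how I've been comparing the two cases (URLs are the placeholders from my first post; -S prints the server's response headers and --spider skips the actual download):
Code:
# Direct request for the file -- this works for me:
wget -S --spider http://www.somesite.com/dir1/dir2/pic1.jpg
# Recursive request with a forged agent -- this is what 403s:
wget -S -r -U "Mozilla/5.0" http://www.somesite.com/dir1/dir2/
|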
The most recent problem I've had is on my own company's site, but I've had it at other sites too, so I don't have access to the .htaccess files. Obviously the .htaccess files are preventing wget from downloading; I was trying to figure out whether that's a limitation of wget or whether I wasn't using it correctly.
And I don't dread the -r and -m options, I dread the 403/Forbidden error. :)
I guess the limitation is that wget is going to be stopped by an .htaccess file, regardless of changing the user-agent. |
For the site(s) that you have control over, the logs will have entries for the 403 "error" as well as the GET from the browser session.
Code:
grep -i mozilla /path/to/httpd.log
There are no separate "browser logs" or "wget logs". Both are clients asking Apache (the server) for the file, so all requests should be logged in the Apache log file. Now, as for wget options, this works for me here:
Code:
wget --random-wait -r -p -e robots=off -U mozilla http://www.barclayvouchers.co.jp/images/index/mainvisual.jpg
I hope this helps.
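For the record, what those switches do (per the wget man page):
Code:
# --random-wait    vary the pause between retrievals (masks automated fetching)
# -r               recursive retrieval
# -p               also fetch page requisites (inline images, CSS, and so on)
# -e robots=off    run a .wgetrc-style command: here, ignore robots.txt
# -U mozilla       send "mozilla" as the User-Agent header
|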
You answered my question, thanks.
|
Quote:
Code:
Wget - The non-interactive network downloader.
The googletubes are chock-full of people who have (probably, for years) wondered why wget doesn't seem to grab arbitrary directories.
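(The underlying reason, as I understand it: HTTP has no "list this directory" operation, so wget -r can only discover URLs from links in pages it has already fetched. A sketch, with made-up names:)
Code:
# Recursion starts from whatever the server returns for the directory URL
# (an index page, if one exists) and follows the links found there:
wget -r -l1 http://www.somesite.com/dir1/
# A file that nothing links to is invisible to recursion; it has to be
# named explicitly ("unlinked.jpg" is hypothetical):
wget http://www.somesite.com/dir1/unlinked.jpg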
|
Good points
Quote:
[...] browser, using wget does not save steps.)
|
If the problem is the referrer, try (the target here is the placeholder URL from the first post):
Code:
wget -e robots=off --referer=http://example.com/ \
     http://www.somesite.com/dir1/dir2/pic1.jpg
|
One step ahead of you.
Thanks for the tip, tho', I'll try it.
I did find that by loading a whole boatload of options, such as wget -e robots=off -m -r -l3 -np -nd -U Mozilla (site), AND by modifying the .wgetrc file, it will load the entire site, even if you're just hunting for one or two directories. Kind of silly to restrict a directory via an .htaccess file to prevent mass downloads when downloading the entire site still works. :) Thanks for the tips.
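The .wgetrc edits I mean are along these lines (option names are from the wgetrc section of the manual; the user-agent string is just an example):
Code:
# ~/.wgetrc
robots = off               # same effect as -e robots=off on every run
user_agent = Mozilla/5.0   # default User-Agent header to send
|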
For example:
Code:
RewriteCond %{HTTP_REFERER} ^(http|nttp).*$
|