wget - using --user-agent option still results in 403/forbidden error
I'm hoping someone can tell me what I might be doing wrong.
A number of websites have directories or graphics I want to download. I can almost always access them through a browser, or by downloading a single file directly with wget.
But if I try to use -r or -m, I get the dreaded "403/Forbidden" error, despite being able to open the file in my browser.
I've tried many combinations of the -U option: -U firefox, -U Mozilla, -U "Mozilla, platform, blah blah", and they NEVER work.
Is there something else I can do? Most of the time when I Google this issue, the solutions stop with forging a user agent. That never seems to work for me.
Naturally, if I wget the graphic directly it also works, but that defeats the purpose of using wget (having to address each filename specifically).
But that isn't the question - this happens to me on a LOT of websites. Not only my own company's website, but MOST of the time I get a 403/Forbidden error.
So the original question remains: am I doing something wrong with wget? I've tried using -m and -U with all types of descriptions after -U and they never work; I always end up at 403/Forbidden.
Quote:
I can almost always access them through a browser or by directly downloading the file using wget
wrt "almost always"...what do the logs say about these events specifically?
You should be able to find 4 events in the logs, 2 for the browser (1 each Success and Fail) and 2 for wget (1 each Success and Fail)
Again, all 4 details should be in the logs.
I'm assuming the 403 error is from Apache, so...
What are the owner:group perms on /path/to/dir1/dir2/pic1.jpg?
What are, if any .htaccess files (or comparable httpd.conf inclusions)?
Have you tried from another host to use the dreaded -r or -m options?
wget version? terminal >
wrt "almost always"...what do the logs say about these events specifically?
It happens with a number of sites and from various machines, so that's why I said "almost always". One time I was able to actually download an entire site when all I wanted was some PDFs, but other than that I get 403/Forbidden errors.
Quote:
Originally Posted by Habitual
You should be able to find 4 events in the logs, 2 for the browser (1 each Success and Fail) and 2 for wget (1 each Success and Fail)
Why would I find a fail entry in the browser logs, and why would I find a success in the wget logs? It works in a browser and always gets a 403/Forbidden using wget.
Quote:
Originally Posted by Habitual
I'm assuming apache error 403 is from apache,so ...
What are the owner:group perms on /path/to/dir1/dir2/pic1.jpg?
What are, if any .htaccess files (or comparable httpd.conf inclusions)?
It is from Apache.
The most recent problem was on my own company's site, but I've had it at other sites too, where I don't have access to the .htaccess files. Obviously the .htaccess files are preventing wget from downloading; I was trying to figure out whether it's a limitation of wget or whether I wasn't using it correctly.
Quote:
Originally Posted by Habitual
Have you tried from another host to use the dreaded -r or -m options?
I have tried from another host, same results.
And I don't dread the -r and -m options, I dread the 403/Forbidden error.
Quote:
Originally Posted by Habitual
wget version? terminal >
Code:
wget --version | head -1
lsb_release -drc
output please. Thanks.
I can do that once I'm back at work. At home I only have access to wget on *cough* Windows *cough*.
I guess the limitation is that wget is going to be stopped by an .htaccess file, regardless of changing the user-agent.
For the site(s) you have control over, the logs will contain the 403 responses and the GET requests from the browser session.
Code:
grep -i mozilla /path/to/httpd.log
or similar.
There is no "browser log" or "wget log".
Both are clients asking apache (server) for the file and hence all requests should be logged in the apache log file.
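To make that concrete, here is a sketch of pulling the two kinds of requests apart by User-Agent in an Apache "combined"-format access log. The log path and both log lines below are fabricated for illustration; a real server's paths, IPs, and UA strings will differ.

```shell
# Two made-up Apache combined-format entries: one browser hit (200)
# and one wget hit (403). Real log location is distro-dependent,
# e.g. /var/log/apache2/access.log or /var/log/httpd/access_log.
cat > /tmp/sample_access.log <<'EOF'
203.0.113.5 - - [01/Jan/2024:12:00:00 +0000] "GET /dir1/dir2/pic1.jpg HTTP/1.1" 200 51234 "-" "Mozilla/5.0 (X11; Linux x86_64)"
203.0.113.5 - - [01/Jan/2024:12:00:05 +0000] "GET /dir1/dir2/pic1.jpg HTTP/1.1" 403 199 "-" "Wget/1.21.3"
EOF

# wget identifies itself in the User-Agent field (unless -U overrides
# it), so its requests can be isolated like this:
grep -i 'wget' /tmp/sample_access.log

# Field 9 in the combined format is the status code; field 7 the path:
awk '{print $7, $9}' /tmp/sample_access.log
```

The point being: both clients hit the same server, so the success and the failure sit side by side in the same file, with the status code telling you which was which.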
Now, as for wget options, this works for me here:
Code:
wget --random-wait -r -p -e robots=off -U mozilla http://www.barclayvouchers.co.jp/images/index/mainvisual.jpg
wrt: "403/Forbidden", this can happen 2 ways that I know of, the /path/to/dir1/dir2/pic1.jpg has permissions that the apache software/daemon doesn't have access to, OR, the robots.txt and/or .htaccess prevents it.
Quote:
Originally Posted by Habitual
There is no "browser log" or "wget log".
Both are clients asking apache (server) for the file and hence all requests should be logged in the apache log file.
That's why I was confused; I was talking about the client side and you were talking about the server side.
Quote:
Originally Posted by Habitual
Now, as for wget options, this works for me here:
Code:
wget --random-wait -r -p -e robots=off -U mozilla http://www.barclayvouchers.co.jp/images/index/mainvisual.jpg
Yeah, that works for me too, but the original problem remains. I can't just use the directory and get the files, I have to specify the file, which defeats the purpose of using wget.
Quote:
Originally Posted by Habitual
wrt: "403/Forbidden", this can happen 2 ways that I know of, the /path/to/dir1/dir2/pic1.jpg has permissions that the apache software/daemon doesn't have access to, OR, the robots.txt and/or .htaccess prevents it.
Okay, that's pretty much what I was looking for: wget being stopped by an .htaccess file. I don't think it's file permissions, since the file is available through a browser or by specifying it directly via wget. Either way, that's my answer; wget is limited.
Sorry, let me clarify; it defeats my purpose for using wget (like I said earlier, if I have to access said files through a browser, using wget does not save steps).
Quote:
Originally Posted by Habitual
The googletubes are chock-full of people who have (probably, for years) wondered why wget doesn't seem to grab arbitrary directories.
In this case (and others when I use it) I know what the directories are, I just was hoping to save time over right-click/save as by using wget.
Quote:
Originally Posted by Habitual
More like the system admin has done a good job of keeping things tight.
Quote:
But that isn't the question - this happens to me on a LOT of websites. Not only my own company's website, but MOST of the time I get a 403/Forbidden error.
Quote:
Yeah, that works for me too, but the original problem remains. I can't just use the directory and get the files, I have to specify the file, which defeats the purpose of using wget.
If you use the -r option, you can very likely point it at an index file and all linked content will be downloaded.
That's what I was trying to find out; was the problem the referrer, or something else?
Thanks for the tip, tho', I'll try it.
Quote:
Originally Posted by mina86
If you use -r option you very likely can point to an index file and all linked content will be downloaded.
"Very likely" - based on actual experience, or the way it's supposed to work? As I mentioned earlier, I've tried both -r and -m (with -r I usually add -l1 as well) and still get the error.
I did find that by loading a whole boatload of options, such as wget -e robots=off -m -r -l3 -np -nd -U Mozilla (site), AND by modifying the .wgetrc file, it will load the entire site, even if you're just hunting for one or two directories. Kind of silly to restrict a directory via an .htaccess file to prevent mass downloads when downloading the entire site still works.
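For reference, the .wgetrc side of that can be sketched like this. The values are illustrative, not the actual settings used in the thread; these directive names come from the GNU Wget startup-file syntax:

```
# ~/.wgetrc (illustrative): defaults applied to every wget run
user_agent = Mozilla/5.0 (X11; Linux x86_64)
robots = off
wait = 1
random_wait = on
```

Putting the user agent and robots setting here keeps the command line short, so per-run flags only need to cover recursion depth and paths.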
I guess that's what I'm asking; will an .htaccess file block wget:
- even if the user agent is forged, and
- even if the files are accessible through a web browser?
Every time I get a 403 error, both of the above conditions are met.
It's not a matter of blocking wget per se; it's a matter of how the files are accessed. If they are requested as part of rendering a web page (that is, the HTTP request carries a Referer pointing at one of the server's own pages), they can be served. The server can likewise be configured to refuse requests that don't arrive that way.
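A hedged sketch of what that implies on the wget side: supply a Referer pointing back at the site, in addition to the forged User-Agent. The URLs are placeholders, and whether this helps depends entirely on how the server's rules are written; the snippet only assembles and prints the command so it can be inspected before running.

```shell
# Assemble a wget invocation that sends browser-like headers
# (forged User-Agent plus a Referer naming the site itself).
# site/target are placeholders, not URLs from this thread.
site="http://www.example.com"
target="$site/images/"

cmd="wget -r -l1 -np -e robots=off -U 'Mozilla/5.0 (X11; Linux x86_64)' --referer='$site/' '$target'"

# Show the command rather than running it, so it can be checked first:
printf '%s\n' "$cmd" | tee /tmp/wget_cmd.txt
```

If the 403 persists even with both headers set, that points back at server-side permissions or an explicit deny rule rather than anything wget can forge its way past.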