wget - using --user-agent option still results in 403/forbidden error
I'm hoping someone can tell me what I might be doing wrong.
A number of websites have directories or graphics I want to download. I can almost always access them through a browser, or by downloading a single file directly with wget.
But if I try to use -r or -m, I get the dreaded "403/Forbidden" error, despite being able to open the file in my browser.
I've tried many combinations of the -U option: -U firefox, -U Mozilla, -U "Mozilla, platform, blah blah", and they NEVER work.
Is there something else I can do? Most of the time when I Google this issue, the solutions stop with forging a user agent. That never seems to work for me.
Naturally, if I wget the graphic directly it also works, but that defeats the purpose of using wget (having to address each filename specifically).
But that isn't the question - this happens to me on a LOT of websites. Not only my own company's website, but MOST of the time I get a 403/Forbidden error.
So the original question remains: am I doing something wrong with wget? I've tried using -m and -U with all types of descriptions after -U and they never work; I always end up at 403/Forbidden.
Quote:
I can almost always access them through a browser or by directly downloading the file using wget
wrt "almost always"...what do the logs say about these events specifically?
You should be able to find 4 events in the logs, 2 for the browser (1 each Success and Fail) and 2 for wget (1 each Success and Fail)
Again, all 4 details should be in the logs.
I'm assuming the 403 error is from Apache, so...
What are the owner:group perms on /path/to/dir1/dir2/pic1.jpg?
What are, if any .htaccess files (or comparable httpd.conf inclusions)?
Have you tried from another host to use the dreaded -r or -m options?
wget version? terminal >
wrt "almost always"...what do the logs say about these events specifically?
It happens with a number of sites and from various machines, so that's why I said "almost always". One time I was able to actually download an entire site when all I wanted was some PDFs, but other than that I get 403/Forbidden errors.
Quote:
Originally Posted by Habitual
You should be able to find 4 events in the logs, 2 for the browser (1 each Success and Fail) and 2 for wget (1 each Success and Fail)
Why would I find a fail entry in the browser logs, and why would I find a success in the wget logs? It works in a browser and always gets a 403/Forbidden using wget.
Quote:
Originally Posted by Habitual
I'm assuming apache error 403 is from apache,so ...
What are the owner:group perms on /path/to/dir1/dir2/pic1.jpg?
What are, if any .htaccess files (or comparable httpd.conf inclusions)?
It is from Apache.
The most recent problem was on my own company's site, but I've had it at other sites too, where I don't have access to the .htaccess files. Obviously the .htaccess files are preventing wget from downloading; I was trying to figure out whether it's a limitation of wget or whether I wasn't using it correctly.
Quote:
Originally Posted by Habitual
Have you tried from another host to use the dreaded -r or -m options?
I have tried from another host, same results.
And I don't dread the -r and -m options, I dread the 403/Forbidden error.
Quote:
Originally Posted by Habitual
wget version? terminal >
Code:
wget --version | head -1
lsb_release -drc
output please. Thanks.
I can do that once I'm back at work. At home I only have access to wget on *cough* Windows *cough*.
I guess the limitation is that wget is going to be stopped by an .htaccess file, regardless of changing the user-agent.
For the site(s) you have control over, the logs will contain the 403 responses and the GET requests from the browser session.
Code:
grep -i mozilla /path/to/httpd.log
or similar.
There is no "browser log" or "wget log".
Both are clients asking apache (server) for the file and hence all requests should be logged in the apache log file.
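To make that concrete, here is a sketch of pulling the two kinds of requests apart by User-Agent in an Apache "combined"-format access log. The log path and both log lines below are fabricated for illustration; a real server's paths, IPs, and UA strings will differ.

```shell
# Two made-up Apache combined-format entries: one browser hit (200)
# and one wget hit (403). Real log location is distro-dependent,
# e.g. /var/log/apache2/access.log or /var/log/httpd/access_log.
cat > /tmp/sample_access.log <<'EOF'
203.0.113.5 - - [01/Jan/2024:12:00:00 +0000] "GET /dir1/dir2/pic1.jpg HTTP/1.1" 200 51234 "-" "Mozilla/5.0 (X11; Linux x86_64)"
203.0.113.5 - - [01/Jan/2024:12:00:05 +0000] "GET /dir1/dir2/pic1.jpg HTTP/1.1" 403 199 "-" "Wget/1.21.3"
EOF

# wget identifies itself in the User-Agent field (unless -U overrides
# it), so its requests can be isolated like this:
grep -i 'wget' /tmp/sample_access.log

# Field 9 in the combined format is the status code; field 7 the path:
awk '{print $7, $9}' /tmp/sample_access.log
```

The point being: both clients hit the same server, so the success and the failure sit side by side in the same file, with the status code telling you which was which.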
Now, as for wget options, this works for me here:
Code:
wget --random-wait -r -p -e robots=off -U mozilla http://www.barclayvouchers.co.jp/images/index/mainvisual.jpg
wrt: "403/Forbidden", this can happen 2 ways that I know of, the /path/to/dir1/dir2/pic1.jpg has permissions that the apache software/daemon doesn't have access to, OR, the robots.txt and/or .htaccess prevents it.
Quote:
Originally Posted by Habitual
There is no "browser log" or "wget log".
Both are clients asking apache (server) for the file and hence all requests should be logged in the apache log file.
That's why I was confused; I was talking about the client side and you were talking about the server side.
Quote:
Originally Posted by Habitual
Now, as for wget options, this works for me here:
Code:
wget --random-wait -r -p -e robots=off -U mozilla http://www.barclayvouchers.co.jp/images/index/mainvisual.jpg
Yeah, that works for me too, but the original problem remains. I can't just use the directory and get the files, I have to specify the file, which defeats the purpose of using wget.
Quote:
Originally Posted by Habitual
wrt: "403/Forbidden", this can happen 2 ways that I know of, the /path/to/dir1/dir2/pic1.jpg has permissions that the apache software/daemon doesn't have access to, OR, the robots.txt and/or .htaccess prevents it.
Okay, that's pretty much what I was looking for: wget being stopped by an .htaccess file. I don't think it's file permissions, since the file is available through a browser or by specifying it directly via wget. Either way, that's my answer; wget is limited.
Sorry, let me clarify; it defeats my purpose for using wget (like I said earlier, if I have to access said files through a browser, using wget does not save steps).
Quote:
Originally Posted by Habitual
The googletubes are chock-full of people who have (probably, for years) wondered why wget doesn't seem to grab arbitrary directories.
In this case (and others when I use it) I know what the directories are, I just was hoping to save time over right-click/save as by using wget.
Quote:
Originally Posted by Habitual
More like the system admin has done a good job of keeping things tight.
Quote:
But that isn't the question - this happens to me on a LOT of websites. Not only my own company's website, but MOST of the time I get a 403/Forbidden error.
Quote:
Yeah, that works for me too, but the original problem remains. I can't just use the directory and get the files, I have to specify the file, which defeats the purpose of using wget.
If you use the -r option, you can very likely point it at an index file and all linked content will be downloaded.
That's what I was trying to find out; was the problem the referrer, or something else?
Thanks for the tip, tho', I'll try it.
Quote:
Originally Posted by mina86
If you use -r option you very likely can point to an index file and all linked content will be downloaded.
"Very likely" - based on actual experience, or the way it's supposed to work? As I mentioned earlier, I've tried both -r and -m (with -r I usually add -l1 as well) and still get the error.
I did find that by loading a whole boatload of options, such as wget -e robots=off -m -r -l3 -np -nd -U Mozilla (site), AND by modifying the .wgetrc file, it will load the entire site, even if you're just hunting for one or two directories. Kind of silly to restrict a directory via an .htaccess file to prevent mass downloads when downloading the entire site still works.
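For reference, the .wgetrc side of that can be sketched like this. The values are illustrative, not the actual settings used in the thread; these directive names come from the GNU Wget startup-file syntax:

```
# ~/.wgetrc (illustrative): defaults applied to every wget run
user_agent = Mozilla/5.0 (X11; Linux x86_64)
robots = off
wait = 1
random_wait = on
```

Putting the user agent and robots setting here keeps the command line short, so per-run flags only need to cover recursion depth and paths.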
I guess that's what I'm asking; will an .htaccess file block wget:
- even if the user agent is forged, and
- even if the files are accessible through a web browser?
Every time I get a 403 error, both of the above conditions are met.
It's not a matter of blocking wget per se; it's a matter of how the files are accessed. If they are requested as part of rendering a web page (that is, the HTTP request carries a Referer pointing at one of the server's own pages), they can be served. The server can likewise be configured to refuse requests that don't arrive that way.
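A hedged sketch of what that implies on the wget side: supply a Referer pointing back at the site, in addition to the forged User-Agent. The URLs are placeholders, and whether this helps depends entirely on how the server's rules are written; the snippet only assembles and prints the command so it can be inspected before running.

```shell
# Assemble a wget invocation that sends browser-like headers
# (forged User-Agent plus a Referer naming the site itself).
# site/target are placeholders, not URLs from this thread.
site="http://www.example.com"
target="$site/images/"

cmd="wget -r -l1 -np -e robots=off -U 'Mozilla/5.0 (X11; Linux x86_64)' --referer='$site/' '$target'"

# Show the command rather than running it, so it can be checked first:
printf '%s\n' "$cmd" | tee /tmp/wget_cmd.txt
```

If the 403 persists even with both headers set, that points back at server-side permissions or an explicit deny rule rather than anything wget can forge its way past.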