LinuxQuestions.org


tensigh 01-17-2013 08:47 PM

wget - using --user-agent option still results in 403/forbidden error
 
I'm hoping someone can tell me what I might be doing wrong.

A number of websites have directories or graphics I want to download. I can almost always access them through a browser or by directly downloading the file using wget:

wget http://www.somesite.com/dir1/dir2/pic1.jpg

But if I try to use -r or -m, I get the dreaded "403/Forbidden" error, despite being able to open the file in my browser.

I've tried many combinations of the -U option: -U firefox, -U Mozilla, -U "Mozilla, platform, blah blah", and they NEVER work.

Is there something else I can do? Most of the time when I Google this issue, the solutions stop with forging a user agent. That never seems to work for me.

What am I doing wrong?
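
For reference, the kind of recursive command being described is presumably something along these lines (the host and path are the placeholders from this post, and the User-Agent string is just one example of a forged browser string):
Code:

# hypothetical example of a recursive fetch with a forged browser User-Agent
wget -r -l1 -np -U "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20100101 Firefox/17.0" http://www.somesite.com/dir1/dir2/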

mina86 01-18-2013 06:32 AM

What's the site in question?

Habitual 01-18-2013 09:06 AM

Can you browse to http://www.somesite.com/dir1/dir2/pic1.jpg traditionally?

tensigh 01-19-2013 12:41 AM

I can see the page/graphics through a browser.
 
This is the company I work for, and like I said in my original post, I can get the link through a browser:

http://www.barclayvouchers.co.jp/images/index/mainvisual.jpg

Naturally, if I wget the graphic directly it also works, but that defeats the purpose of using wget (having to address each filename specifically).

But that isn't the question - this happens to me on a LOT of websites, not only my own company's. MOST of the time I get a 403/Forbidden error.

So the original question remains: am I doing something wrong with wget? I've tried using -m and -U with all types of descriptions after -U and they never work; I always end up at 403/Forbidden.

Habitual 01-19-2013 08:53 AM

Quote:

I can almost always access them through a browser or by directly downloading the file using wget
wrt "almost always"...what do the logs say about these events specifically?

You should be able to find 4 events in the logs, 2 for the browser (1 each Success and Fail) and 2 for wget (1 each Success and Fail)

Again, all 4 details should be in the logs.

I'm assuming the 403 error is from Apache, so ...
What are the owner:group perms on /path/to/dir1/dir2/pic1.jpg?
Are there any .htaccess files (or comparable httpd.conf inclusions)? (a sketch of these checks follows this post)
Have you tried the dreaded -r or -m options from another host?
wget version? terminal >
Code:

wget --version | head -1
lsb_release -drc

output please. Thanks.

Please let us know.
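
For reference, the server-side checks asked about above might look something like this (all paths are placeholders taken from the post, this assumes shell access to the web server, and the access-log location varies by distribution):
Code:

# ownership and permissions of the requested file
ls -l /path/to/dir1/dir2/pic1.jpg
# any .htaccess files in the directory that could add access rules
ls -la /path/to/dir1/dir2/ | grep -i htaccess
# requests for that file in the Apache access log, with their status codes
grep 'pic1.jpg' /var/log/apache2/access.log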

NyteOwl 01-19-2013 01:33 PM

It is likely the .htaccess file is set up to prevent document/image "leeching" by direct download.
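
For context, an .htaccess file can key on more than the User-Agent string. Rules that match only the User-Agent can be bypassed with -U, but referrer-based "anti-leech" rules (like the example later in this thread) cannot. A minimal sketch of User-Agent blocking, using Apache 2.2-style directives with made-up agent patterns:
Code:

# deny requests whose User-Agent matches common download tools
SetEnvIfNoCase User-Agent "wget" block_agent
SetEnvIfNoCase User-Agent "curl" block_agent
Order Allow,Deny
Allow from all
Deny from env=block_agent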

tensigh 01-19-2013 04:14 PM

.htaccess file
 
@NyteOwl:

I guess that's what I'm asking; will an .htaccess file block wget:

- even if the user agent is forged, and
- even if the files are accessible through a web browser?

Every time I get a 403 error, both of the above conditions are met.

tensigh 01-19-2013 04:25 PM

Quote:

Originally Posted by Habitual (Post 4873469)
wrt "almost always"...what do the logs say about these events specifically?

It happens with a number of sites and from various machines, so that's why I said "almost always". One time I was able to actually download an entire site when all I wanted was some PDFs, but other than that I get 403/Forbidden errors. :)

Quote:

Originally Posted by Habitual (Post 4873469)
You should be able to find 4 events in the logs, 2 for the browser (1 each Success and Fail) and 2 for wget (1 each Success and Fail)

Why would I find a fail entry in the browser logs, and why would I find a success in the wget logs? It works in a browser and always gets a 403/Forbidden using wget.

Quote:

Originally Posted by Habitual (Post 4873469)
I'm assuming the 403 error is from Apache, so ...
What are the owner:group perms on /path/to/dir1/dir2/pic1.jpg?
Are there any .htaccess files (or comparable httpd.conf inclusions)?

It is from Apache.
The most recent problem I've had is on my own company's site, but I've also had it at other sites, where I don't have access to the .htaccess files. Obviously the .htaccess files are preventing wget from downloading; I was trying to figure out whether that's a limitation of wget or whether I wasn't using it correctly.

Quote:

Originally Posted by Habitual (Post 4873469)
Have you tried the dreaded -r or -m options from another host?

I have tried from another host, same results.
And I don't dread the -r and -m options, I dread the 403/Forbidden error. :)

Quote:

Originally Posted by Habitual (Post 4873469)
wget version? terminal >
Code:

wget --version | head -1
lsb_release -drc

output please. Thanks.

I can do that once I'm back at work. At home I only have access to wget on *cough* Windows *cough*.

I guess the limitation is that wget is going to be stopped by an .htaccess file, regardless of changing the user-agent.
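
A quick way to narrow down where the 403 actually comes from during a recursive run is to request, with the server's response headers shown, both the file and the directory URL itself, since the directory page is one of the things -r typically has to fetch (host and path are the placeholders from the first post):
Code:

# direct file request - reported in this thread to work
wget -S --spider http://www.somesite.com/dir1/dir2/pic1.jpg
# directory request - a 403 here usually means index/listing access is denied,
# which stops recursion regardless of the User-Agent string
wget -S --spider http://www.somesite.com/dir1/dir2/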

Habitual 01-19-2013 06:28 PM

For the site(s) that you have control over, the logs will have references to the 403 "error" and to the GET request from the browser session.
Code:

grep -i mozilla /path/to/httpd.log
or similar.

There is no "browser log" or "wget log".
Both are clients asking apache (the server) for the file, and hence all requests should be logged in the apache log file.

Now, as for wget options, this works for me here:
Code:

wget --random-wait -r -p -e robots=off -U mozilla http://www.barclayvouchers.co.jp/images/index/mainvisual.jpg
wrt: "403/Forbidden", this can happen 2 ways that I know of, the /path/to/dir1/dir2/pic1.jpg has permissions that the apache software/daemon doesn't have access to, OR, the robots.txt and/or .htaccess prevents it.

I hope this helps.

tensigh 01-20-2013 12:13 AM

You answered my question, thanks.
 
Quote:

Originally Posted by Habitual (Post 4873781)
There is no "browser log" or "wget log".
Both are clients asking apache (the server) for the file, and hence all requests should be logged in the apache log file.

That's why I was confused; I was talking about the client side and you were talking about the server side.

Quote:

Originally Posted by Habitual (Post 4873781)
Now, as for wget options, this works for me here:
Code:

wget --random-wait -r -p -e robots=off -U mozilla http://www.barclayvouchers.co.jp/images/index/mainvisual.jpg

Yeah, that works for me too, but the original problem remains. I can't just point at the directory and get the files; I have to specify each file, which defeats the purpose of using wget.


Quote:

Originally Posted by Habitual (Post 4873781)
wrt: "403/Forbidden", this can happen 2 ways that I know of, the /path/to/dir1/dir2/pic1.jpg has permissions that the apache software/daemon doesn't have access to, OR, the robots.txt and/or .htaccess prevents it.

Okay, that's pretty much what I was looking for: the fact that wget can be stopped by an .htaccess file. I don't think it's file permissions, since the file is available through a browser or by specifying it directly via wget. Either way, that's my answer: wget is limited.

Quote:

Originally Posted by Habitual (Post 4873781)
I hope this helps.

It did. Thanks for your answers.

Habitual 01-20-2013 09:56 AM

Quote:

I have to specify each file, which defeats the purpose of using wget.
I beg to differ:
Code:

Wget - The non-interactive network downloader.
That implies you know what you are asking the server for, hence non-interactive.

The googletubes are chock-full of people who have (probably, for years) wondered why wget doesn't seem to grab arbitrary directories.

Quote:

I don't think it's file permissions, since the file is available through a browser or by specifying it directly via wget.
You'd be correct on the perms...Good Eye!

Quote:

Either way, that's my answer: wget is limited.
More like the system admin has done a good job of keeping things tight.

tensigh 01-20-2013 02:09 PM

Good points
 
Quote:

Originally Posted by Habitual (Post 4874115)
I beg to differ:

Sorry, let me clarify; it defeats my purpose for using wget (like I said earlier, if I have to access said files through a browser, using wget does not save steps).

Quote:

Originally Posted by Habitual (Post 4874115)
The googletubes are chock-full of people who have (probably, for years) wondered why wget doesn't seem to grab arbitrary directories.

In this case (and others when I use it) I know what the directories are; I was just hoping to save time over right-click/Save As by using wget.


Quote:

Originally Posted by Habitual (Post 4874115)
More like the system admin has done a good job of keeping things tight.

True dat.

mina86 01-21-2013 04:35 AM

Quote:

Originally Posted by tensigh (Post 4873317)
But that isn't the question - this happens to me on a LOT of websites, not only my own company's. MOST of the time I get a 403/Forbidden error.

Name one.

If the problem is referrer, try
Code:

wget -e robots=off --referer=http://example.com/ \
    -U 'Opera/9.80 (X11; Linux x86_64) Presto/2.12.388 Version/12.12' \
    http://example.com/file

Quote:

Originally Posted by tensigh (Post 4873888)
Yeah, that works for me too, but the original problem remains. I can't just point at the directory and get the files; I have to specify each file, which defeats the purpose of using wget.

If you use the -r option, you can very likely point it at an index file and all linked content will be downloaded.
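
A sketch of that approach (the host and directory are placeholders; this only works if the URL returns an index page or some other HTML that links to the files):
Code:

# follow links from the index one level down, keep only images,
# stay below the starting directory, and don't recreate the directory tree
wget -r -l1 -np -nd -A jpg,jpeg,png -e robots=off -U Mozilla http://www.somesite.com/dir1/dir2/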

tensigh 01-21-2013 06:35 AM

One step ahead of you.
 
Quote:

Originally Posted by mina86 (Post 4874569)
Name one.

Already did. But as I said earlier, it happens on multiple sites.


Quote:

Originally Posted by mina86 (Post 4874569)
If the problem is referrer, try
Code:

wget -e robots=off --referer=http://example.com/ \
    -U 'Opera/9.80 (X11; Linux x86_64) Presto/2.12.388 Version/12.12' \
    http://example.com/file


That's what I was trying to find out: was the problem the referrer, or something else?
Thanks for the tip, tho'; I'll try it.

Quote:

Originally Posted by mina86 (Post 4874569)
If you use the -r option, you can very likely point it at an index file and all linked content will be downloaded.

"Very likely" - based on actual experience, or the way it's supposed to work? As I mentioned earlier, I've tried both -r and -m (with -r I usually add -l1 as well) and still get the error.

I did find that by loading a whole boatload of options, such as wget -e robots=off -m -r -l3 -np -nd -U Mozilla (site), AND by modifying the .wgetrc file, it will download the entire site, even if you're just hunting for one or two directories. Kind of silly to restrict a directory via an .htaccess file to prevent people from doing mass downloads when downloading the entire site works. :)
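
For what it's worth, the .wgetrc side of that setup is presumably something along these lines (the values below are guesses for illustration, not the poster's actual file):
Code:

# ~/.wgetrc -- guessed settings matching the options mentioned above
robots = off
user_agent = Mozilla/5.0 (X11; Linux x86_64)
reclevel = 3
wait = 1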

Thanks for the tips.

NyteOwl 01-21-2013 12:48 PM

Quote:

Originally Posted by tensigh (Post 4873719)
@NyteOwl:

I guess that's what I'm asking; will an .htaccess file block wget:

- even if the user agent is forged, and
- even if the files are accessible through a web browser?

Every time I get a 403 error, both of the above conditions are met.

It's not a matter of blocking wget per se; it's a matter of how the files are accessed. If they are requested as part of a web page served by the site itself (i.e. the request carries a referrer from the site), they can be displayed. You can also prevent them from being served when they are not accessed that way.

For example:

Code:

# return 403 (F) for image requests whose Referer is set but does not
# come from domain.tld (anti-leeching / hotlink protection)
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^https?://.*$
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://domain.tld/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://domain.tld$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www.domain.tld/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www.domain.tld$ [NC]
RewriteRule .*\.(jpg|jpeg|gif|png|bmp|tif|tiff)$ - [F,NC,L]

Any request that doesn't come through domain.tld will return a 403 error.

