Old 01-17-2013, 08:47 PM   #1
tensigh
Member
 
Registered: Mar 2004
Location: Tokyo, Japan
Distribution: Backtrack 5 R3
Posts: 145

Rep: Reputation: 15
wget - using --user-agent option still results in 403/forbidden error


I'm hoping someone can tell me what I might be doing wrong.

A number of websites have directories or graphics I want to download. I can almost always access them through a browser or by directly downloading the file using wget:

Code:
wget http://www.somesite.com/dir1/dir2/pic1.jpg
But if I try to use -r or -m, I get the dreaded "403/Forbidden" error, despite being able to open the file in my browser.

I've tried many combinations of the -U option (-U firefox, -U Mozilla, -U "Mozilla, platform, blah blah") and they NEVER work.
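
For example, a typical failing attempt looks something like this (somesite.com is a placeholder, and the exact UA string is just one of many I've tried):

Code:
wget -r -l1 -np -U "Mozilla/5.0 (X11; Linux x86_64)" http://www.somesite.com/dir1/dir2/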

Is there something else I can do? Most of the time when I Google this issue, the solutions stop with forging a user agent. That never seems to work for me.

What am I doing wrong?
 
Old 01-18-2013, 06:32 AM   #2
mina86
Member
 
Registered: Aug 2008
Distribution: Debian
Posts: 517

Rep: Reputation: 229
What's the site in question?
 
Old 01-18-2013, 09:06 AM   #3
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Can you browse to http://www.somesite.com/dir1/dir2/pic1.jpg traditionally?
 
Old 01-19-2013, 12:41 AM   #4
tensigh
Member
 
Registered: Mar 2004
Location: Tokyo, Japan
Distribution: Backtrack 5 R3
Posts: 145

Original Poster
Rep: Reputation: 15
I can see the page/graphics through a browser.

This is the company I work for, and like I said in my original post, I can get the link through a browser:

http://www.barclayvouchers.co.jp/images/index/mainvisual.jpg

Naturally, if I wget the graphic directly, it also works, but that defeats the purpose of using wget (addressing the filename specifically).

But that isn't the question; this happens to me on a LOT of websites, not only my own company's. MOST of the time I get a 403/Forbidden error.

So the original question remains: am I doing something wrong with wget? I've tried using -m and -U with all types of descriptions after -U and they never work; I always end up at 403/Forbidden.
 
Old 01-19-2013, 08:53 AM   #5
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Quote:
I can almost always access them through a browser or by directly downloading the file using wget
wrt "almost always"...what do the logs say about these events specifically?

You should be able to find 4 events in the logs, 2 for the browser (1 each Success and Fail) and 2 for wget (1 each Success and Fail)

Again, all 4 details should be in the logs.

I'm assuming the 403 error is from Apache, so...
What are the owner:group perms on /path/to/dir1/dir2/pic1.jpg?
What, if any, .htaccess files (or comparable httpd.conf inclusions) are involved?
Have you tried the dreaded -r or -m options from another host?
wget version? In a terminal:
Code:
wget --version | head -1
lsb_release -drc
output please. Thanks.
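
For the perms and .htaccess questions above, something like this on the server would show what we need (the paths are the placeholders from your post):
Code:
ls -l /path/to/dir1/dir2/pic1.jpg
find /path/to -name .htaccess -ls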

Please let us know.

Last edited by Habitual; 01-19-2013 at 08:54 AM.
 
Old 01-19-2013, 01:33 PM   #6
NyteOwl
Member
 
Registered: Aug 2008
Location: Nova Scotia, Canada
Distribution: Slackware, OpenBSD, others periodically
Posts: 512

Rep: Reputation: 139
It is likely the htaccess file is set up to prevent document/image "leeching" by direct download.
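
By default wget sends no Referer header at all, which is what that kind of protection keys on. If you want to see the raw request wget makes, the debug flag shows it; something like this (using the placeholder URL from the first post):
Code:
wget -d http://www.somesite.com/dir1/dir2/pic1.jpg 2>&1 | sed -n '/request begin/,/request end/p'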
 
Old 01-19-2013, 04:14 PM   #7
tensigh
Member
 
Registered: Mar 2004
Location: Tokyo, Japan
Distribution: Backtrack 5 R3
Posts: 145

Original Poster
Rep: Reputation: 15
.htaccess file

@NyteOwl:

I guess that's what I'm asking; will a .htaccess file block wget:

- even if the user agent is forged, and
- even if the files are accessible through a web browser?

Every time I get a 403 error, both of the above conditions are met.
 
Old 01-19-2013, 04:25 PM   #8
tensigh
Member
 
Registered: Mar 2004
Location: Tokyo, Japan
Distribution: Backtrack 5 R3
Posts: 145

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by Habitual View Post
wrt "almost always"...what do the logs say about these events specifically?
It happens with a number of sites and from various machines, which is why I said "almost always". One time I was actually able to download an entire site when all I wanted was some PDFs, but other than that I get 403/Forbidden errors.

Quote:
Originally Posted by Habitual View Post
You should be able to find 4 events in the logs, 2 for the browser (1 each Success and Fail) and 2 for wget (1 each Success and Fail)
Why would I find a fail entry in the browser logs, and why would I find a success in the wget logs? It works in a browser but always gets a 403/Forbidden using wget.

Quote:
Originally Posted by Habitual View Post
I'm assuming apache error 403 is from apache,so ...
What are the owner:group perms on /path/to/dir1/dir2/pic1.jpg?
What are, if any .htaccess files (or comparable httpd.conf inclusions)?
It is from Apache.
The most recent problem I've had is on my own company's site, but I've had it at other sites where I don't have access to the .htaccess files. Obviously, the .htaccess files are preventing wget from downloading; I was trying to figure out whether it's a limitation of wget or whether I wasn't using it correctly.

Quote:
Originally Posted by Habitual View Post
Have you tried from another host to use the dreaded -r or -m options?
I have tried from another host; same results.
And I don't dread the -r and -m options; I dread the 403/Forbidden error.

Quote:
Originally Posted by Habitual View Post
wget version? terminal >
Code:
wget --version | head -1
lsb_release -drc
output please. Thanks.
I can do that once I'm back at work. At home I only have access to wget on *cough* Windows *cough*.

I guess the limitation is that wget is going to be stopped by a .htaccess file, regardless of changing the user-agent.
 
Old 01-19-2013, 06:28 PM   #9
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
For the site(s) that you have control over, you will have references to the 403 "error" and the GET statement from the browser session.
Code:
grep -i mozilla /path/to/httpd.log
or similar.

There are no "browser logs" or "wget logs".
Both are clients asking Apache (the server) for the file, and hence all requests should be logged in the Apache log file.

Now, as for wget options, this works for me here:
Code:
wget --random-wait -r -p -e robots=off -U mozilla http://www.barclayvouchers.co.jp/images/index/mainvisual.jpg
wrt: "403/Forbidden", this can happen 2 ways that I know of, the /path/to/dir1/dir2/pic1.jpg has permissions that the apache software/daemon doesn't have access to, OR, the robots.txt and/or .htaccess prevents it.

I hope this helps.
 
Old 01-20-2013, 12:13 AM   #10
tensigh
Member
 
Registered: Mar 2004
Location: Tokyo, Japan
Distribution: Backtrack 5 R3
Posts: 145

Original Poster
Rep: Reputation: 15
You answered my question, thanks.

Quote:
Originally Posted by Habitual View Post
There is no "browser logs" or "wget log".
Both are clients asking apache (server) for the file and hence all requests should be logged in the apache log file.
That's why I was confused; I was talking about the client side and you were talking about the server side.

Quote:
Originally Posted by Habitual View Post
Now, as for wget options, this works for me here:
Code:
wget --random-wait -r -p -e robots=off -U mozilla http://www.barclayvouchers.co.jp/images/index/mainvisual.jpg
Yeah, that works for me too, but the original problem remains. I can't just use the directory and get the files; I have to specify the file, which defeats the purpose of using wget.


Quote:
Originally Posted by Habitual View Post
wrt: "403/Forbidden", this can happen 2 ways that I know of, the /path/to/dir1/dir2/pic1.jpg has permissions that the apache software/daemon doesn't have access to, OR, the robots.txt and/or .htaccess prevents it.
Okay, that's pretty much what I was looking for: the fact that wget is stopped by a .htaccess file. I don't think it's file permissions, since the file is available through a browser or by specifying it directly via wget. Either way, that's my answer; wget is limited.

Quote:
Originally Posted by Habitual View Post
I hope this helps.
It did. Thanks for your answers.
 
Old 01-20-2013, 09:56 AM   #11
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
Quote:
I have to specify the file, which defeats the purpose of using wget.
I beg to differ:
Code:
Wget - The non-interactive network downloader.
That implies you know what you are asking the server for, and is therefore non-interactive.

The googletubes are chock-full of people who have (probably, for years) wondered why wget doesn't seem to grab arbitrary directories.

Quote:
I don't think it's file permissions since it's available through a browser or specifying directly via wget.
You'd be correct on the perms...Good Eye!

Quote:
Either way, that's my answer; wget is limited.
More like the system admin has done a good job of keeping things tight.
 
Old 01-20-2013, 02:09 PM   #12
tensigh
Member
 
Registered: Mar 2004
Location: Tokyo, Japan
Distribution: Backtrack 5 R3
Posts: 145

Original Poster
Rep: Reputation: 15
Good points

Quote:
Originally Posted by Habitual View Post
I beg to differ:
Sorry, let me clarify; it defeats my purpose for using wget (like I said earlier, if I have to access said files through a browser, using wget does not save steps).

Quote:
Originally Posted by Habitual View Post
The googletubes are chock-full of people who have (probably, for years) wondered why wget doesn't seem to grab arbitrary directories.
In this case (and others when I use it) I know what the directories are; I was just hoping to save time over right-click/save-as by using wget.


Quote:
Originally Posted by Habitual View Post
More like the system admin has done a good job of keeping things tight.
True dat.
 
Old 01-21-2013, 04:35 AM   #13
mina86
Member
 
Registered: Aug 2008
Distribution: Debian
Posts: 517

Rep: Reputation: 229
Quote:
Originally Posted by tensigh View Post
But that isn't the question; this happens to me on a LOT of websites, not only my own company's. MOST of the time I get a 403/Forbidden error.
Name one.

If the problem is the referrer, try
Code:
wget -e robots=off --referer=http://example.com/ \
    -U 'Opera/9.80 (X11; Linux x86_64) Presto/2.12.388 Version/12.12' \
    http://example.com/file
Quote:
Originally Posted by tensigh View Post
Yeah, that works for me too, but the original problem remains. I can't just use the directory and get the files, I have to specify the file, which defeats the purpose of using wget.
If you use the -r option, you can very likely point to an index file and all linked content will be downloaded.
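
For example, something along these lines (example.com as before):
Code:
wget -r -l1 -np -e robots=off \
    -U 'Opera/9.80 (X11; Linux x86_64) Presto/2.12.388 Version/12.12' \
    http://example.com/dir1/dir2/index.html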

Last edited by mina86; 01-21-2013 at 04:38 AM.
 
Old 01-21-2013, 06:35 AM   #14
tensigh
Member
 
Registered: Mar 2004
Location: Tokyo, Japan
Distribution: Backtrack 5 R3
Posts: 145

Original Poster
Rep: Reputation: 15
One step ahead of you.

Quote:
Originally Posted by mina86 View Post
Name one.
Already did. But as I said earlier, it happens on multiple sites.


Quote:
Originally Posted by mina86 View Post
If the problem is referrer, try
Code:
wget -e robots=off --referer=http://example.com/ \
    -U 'Opera/9.80 (X11; Linux x86_64) Presto/2.12.388 Version/12.12' \
    http://example.com/file
That's what I was trying to find out: was the problem the referrer, or something else?
Thanks for the tip, tho', I'll try it.

Quote:
Originally Posted by mina86 View Post
If you use -r option you very likely can point to an index file and all linked content will be downloaded.
"Very likely" - based on actual experience, or the way it's supposed to work? As I mentioned earlier, I've tried both -r and -m (with -r I usually add -l1 as well) and still get the error.

I did find that by loading a whole boatload of options, such as wget -e robots=off -m -r -l3 -np -nd -U Mozilla (site), AND by modifying the .wgetrc file, it will load the entire site, even if you're just hunting for one or two directories. Kind of silly to restrict a directory via .htaccess to prevent mass downloads when downloading the entire site still works.
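
For reference, the .wgetrc changes were along these lines (robots and user_agent are standard wgetrc settings; the exact values are just what I happened to use):
Code:
# ~/.wgetrc
robots = off
user_agent = Mozilla/5.0 (X11; Linux x86_64)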

Thanks for the tips.
 
Old 01-21-2013, 12:48 PM   #15
NyteOwl
Member
 
Registered: Aug 2008
Location: Nova Scotia, Canada
Distribution: Slackware, OpenBSD, others periodically
Posts: 512

Rep: Reputation: 139
Quote:
Originally Posted by tensigh View Post
@NyteOwl:

I guess that's what I'm asking; will a .htaccess file block wget:

- even if the user agent is forged, and
- even if the files are accessible through a web browser?

Every time I get a 403 error, both of the above conditions are met.
It's not a matter of blocking wget per se; it's a matter of how the files are accessed. If they are requested as part of a web page served from the site itself (i.e., with a matching Referer), they can be displayed; you can refuse to serve them when they are not accessed that way.

For example:

Code:
# Refuse image requests whose Referer is set but does not point at domain.tld
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://domain\.tld(/.*)?$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www\.domain\.tld(/.*)?$ [NC]
RewriteRule \.(jpg|jpeg|gif|png|bmp|tif|tiff)$ - [F,NC,L]
Any image request whose referer is set but doesn't come through domain.tld will return a 403 error.
 
  

