redirecting from a subdirectory via apache
Hi
I can't get either the ProxyPass or the RewriteRule directives to fully redirect from a subdirectory of a website to another website. If anyone can work this out, I would be most grateful. In the example below I am trying to redirect from http://www.myserver.com.au:2014/office/ to Google. The redirection occurs, but an attempt to use Google immediately gives a 404 'Not Found' server error. Funnily enough, proxying directly from the root directory (i.e. from http://www.myserver.com.au:2014/, not the subdirectory) works in full, i.e.:
Code:
ProxyPass / http://www.google.com.au/
whereas this fails:
Code:
ProxyPass /office/ http://www.google.com.au/
with errors such as 'File does not exist: /var/www/html/search' and 'File does not exist: /var/www/html/extern_js' (/var/www/html/ being the default Apache root directory, given that none is specified in my httpd.conf entry below, and 'search', 'extern_js' etc. being files related to Google). As indicated below, correctly specifying these directories does not help; files are still reported as 'missing'. Here is the full code from httpd.conf:
Code:
<VirtualHost *:2014>
Anyway, that is my problem for the present.
Compused |
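The httpd.conf excerpt above is truncated in the thread. A minimal VirtualHost along the lines the poster describes might look like the following sketch (the hostnames and port are taken from the post; every other directive here is an assumption, not the poster's actual file):

```apache
<VirtualHost *:2014>
    ServerName www.myserver.com.au

    # Reverse proxy /office/ to the remote site (sketch only).
    # ProxyRequests stays Off: this is a reverse proxy, not a forward proxy.
    ProxyRequests Off
    ProxyPass        /office/ http://www.google.com.au/
    ProxyPassReverse /office/ http://www.google.com.au/
</VirtualHost>
```

Note that ProxyPassReverse only rewrites HTTP response headers such as Location, not URLs embedded in the page body, which is exactly why absolute links in the proxied pages still point at the wrong paths and produce the 404s described above.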
Quote:
Try the following configuration. Code:
<VirtualHost *:2014>
There are usually also copyright issues here: reverse-proxying somebody else's content, editing it, and presenting the end result creates (in most jurisdictions) a derivative work, which is subject (almost everywhere on Earth) to various copyright laws and international agreements. To put it bluntly: don't do it unless you know you have the copyrights covered (because the content is yours, for example), or you get written permission. You are likely breaking the law otherwise.

Above, the FollowSymLinks option and RewriteEngine are both required to allow URL rewriting via mod_rewrite. Within a Directory (or .htaccess) context, one must specify the URL that directory is accessible from, using a RewriteBase directive. This base (and any slashes that might follow it) is omitted from the URL the RewriteRule(s) see, so in a Directory or .htaccess context the URL a RewriteRule sees will never begin with a slash; Apache eats it.

The [P] option for the RewriteRule tells Apache to do a reverse proxy request. It implies [L], since the proxy request is done immediately, so no other rewrite rules are applied. RewriteRule allows much more complex URL mangling, and even setting and modifying GET query variables, so I tend to use it for stuff like this (as opposed to the Proxy directives in general).

The ProxyPassReverse stuff should take care of HTTP headers. It will not modify the content, but if the headers contain references to the reverse-proxied URLs, Apache will do its best to fix those.

This leaves the content. Unfortunately, we still don't have a good module for this. mod_substitute is usually packaged with Apache nowadays, but it parses on a per-line basis, and will therefore not work for certain content. (Why on earth is there no Substitute module that uses custom record separators? Only applying edits between < and > would work wonderfully for a very large number of pages, with quite modest runtime cost.)
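The Directory-context rewrite-plus-proxy setup described above might be sketched as follows (the configuration actually posted is not preserved in the thread, so the paths and rule below are illustrative assumptions):

```apache
<Directory /var/www/html/office>
    # Both are needed for mod_rewrite to work in a Directory context.
    Options FollowSymLinks
    RewriteEngine On

    # The URL prefix this directory is served from; mod_rewrite strips
    # this base (and the leading slash) before the rules below run.
    RewriteBase /office/

    # [P] performs a reverse proxy request and implies [L].
    RewriteRule ^(.*)$ http://www.google.com.au/$1 [P]
</Directory>

# Fix Location and similar headers in proxied responses.
ProxyPassReverse /office/ http://www.google.com.au/
```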
The AddOutputFilterByType directive enables mod_substitute processing for HTML, CSS, and plain text content. The substitution is done at the output phase, i.e. on the content received via the reverse proxy request. Note that if the content is written by a Javascript script, as Google's is, you have practically zero chance of properly editing the URLs in the content. The three substitutions are really just examples, as the best working ones depend on the content you reverse proxy -- non-HTML/CSS/XML static content needs no substitutions, for example. The first two substitutions are of the form:
Code:
"s|^([A-Za-z]+=[\"']?)/+|$1/office/|q"
This matches an HTML attribute name, its equals sign, and an optional opening quote, followed by one or more slashes. If it matches, the slashes are replaced with /office/: any HTML attribute value that begins with a slash gets /office/ injected, with the rest of the attribute intact. The third substitution,
Code:
"s|http://www\.w3\.org/TR/CSS2/*|http://www.myserver.com.au/office/|q"
rewrites absolute w3.org URLs to point back at the proxy. If you happen to have some control over the reverse proxied pages, or you know that they are well-formed HTML, you could try using just
Code:
Substitute "s|(=\"?)/+|$1/office/|q"
Code:
Substitute "s|(=\"?)http://www\.w3\.org/TR/CSS2/*|$1http://www.myserver.com.au/office/|q"
The reason why you want to match attributes starting with a slash is simple: they are practically always absolute URLs. Replacing the initial slash with the base URL of the reverse proxy will fix those. Relative URLs, which start with ., .., or a subdirectory or file name, do not need any fixing; they will work as-is. This leaves only URLs that specify a full URL, starting with the scheme. You can just substitute the reverse-proxied base URL with the base URL the users see, but that will edit plain text too. However, you almost never see links in text preceded by =; that is almost exclusively seen in the HTML source code only.

If you are unfamiliar with the regular expressions used here (by Substitute and RewriteRule, for example), the Wikipedia article is a good start. The variant used is ERE, or POSIX Extended Regular Expressions -- well, more or less; there may be some implementation peculiarities.

I hope this gets you started at least, |
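Since mod_substitute patterns are EREs, the behaviour of the simplified attribute rule above can be sanity-checked from the command line with sed -E standing in for mod_substitute (a rough approximation; the trailing q flag is mod_substitute-specific and has no sed equivalent):

```shell
# An attribute value beginning with a slash gets the /office/ base injected:
echo 'href="/search?q=test"' | sed -E "s|(=\"?)/+|\\1/office/|"
# -> href="/office/search?q=test"

# A relative URL (no slash right after the quote) is left untouched:
echo 'src="images/logo.png"' | sed -E "s|(=\"?)/+|\\1/office/|"
# -> src="images/logo.png"
```

This illustrates the point made above: only root-relative URLs need the base injected; relative ones already resolve correctly through the proxy.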
Thanks ++ Animal for all this work
I don't have mod_substitute, so I will have to go in search of it. I have Apache 2.2. Compused |
Quote:
If you need to install an additional module anyway, you might have better luck with mod_proxy_html (version 3.1 or later). There are RPM packages for the RHEL6 x86_64 and i686 variants, as well as Debian packages. Just make sure you get the 3.1 version, because its configuration has changed drastically compared to earlier versions. |
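For reference, a mod_proxy_html (3.1+) setup for the same goal might look roughly like the hypothetical sketch below; unlike mod_substitute, the module parses the HTML properly rather than doing line-based text substitution. None of this is from the thread itself:

```apache
LoadModule proxy_html_module modules/mod_proxy_html.so

<Location /office/>
    ProxyPass        http://www.google.com.au/
    ProxyPassReverse http://www.google.com.au/

    # Rewrite URLs inside HTML link attributes to stay under /office/.
    ProxyHTMLEnable On
    ProxyHTMLURLMap http://www.google.com.au /office
    ProxyHTMLURLMap / /office/
</Location>
```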
You were right Animal, it was there, just not loaded.
I loaded it. I am getting this page (masked): http://www.w3.org/TR/CSS2/ when attempting to load http://www.myserver.com.au/office. I am not sure if ProxyPassReverse is what I need... maybe just ProxyPass. Compused |
sorry...I got your drift...http://www.w3.org was just being used as an example!
If I use http://www.google.com.au instead of w3.org, I get a 404 error msg: http://www.myserver.com.au/office/index.php ... was not found. There are no useful error msgs in the error log (on debug). I also tried using the actual destination, i.e. the one I really want to use, and encountered a similar problem. Thanks Compused |
Quote:
The reason is, Google uses Javascript heavily even on its front page. If you look at the actual data your browser receives (for example, using wget -O - 'http://www.myserver.com.au/office/') you can see that the Substitute rules are applied -- but the content in a browser is different! This is because the Javascript modifies the HTML page received from the server. The same applies to any web page which generates some or all of its contents using Javascript: you cannot edit the result. (You could try to edit the Javascript, but... no, it's just not worth it.)

The example I showed you worked for me. Note that it does not alias the entire w3.org site, but just the one specification. If you pick a better site, one that does not rely on Javascript for navigation elements, I can rework the example for you.

Note that if you can reverse proxy the site using the same base path -- say, /TR/ for www.w3.org/TR/, or root / for an entire site -- you don't need to rewrite any paths, just the http://hostname/ (and https://hostname/) strings. |
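The same-base-path shortcut mentioned above can be sketched like this, using the thread's w3.org illustration (a hypothetical example, not posted configuration): because the local /TR/ path mirrors the remote one, root-relative links already resolve correctly, and only fully-qualified remote URLs need substituting.

```apache
ProxyPass        /TR/ http://www.w3.org/TR/
ProxyPassReverse /TR/ http://www.w3.org/TR/

# Root-relative links such as /TR/CSS2/ already work as-is;
# only absolute URLs naming the remote host must be rewritten.
AddOutputFilterByType SUBSTITUTE text/html text/css
Substitute "s|http://www\.w3\.org/|http://www.myserver.com.au:2014/|q"
```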
I see...there is no problem redirecting to:
http://www.w3.org/TR/CSS2/
That does work without problems. The site I am trying to proxy to (from www.myserver.com.au) is actually of the form http://www.my2ndserver.com.au:103, and initially you are greeted with a basic Apache security login. Once past that, you find yourself at another login: http://www.my2ndserver.com.au:103/src/login.php. However, with the code above I just keep encountering a 'page not found' error. Error logs sometimes state (the equivalent of) '/var/www/src' not found, where I presume 'src' is taken from ...103/src/login.php. Thanks Compused |
Quote:
Quote:
I guess this is partially my fault too, because the examples I showed need a bit of editing in this case. Try the following configuration; it is almost the same, but the substitutions differ. URL http://www.myserver.com.au/office/ should obtain the same content as URL http://www.my2ndserver.com.au:103/, but with links fixed.
Code:
<VirtualHost *:2014>
It is impractical to develop rules that catch all cases. They tend to catch content too, and it is hard to maintain too many rules. To develop specific rules, I normally grab a few pages using wget, and check the URLs in them, paying extra attention to the quoting style and whitespace (around the equals sign). If the HTML is well-formed, meaning each name=value attribute pair is on the same line and each value is in double quotes, something like the three Substitute rules I listed in this message should suffice.

Usually, however, HTML is not well-formed, and you need further rules to catch the other cases. Note that the issue is technically not catching the string itself, but catching it in the right places. Browsers are quite relaxed about the HTML they accept, which makes reliably catching the right strings complicated. For example, the http://www.w3.org/TR/html401 page splits some href=URL pairs into two separate lines, which makes editing the URL using mod_substitute very hard, unless you do not mind editing the URL even when it occurs in the content too. |
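The replacement VirtualHost above is likewise truncated in the thread. Based on the rule fragments quoted later in the exchange, it plausibly resembled the following sketch (every directive here is a reconstruction and an assumption, not the poster's actual configuration):

```apache
<VirtualHost *:2014>
    ServerName www.myserver.com.au

    ProxyPass        /office/ http://www.my2ndserver.com.au:103/
    ProxyPassReverse /office/ http://www.my2ndserver.com.au:103/

    AddOutputFilterByType SUBSTITUTE text/html text/css text/plain

    # 1. Full remote URLs with a trailing path: point them at /office/.
    Substitute "s|http://www\.my2ndserver\.com\.au:103/+|http://www.myserver.com.au:2014/office/|q"
    # 2. Full remote URLs ending at the port number; the ([^0-9]|$)
    #    guard keeps e.g. :1034 from matching.
    Substitute "s!http://www\.my2ndserver\.com\.au:103([^0-9]|$)!http://www.myserver.com.au:2014/office/$1!q"
    # 3. Root-relative attribute values: inject the /office prefix.
    Substitute "s|(=[\"']?)/+|$1/office/|q"
</VirtualHost>
```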
Thanks again Animal... I get to the Apache login screen, but after logging into this, it's the same '404 not found', and the Apache error message (in the log) seen on previous occasions is there again:
File does not exist: /var/www/src
And the URL where this msg is seen is http://www.myserver.com.au:2014/src/login.php. Can we do anything to 'force' this 'src' directory (present on my2ndserver.com.au) to be 'seen' at the proxied (myserver.com.au) site?

Just in terms of the rules, I wanted to check: the first rule, which captures normal URLs, is: \.com\.au:103/+!$1$2h and the second one, for those that end with a port #, is: m\.au:103([^0-9]|$)!$1$2h -- but both URLs have the 103 port # attached...? My only other comment is: should there be any other source port annotation, given that I am actually using myserver.com.au:2014, not just port 80?

Rgds again Compfused |
Quote:
If I am right, then you can use a hand-written URL to check. If the 404 error page URL is, say, http://www.myserver.com.au:2014/foo/bar.page, then try http://www.myserver.com.au:2014/office/foo/bar.page. The latter should load okay. In fact, any URL of the form http://www.my2ndserver.com.au:103/something... should work from http://www.myserver.com.au:2014/office/something..., except that stylesheets, images, linked scripts et cetera will probably not work (because they have the wrong URLs relative to the http://www.myserver.com.au:2014/office/something... URL).

Could you post here, or send me via e-mail, the output of wget -O - http://www.myserver.com.au:2014/src/login.php ? Actually, since only the HTML tags with attributes possibly containing URLs matter, you could filter out most of the uninteresting stuff with e.g. GNU awk:
Code:
wget -O - 'http://www.myserver.com.au:2014/src/login.php' | gawk 'BEGIN { IGNORECASE=1 ; RS="<(a|form|input|link|base)[\t\n\v\f\r ][^>]*>" } { out=RT; gsub(/pattern1/, "replacement1", out); gsub(/pattern2/, "replacement2", out); print out }' > source.txt
It is very important to retain the exact whitespace, so that the Substitute rules can be properly targeted. So please do not just copy-and-paste the result, or edit it with a text editor that normalizes whitespace; either attach the source.txt file here, or send it to me via e-mail.
Quote:
Quote:
The second is needed to catch URLs that contain the full hostname of the other server. Since we do not want links to point to that one, we must change the hostname and initial path to point to the server the users see. The messy thing is that we only want to match when the port matches too; after all, other ports have other content, and we are not reverse-proxying those. (The trick is with e.g. http://www.my2ndserver.com.au:1034 URLs; we do not want to match those.)

The last rule is to catch URLs that are relative to the root of the other server; in that case, we must inject the path where we start proxying, i.e. /office, at the beginning of the path. For example, we want "/src/login.php" to be changed into "/office/src/login.php", because only the latter URL will reverse-proxy the content correctly from the other server.

Note: If you were to use the same directory structure, i.e. proxy the entire server -- starting at root / and not at /office/ -- or at least specific sub-trees of it (/src, and so on) using the same paths, you would not need the second type of Substitute rules at all. A single
Code:
Substitute "s!http://www\.my2ndserver\.com\.au:103([^0-9]|$)!http://www.myserver.com.au:2014$1!q"
would suffice.
Quote:
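The :103([^0-9]|$) port guard in the rule above can again be checked with sed -E standing in for mod_substitute (a hypothetical command-line check, not part of the thread):

```shell
# :103 followed by a non-digit is rewritten...
echo 'http://www.my2ndserver.com.au:103/x' \
  | sed -E 's!http://www\.my2ndserver\.com\.au:103([^0-9]|$)!http://www.myserver.com.au:2014\1!'
# -> http://www.myserver.com.au:2014/x

# ...but :1034 is a different port and is left alone:
echo 'http://www.my2ndserver.com.au:1034/x' \
  | sed -E 's!http://www\.my2ndserver\.com\.au:103([^0-9]|$)!http://www.myserver.com.au:2014\1!'
# -> http://www.my2ndserver.com.au:1034/x
```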
I do make errors, and it is very encouraging to see you are looking at my suggestions critically, not just blindly heeding them. |
1 Attachment(s)
Quote:
I had to use: Code:
wget --http-user=myusername --http-password=whatever -O - http://www.myserver.com.au:2014/office/src/login.php Code:
<HTML><HEAD><TITLE>My 2ndserver - Login</TITLE></HEAD> Compfused |
Quote:
Code:
Substitute "s|(=[\"']?)http://www\.my2ndserver\.com\.au:103/+|$1http://www.myserver.com.au:2014/office/|q"
Could you please show the output for
Code:
wget --post-data 'username=username&secretkey=password&just_logged_in=1' --http-user=myusername --http-password=whatever -O - http://www.myserver.com.au:2014/office/src/redirect.php
Code:
wget --post-data 'username=username&secretkey=password&just_logged_in=1' --http-user=myusername --http-password=whatever -O - http://www.my2ndserver.com.au:103/src/redirect.php
Personally, I suspect both have Location: /src/... (which Apache is unable to fix), or an equivalent meta element. If you were to use only the three Substitute rules I list in this post, it should catch even the meta element, as well as all links, and edit them correctly. Fixing the Location: header needs the mod_headers module, and
Code:
Header edit Location ^http://www\.my2ndserver\.com\.au:103$ http://www.myserver.com.au:2014/office/
You might just try these modifications, and see if that fixes the problem. (It is still important to verify that the meta element and the links are correctly edited when retrieved through myserver; if not, the Substitute rules need further editing.) |
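Putting the header fix together, a sketch might look like the following (the module path, the capture group, and the use of the 'always' condition are assumptions; 'always' makes the edit apply to redirect responses too, which is where Location headers typically appear):

```apache
LoadModule headers_module modules/mod_headers.so

# Rewrite redirect targets that point at the backend so the client
# stays on the proxy; body content is handled by the Substitute rules.
Header always edit Location ^http://www\.my2ndserver\.com\.au:103/(.*)$ http://www.myserver.com.au:2014/office/$1
```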
:hattip: Wow. :hattip:
Now that kind of an informative exchange is exactly what we come to LinuxQuestions for. |
Thanks again Animal, your persistence is greatly appreciated. This information doesn't seem to be available anywhere on the net. As a shortcut, I tried adding the
Code:
Header always edit Location:
line, then running:
Code:
wget --post-data 'username=username&secretkey=password&just_logged_in=1' --http-user=myusername --http-password=whatever -O - http://www.myserver.com.au:2014/office/src/redirect.php
Quote:
Code:
wget --post-data 'username=username&secretkey=password&just_logged_in=1' --http-user=myusername --http-password=whatever -O - http://www.my2ndserver.com.au:103/src/redirect.php Quote:
I tried running the above wget command on mindex.php, but there was no output. This is the contents of mindex.php... I feel this file can be problematic at times, as I sometimes get a corrupted screen when using the program with mindex.php listed in the address bar:
Code:
cat ./mindex.php
Regards Compfused |