LinuxQuestions.org


treadwm 01-15-2013 09:26 AM

Routing Problems w/ Oracle linux-Exadata (Solved)
 
Thought I'd share a routing problem (and its resolution) we had with Oracle Exadata. For those who aren't familiar with it, Exadata is an Oracle "managed" solution: basically, a half rack of storage and Linux servers optimized to run Oracle Database. Ours runs Oracle's Unbreakable Enterprise Kernel.

After a software install, we were unable to connect to the DB on this server. Connections are made to a VIP, which points to a bonded interface on the server. At first, none of the VIP addresses worked but the direct IP did. After a reboot of the servers, even that stopped working.

tcpdump showed traffic reaching the interface but no response. No rejection, just... dropped. After more troubleshooting we discovered a new error in the messages log:

messages.1:Jan 9 16:59:14 exadb01 kernel: martian source 10.22.102.24 from 172.22.22.90, on dev bondeth0
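
For anyone hunting for the same message in their own logs, here is a minimal shell sketch that pulls the fields out of such lines. The function name parse_martians is made up for illustration, and the pattern assumes exactly the message format shown above, where the first address is the packet's destination and the second is its source.

```shell
# Hypothetical helper: read log lines on stdin and print
# "<destination> <source> <device>" for each martian-source entry.
# Assumes the kernel message format shown in the log line above.
parse_martians() {
    sed -n 's/.*martian source \([0-9.]*\) from \([0-9.]*\), on dev \([^ ]*\).*/\1 \2 \3/p'
}

# Example (the log path is an assumption; adjust for your system):
#   parse_martians < /var/log/messages | sort | uniq -c
```

Note that the kernel only logs these when the log_martians sysctl is enabled (for example net.ipv4.conf.all.log_martians), so an empty result does not prove that nothing is being dropped.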

Turns out that "martian source" is a cute way of saying the packet is unroutable: the kernel doesn't believe a packet with that source address should be arriving on that interface, so it drops it. Interesting. So, the packets from the .90 server are hitting the primary IP (.24) and being considered unroutable. Why? The default routes look good.

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.22.100.0     *               255.255.254.0   U     0      0        0 eth0
10.22.102.0     *               255.255.254.0   U     0      0        0 bondeth0
172.22.10.0     *               255.255.254.0   U     0      0        0 eth2
default         10.22.102.1     0.0.0.0         UG    0      0        0 bondeth0

The answer is that Oracle enables policy-based routing (PBR) by default on their kernels. Even if you don't make use of it, it must be properly configured. PBR requires a rule-* and a route-* file per interface. When Oracle set up the server, they did that. However, interfaces were changed and added over time, and no one was aware of the additional configuration requirements.

The key point here is that it 'may' work for a while without the rule/route files, but you'll never know when it will fail. It took us 3 days and the 'right' Oracle tech to fix. Once we added the files and restarted networking, it worked perfectly.
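
A quick way to spot the gap is to list every configured interface that lacks a rule-* or route-* file. The sketch below is only that, a sketch: the function name missing_pbr_files is hypothetical, and the default directory /etc/sysconfig/network-scripts is an assumption based on where Red Hat-style systems keep their ifcfg-* files. It also makes no attempt to tell apart bond members or unaddressed interfaces, which may legitimately have no files.

```shell
# Hypothetical helper: print the name of every rule-<if>/route-<if> file
# that is missing for an interface that has an ifcfg-<if> file.
# $1 = config directory (the default below is an assumption).
missing_pbr_files() {
    dir=${1:-/etc/sysconfig/network-scripts}
    for cfg in "$dir"/ifcfg-*; do
        [ -e "$cfg" ] || continue          # glob matched nothing
        ifname=${cfg##*/ifcfg-}            # strip path and "ifcfg-" prefix
        [ "$ifname" = "lo" ] && continue   # loopback needs no policy routing
        for kind in rule route; do
            [ -e "$dir/$kind-$ifname" ] || echo "$kind-$ifname"
        done
    done
}
```

Running missing_pbr_files with no argument checks the default directory; anything it prints is a file worth reviewing before the next reboot.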

Here are some example files. Note that, by convention, tables for physical interfaces start at 220 and tables for bonded ones at 210. The files basically specify that traffic goes back out whatever interface it came in on.
For more info, see http://www.pythian.com/news/36747/ad...olicy-routing/

Good luck and I hope this helps someone.

# cat rule-bondeth0
from 10.22.102.0/23 table 210
to 10.22.102.0/23 table 210

# cat route-bondeth0
10.22.102.0/23 dev bondeth0 table 210
default via 10.22.102.1 dev bondeth0 table 210

# cat route-eth0
10.22.100.0/23 dev eth0 table 220
default via 10.22.100.1 dev eth0 table 220

# cat rule-eth0
from 10.22.100.33 table 220
to 10.22.100.33 table 220
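
Since the files above all follow the same pattern, they can be generated rather than typed by hand. This is a rough sketch under that assumption; the function names print_rule_file and print_route_file are invented here, and the output should be checked against your own addressing before it goes anywhere near a production box.

```shell
# Hypothetical helper: print the contents of a rule-<if> file.
# $1 = address or prefix (e.g. 10.22.102.0/23), $2 = table number
print_rule_file() {
    printf 'from %s table %s\nto %s table %s\n' "$1" "$2" "$1" "$2"
}

# Hypothetical helper: print the contents of a route-<if> file.
# $1 = subnet, $2 = gateway, $3 = device, $4 = table number
print_route_file() {
    printf '%s dev %s table %s\ndefault via %s dev %s table %s\n' \
        "$1" "$3" "$4" "$2" "$3" "$4"
}

# Example, reproducing the bondeth0 files shown above:
#   print_rule_file  10.22.102.0/23 210
#   print_route_file 10.22.102.0/23 10.22.102.1 bondeth0 210
```

After installing the files and restarting networking, ip rule show and ip route show table 210 should list the new entries if they were picked up.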

TB0ne 01-15-2013 09:52 AM

Outstanding write-up, thank you for posting it.

treadwm 01-15-2013 09:57 AM

Quote:

Originally Posted by TB0ne (Post 4870579)
Outstanding write-up, thank you for posting it.

No problem. It was one of those issues where a search turned up many similar questions but not many answers.

LSeley 02-20-2013 01:28 PM

Thank you! We encountered this when we upgraded to 11.2.3.2.0/11.2.0.3. I stumbled across your post and saved it hours before we found the "martian source" messages, so once we did, we had a ready fix (as opposed to Oracle, which still hasn't given us an RCA over a month later). My search was for “oracle Exadata vip troubleshooting”; how lucky was that?

Our symptoms: everything upgraded just fine and crsctl reported that everything was up and running, but we couldn't connect to the databases from any other servers. tnspings from the other servers would time out with either TNS-12535 or ORA-12170. Pings to the IPs and the SCAN listener always got “100% packet loss”.

This Exadata had been installed and configured by Oracle ACS, and no changes had been made to the configuration after the initial install. We’d done some minor patching a year ago with no problems, and have rebooted everything a couple of times, again with no problems. We started out on 11.2.2.3.2, so based on your post and Marc Fielding’s blog, I would guess it was a configuration problem. It’s interesting that it took us 1.5 years to encounter the issue.

