Thought I'd share a routing problem (and its resolution) we had with Oracle Exadata. For those who aren't familiar with it, Exadata is an Oracle "managed" solution: basically a half rack of storage and Linux servers optimized to run Oracle Database. Ours runs Oracle's Unbreakable Enterprise Kernel.
After a software install, we were unable to connect to the database on this server. Connections are made to a VIP, which points to a bonded interface on the server. At first, none of the VIP addresses worked but the direct IP did. After a reboot of the servers, even that stopped working.
tcpdump showed traffic reaching the interface but no response. No denial, just... dropped. After more troubleshooting we discovered a new error in the messages log:
messages.1:Jan 9 16:59:14 exadb01 kernel: martian source 10.22.102.24 from 172.22.22.90, on dev bondeth0
Turns out that "martian source" is a cute way of saying the kernel considers the packet unroutable and drops it. Interesting. So, the packets from the .90 server are hitting the primary IP (.24) and being considered unroutable. Why? The default route looks good:
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.22.100.0     *               255.255.254.0   U     0      0        0 eth0
10.22.102.0     *               255.255.254.0   U     0      0        0 bondeth0
172.22.10.0     *               255.255.254.0   U     0      0        0 eth2
default         10.22.102.1     0.0.0.0         UG    0      0        0 bondeth0
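While comparing the log against the routing table, it helps to pull the addresses out of each martian line. As far as I know the kernel prints the destination first and the source second, which matches the line above. Here is a rough sketch; the function name martian_fields is my own invention, not anything Oracle ships.

```shell
# Hypothetical helper: read syslog lines on stdin and print the source,
# destination and device of every "martian source" entry.
# Assumes the kernel's "martian source <dst> from <src>, on dev <dev>" layout.
martian_fields() {
    sed -n 's/.*martian source \([0-9.]*\) from \([0-9.]*\), on dev \([^ ]*\).*/src=\2 dst=\1 dev=\3/p'
}

# Example (log path is an assumption, adjust for your system):
#   martian_fields < /var/log/messages | sort | uniq -c | sort -rn
```

Piping the result through sort and uniq -c, as in the comment, gives a quick count of which hosts are being rejected on which interface.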
The answer is that Oracle enables policy-based routing (PBR) by default on their kernels. Even if you don't make use of it, it must be properly configured. PBR requires a rule-* and a route-* file per interface. When Oracle set up the server they did that. However, interfaces were changed and added over time, and no one was aware of the additional configuration requirements.
The key point here is that it may work for a while without the rule/route files, but you never know when it will fail. It took us 3 days and the 'right' Oracle tech to fix. Once we added the files and restarted networking, it worked perfectly.
Example files are below. Note that, by convention, real interfaces start with table 220 and bonded ones with 210. The files basically specify that traffic goes back out whatever interface it came in on.
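Since the failure mode is a missing rule-* or route-* file, a quick audit can catch it before it bites. This is only a sketch: check_pbr_files is a made-up name, and I'm assuming the files live in /etc/sysconfig/network-scripts next to the ifcfg-* files, which is the usual place on Oracle Linux. Check your own layout first.

```shell
# Hypothetical helper: report interfaces that have an ifcfg-* file
# but no matching rule-* or route-* file.
# $1 = config directory (assumed default: /etc/sysconfig/network-scripts)
check_pbr_files() {
    dir=${1:-/etc/sysconfig/network-scripts}
    for cfg in "$dir"/ifcfg-*; do
        [ -e "$cfg" ] || continue                 # glob matched nothing
        ifname=${cfg##*/ifcfg-}
        [ "$ifname" = "lo" ] && continue          # loopback needs no policy routing
        grep -q '^MASTER=' "$cfg" && continue     # bond slaves carry no IP of their own
        for kind in rule route; do
            [ -e "$dir/$kind-$ifname" ] || echo "missing $kind-$ifname"
        done
    done
}
```

Run it with no argument to check the default directory. No output means every configured interface has both files. It does not validate what is inside them.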
For more info, see http://www.pythian.com/news/36747/ad...olicy-routing/
Good luck and I hope this helps someone.
# cat rule-bondeth0
from 10.22.102.0/23 table 210
to 10.22.102.0/23 table 210
# cat route-bondeth0
10.22.102.0/23 dev bondeth0 table 210
default via 10.22.102.1 dev bondeth0 table 210
# cat route-eth0
10.22.100.0/23 dev eth0 table 220
default via 10.22.100.1 dev eth0 table 220
# cat rule-eth0
from 10.22.100.33 table 220
to 10.22.100.33 table 220
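
To make it clearer what these files do: as far as I can tell, the network scripts hand each line to "ip rule add" or "ip route add" when the interface comes up. The sketch below only prints the roughly equivalent commands and changes nothing. show_pbr_commands is a hypothetical name of mine.

```shell
# Hypothetical helper: print the ip(8) command that each line of a
# rule-* or route-* file roughly corresponds to. Nothing is executed.
show_pbr_commands() {
    file=$1
    base=${file##*/}          # e.g. rule-bondeth0
    kind=${base%%-*}          # "rule" or "route"
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        echo "ip $kind add $line"
    done < "$file"
}
```

For rule-bondeth0 above this prints "ip rule add from 10.22.102.0/23 table 210" followed by the matching "to" line. After restarting networking, "ip rule show" and "ip route show table 210" should list the same entries.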