Oracle CRS reboots

srinishrews · 05-20-2015, 10:48 AM

Hi,
I am running a 6-node Oracle 11g RAC cluster with Cluster Ready Services (CRS) on Linux. All the hosts are virtual machines running RHEL 6.6 on ESX 5.5, with a dedicated heartbeat network for the interconnect and jumbo frames enabled.

The issue I'm seeing is that one or two nodes get rebooted when CRS can't communicate with the other nodes. I see the message below in /var/log/messages:

exec /apps/crs/GRID/11203/perl/bin/perl -I/apps/crs/GRID/11203/perl/lib /apps/crs/GRID/11203/bin/crswrapexece.pl /apps/crs/GRID/11203/crs/install/s_crsconfig_test01_env.txt /apps/crs/GRID/11203/bin/ohasd.bin "reboot"

I checked the storage I/O and can't find any high utilization; it's running at less than 10% all the time. The network is 10G and is not showing errors on the switch. Memory usage (about 60% on average) and CPU utilization (less than 40%) are normal.

Can someone suggest what other information would be useful to check? I am seeing nothing in the logs about errors or waits for disk writes. I believe it's a software issue, but I'm not sure how to prove it. Any suggestions are appreciated.
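
One way to gather more evidence is to pull the time and host of every such reboot line out of /var/log/messages on each node, so the reboots can be lined up against switch, ESX, and storage events. The following is only a minimal sketch: it assumes the classic syslog prefix ("May 20 10:31:02 hostname ..."), which the excerpt above does not show, and the names find_reboot_events and scan_file are hypothetical helpers, not part of any Oracle tooling.

```python
import re

# Marker copied from the /var/log/messages line quoted in the post.
REBOOT_MARKER = 'ohasd.bin "reboot"'

# Assumption: lines start with a classic syslog prefix,
# e.g. "May 20 10:31:02 hostname ...". Adjust if rsyslog uses another format.
SYSLOG_PREFIX = re.compile(r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)")


def find_reboot_events(lines):
    """Return a (timestamp, host) tuple for each line containing the marker.

    Lines that contain the marker but lack the assumed prefix
    are reported as (None, None) rather than silently dropped.
    """
    events = []
    for line in lines:
        if REBOOT_MARKER not in line:
            continue
        match = SYSLOG_PREFIX.match(line)
        if match:
            events.append((match.group(1), match.group(2)))
        else:
            events.append((None, None))
    return events


def scan_file(path):
    """Hypothetical convenience wrapper: scan one log file on disk."""
    with open(path, errors="replace") as log:
        return find_reboot_events(log)
```

Calling scan_file("/var/log/messages") on each node (as a user allowed to read the file) gives a list of reboot times that can be compared across the cluster.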

TB0ne · 05-20-2015, 10:54 AM

Since you're in a well-supported environment (RHEL 6.6, ESX, and Oracle 11g), you are paying for support from ALL of those vendors. The best way to diagnose this problem is to contact Oracle. They can have you run a trace and analyze it. If they don't find something, an SOS report to Red Hat might, and if neither of those bears fruit, you can then present your findings to VMware.

srinishrews · 05-20-2015, 11:02 AM

Thanks for the reply. I forgot to mention that we don't have support from Oracle. I contacted Red Hat and VMware, and they got back to me with no findings, asking me to contact Oracle. We do have a third party for Oracle support, but they are not much help either. They went through the logs and told me that it was a network issue, as the logs say CRS rebooted the node because the interconnect was not reachable.

So I was trying to find out whether any of the experts here have faced similar issues and could maybe give me some ideas on where to start troubleshooting.
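
Since the third party points at the interconnect and jumbo frames are enabled, one starting point is to confirm that full-size frames actually pass between every pair of nodes without fragmentation. The sketch below only builds the ping command for that test. It assumes an MTU of 9000 (the post says only "jumbo frames"), the Linux iputils ping options -M do, -s and -c, and a peer address supplied by the caller; the function names are hypothetical.

```python
import subprocess

IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def jumbo_ping_command(peer, mtu=9000, count=3):
    """Build an iputils ping command with the don't-fragment bit set.

    mtu=9000 is an assumption; use the MTU actually configured
    on the interconnect interface.
    """
    return [
        "ping",
        "-M", "do",                          # prohibit fragmentation
        "-c", str(count),                    # number of echo requests
        "-s", str(icmp_payload_for_mtu(mtu)),
        peer,
    ]


def jumbo_path_ok(peer, mtu=9000):
    """Run the ping; True if it exits with status 0 (replies received)."""
    result = subprocess.run(jumbo_ping_command(peer, mtu), capture_output=True)
    return result.returncode == 0
```

If a full-size ping fails between two private interconnect addresses while a default-size ping works, some hop (vSwitch, port group, or physical switch) is probably not passing jumbo frames, which would be consistent with CRS losing the interconnect.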

TB0ne · 05-20-2015, 12:24 PM

The most telling thing is that Red Hat and VMware told you there weren't any problems, which leaves you with Oracle. The question here is WHY on earth you would run Oracle RAC without support when you're paying for support on everything else.

Oracle can easily tell you what's up. Pay for support, and ask them. If your 'third party' won't help you, then don't pay them, since they're of no use. Pay Oracle directly. An Oracle trace will tell you what's up. It could very well be that a kernel module that's older (or NEWER) in RHEL is causing a problem. Are you patched/current with RHEL?