LinuxQuestions.org - RHEL 5 two node cluster getting error

Hi,

I'm running a RHEL 5 two-node cluster and it is logging the errors below:

[root@nocidsdb02 ~]# uname -a
Linux nocidsdb02.nlb.gov.sg 2.6.18-371.el5 #1 SMP Thu Sep 5 21:21:44 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

Jan 15 13:20:18 nocidsdb02 kernel: RPC: bad TCP reclen 0x47455420 (non-terminal)
Jan 15 13:20:18 nocidsdb02 kernel: RPC: bad TCP reclen 0x00620103 (large)
Jan 15 13:20:18 nocidsdb02 kernel: RPC: bad TCP reclen 0x4a524d49 (non-terminal)
Jan 15 13:20:30 nocidsdb02 ccsd[6603]: Unexpected communication type (542393671)... ignoring.
Jan 15 13:20:30 nocidsdb02 ccsd[6603]: Unexpected communication type (542393671)... ignoring.
Jan 15 13:20:30 nocidsdb02 ccsd[6603]: Unexpected communication type (50422400)... ignoring.
Jan 15 13:20:30 nocidsdb02 ccsd[6603]: Unexpected communication type (1229804106)... ignoring.
Jan 15 13:20:31 nocidsdb02 kernel: RPC: bad TCP reclen 0x47455420 (non-terminal)
Jan 15 13:20:31 nocidsdb02 kernel: RPC: bad TCP reclen 0x00620103 (large)
Jan 15 13:20:31 nocidsdb02 kernel: RPC: bad TCP reclen 0x4a524d49 (non-terminal)
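The hex and decimal values in these log lines can be decoded back into the bytes that arrived on the socket. Below is a minimal bash sketch. The function names `be32_to_ascii`, `le32_to_ascii` and `print_byte` are hypothetical helpers, not part of any cluster tool. The kernel prints the RPC record marker in network (big-endian) byte order. The ccsd decimal numbers decode to the same bytes when read as little-endian (x86) integers. For the "(large)" lines the kernel has, as far as I can tell from the sunrpc code, already cleared the top "last fragment" bit before printing. That means the first byte on the wire was 0x80 rather than 0x00.

```shell
#!/bin/bash
# Decode the 32-bit values from the "RPC: bad TCP reclen" and ccsd
# "Unexpected communication type" log lines into the bytes that arrived.
# Non-printable bytes are shown as '.'.

# Print one byte (0-255) as a character, or '.' if it is not printable ASCII.
print_byte() {
    local b=$1
    if [ "$b" -ge 32 ] && [ "$b" -le 126 ]; then
        # %b expands the \0nnn octal escape built here into the character.
        printf '%b' "\\0$(printf '%03o' "$b")"
    else
        printf '.'
    fi
}

# Kernel RPC lines: the value is in network (big-endian) byte order,
# so the highest byte arrived first.
be32_to_ascii() {
    local v=$(( $1 )) s
    for s in 24 16 8 0; do
        print_byte $(( (v >> s) & 255 ))
    done
    echo
}

# ccsd lines: the decimal value is the first 4 bytes read as a
# little-endian integer, so the lowest byte arrived first.
le32_to_ascii() {
    local v=$(( $1 )) s
    for s in 0 8 16 24; do
        print_byte $(( (v >> s) & 255 ))
    done
    echo
}

# Values copied from the log excerpt above.
for x in 0x47455420 0x00620103 0x4a524d49; do
    printf 'RPC  %-10s -> "%s"\n' "$x" "$(be32_to_ascii "$x")"
done
for x in 542393671 50422400 1229804106; do
    printf 'ccsd %-10s -> "%s"\n' "$x" "$(le32_to_ascii "$x")"
done
```

For this excerpt, the script prints "GET " for 0x47455420 and 542393671, and "JRMI" for 0x4a524d49 and 1229804106. It prints ".b.." for the remaining pair. On the wire that pair is 0x80 0x62 0x01 0x03, which resembles the start of an SSLv2-format TLS ClientHello. "JRMI" is the Java RMI protocol magic. One possible reading, not confirmed anywhere in this thread, is that something on the network is speaking HTTP, TLS and Java RMI to the RPC and ccsd ports. A vulnerability scanner or a monitoring probe would fit that pattern. On that reading, the cluster traffic itself is not being corrupted.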

[root@nocidsdb02 ~]# ifconfig
bond0 Link encap:Ethernet HWaddr 44:1E:A1:4A:1A:98
inet addr:192.168.1.52 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:3301089 errors:0 dropped:0 overruns:0 frame:0
TX packets:1838407 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:579770741 (552.9 MiB) TX bytes:341855214 (326.0 MiB)

eth0 Link encap:Ethernet HWaddr 3C:D9:2B:FD:1B:38
inet addr:172.30.13.110 Bcast:172.30.13.127 Mask:255.255.255.192
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:29846872 errors:0 dropped:0 overruns:0 frame:0
TX packets:56428596 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3006200476 (2.7 GiB) TX bytes:80404189768 (74.8 GiB)
Interrupt:170 Memory:f8000000-f8012800

eth2 Link encap:Ethernet HWaddr 44:1E:A1:4A:1A:98
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:3300295 errors:0 dropped:0 overruns:0 frame:0
TX packets:1838407 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:579713163 (552.8 MiB) TX bytes:341855214 (326.0 MiB)

eth3 Link encap:Ethernet HWaddr 44:1E:A1:4A:1A:98
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:794 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:57578 (56.2 KiB) TX bytes:0 (0.0 b)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:1256105 errors:0 dropped:0 overruns:0 frame:0
TX packets:1256105 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:404512516 (385.7 MiB) TX bytes:404512516 (385.7 MiB)

[root@nocidsdb02 ~]# ifconfig eth2 | grep MTU
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
[root@nocidsdb02 ~]# ifconfig eth3 | grep MTU
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
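The bond and all of its slaves can be compared in one pass rather than one `ifconfig | grep` at a time. This minimal sketch reads the standard sysfs files `/sys/class/net/<iface>/mtu`. The function name `check_mtu` and the `SYSFS_NET` override are hypothetical; the override exists only so the function can be pointed at a test directory. This only covers the server side; the switch ports still have to be checked on the switch.

```shell
#!/bin/bash
# Print the MTU of each interface given and return 1 if they differ
# (2 if an interface's mtu file cannot be read).
# SYSFS_NET defaults to the real sysfs tree; override it only for testing.
check_mtu() {
    local root=${SYSFS_NET:-/sys/class/net} first="" rc=0 ifc mtu
    for ifc in "$@"; do
        mtu=$(cat "$root/$ifc/mtu") || return 2
        echo "$ifc $mtu"
        if [ -z "$first" ]; then
            first=$mtu
        elif [ "$mtu" != "$first" ]; then
            rc=1   # this interface disagrees with the first one listed
        fi
    done
    return $rc
}

# Usage on this host (bond0 and its two slaves):
# check_mtu bond0 eth2 eth3 || echo "MTU mismatch"
```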

[root@nocidsdb02 ~]# clustat
Cluster Status for ids_db @ Wed Jan 15 14:32:43 2014
Member Status: Quorate

Member Name ID Status
------ ---- ---- ------
nocidsdbsvr01 1 Online, rgmanager
nocidsdbsvr02 2 Online, Local, rgmanager

Service Name Owner (Last) State
------- ---- ----- ------ -----
service:metalib nocidsdbsvr01 started
service:primo nocidsdbsvr01 started

When I fail the services back, node 2 reboots.
Can someone help me check this?

Quote:

Originally Posted by thirupathi (Post 5098230)

Hi,
I'm running a RHEL 5 two-node cluster and it is logging the errors below:

[same kernel "RPC: bad TCP reclen" and ccsd "Unexpected communication type" log lines as in the first post]

When I fail the services back, node 2 reboots. Can someone help me check this?

You've been using RHEL clustering since 2012:
http://www.linuxquestions.org/questi...up-4175427784/
http://www.linuxquestions.org/questi...lp-4175427826/

Please see some of the answers in your other threads, where you were directed to the clustering documentation:
https://access.redhat.com/site/docum...dministration/
https://access.redhat.com/site/solutions/22484

....and to Red Hat support. If you're using RHEL, you need to PAY FOR IT, which entitles you to support. Also, you only say RHEL5...not which version. The ONLY currently-supported version of RHEL5 is 5.9...and that's only with paid-for extended support. This is covered in the RHEL solutions guide, which you have access to since you're paying for RHEL. There are bugfixes which address such things...and again, they're available if you PAY for RHEL.
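The installed RHEL 5 minor release can be read on each node from `/etc/redhat-release`. To my knowledge, the `2.6.18-371.el5` kernel shown by `uname -a` in the first post is the RHEL 5.10 kernel, but the release file is the authoritative place to check. Below is a minimal sketch. The function name `rhel_minor` and its optional file argument are hypothetical.

```shell
#!/bin/bash
# Print the "5.x" release number from /etc/redhat-release (or from the
# file given as $1), e.g. "... Server release 5.10 (Tikanga)" -> 5.10.
rhel_minor() {
    sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p' "${1:-/etc/redhat-release}"
}

# Usage on each node, together with the running kernel:
# rhel_minor; uname -r
```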

Hi,

I'm still getting the same errors. Red Hat suggested checking multicast, the NIC drivers, and the MTU on both the network switch side and the server side.
1. I have updated the NIC drivers.
2. I'm not using jumbo frames.
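For the multicast part of that check, one approach is to compare the multicast address cman reports with the groups the bonded interface has actually joined. This sketch rests on two assumptions. First, that `cman_tool status` on this RHEL 5 build prints a "Multicast addresses:" line; verify this locally. Second, that `ip maddr show dev bond0` lists joined groups as `inet <address>` lines. The helper names `cluster_mcast` and `has_mcast_group` are hypothetical.

```shell
#!/bin/bash
# Extract the first cluster multicast address from `cman_tool status`
# output read on stdin (assumes a "Multicast addresses: ..." line).
cluster_mcast() {
    awk -F': *' '/^Multicast addresses/ { split($2, a, " "); print a[1]; exit }'
}

# Succeed if the address in $1 appears as an "inet" group in
# `ip maddr show` output read on stdin.
has_mcast_group() {
    awk -v g="$1" '$1 == "inet" && $2 == g { found = 1 } END { exit !found }'
}

# Usage on each node (assumes cman is running and bond0 carries cluster traffic):
# addr=$(cman_tool status | cluster_mcast)
# ip maddr show dev bond0 | has_mcast_group "$addr" && echo "bond0 joined $addr"
```

If the group is missing on a node, IGMP snooping on the switch is a common suspect. That is a separate check done on the switch.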
Below are the NIC statistics.

# ethtool -S eth3

NIC statistics:
rx_crc_errors: 0
rx_alignment_symbol_errors: 0
rx_pause_frames: 0
rx_control_frames: 0
rx_in_range_errors: 0
rx_out_range_errors: 0
rx_frame_too_long: 0
rx_address_mismatch_drops: 12224 <-------- Packets dropped
rx_dropped_too_small: 0
rx_dropped_too_short: 0
rx_dropped_header_too_small: 0
rx_dropped_tcp_length: 0
rx_dropped_runt: 0
rxpp_fifo_overflow_drop: 0
rx_input_fifo_overflow_drop: 0

# ethtool -S eth2

NIC statistics:
rx_crc_errors: 0
rx_alignment_symbol_errors: 0
rx_pause_frames: 0
rx_control_frames: 0
rx_in_range_errors: 0
rx_out_range_errors: 0
rx_frame_too_long: 0
rx_address_mismatch_drops: 657 <------ Packets dropped
rx_dropped_too_small: 0
rx_dropped_too_short: 0
rx_dropped_header_too_small: 0
rx_dropped_tcp_length: 0
rx_dropped_runt: 0
rxpp_fifo_overflow_drop: 0

Drivers

# ethtool -i eth2

driver: be2net
version: 4.2.116r
firmware-version: 4.6.247.5
bus-info: 0000:02:00.0

# ethtool -i eth3
driver: be2net
version: 4.2.116r
firmware-version: 4.6.247.5

# ethtool -k eth2
Cannot get device udp large send offload settings: Operation not supported
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: on <-----------

# ethtool -k eth3
Cannot get device udp large send offload settings: Operation not supported
Offload parameters for eth3:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: on

Red Hat suggested disabling "generic-receive-offload" on the NICs, but even after that change I'm still getting the same errors.
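The change Red Hat suggested can be applied and then verified on both bond slaves in one go. This minimal sketch assumes this ethtool build accepts the `gro` keyword for `-K`; the `generic-receive-offload` line in the `-k` output suggests it does. The function names are hypothetical. Settings changed with `ethtool -K` are runtime settings and do not survive a reboot on their own.

```shell
#!/bin/bash
# Turn off generic receive offload on each interface given (run as root).
disable_gro() {
    local ifc rc=0
    for ifc in "$@"; do
        ethtool -K "$ifc" gro off || { echo "failed to change $ifc" >&2; rc=1; }
    done
    return $rc
}

# Return 1 (and say which) if GRO is still reported as on for any interface.
verify_gro_off() {
    local ifc rc=0
    for ifc in "$@"; do
        if ethtool -k "$ifc" 2>/dev/null | grep -q 'generic-receive-offload: on'; then
            echo "$ifc: generic-receive-offload is still on" >&2
            rc=1
        fi
    done
    return $rc
}

# Usage on each node:
# disable_gro eth2 eth3 && verify_gro_off eth2 eth3 && echo "GRO off on eth2 and eth3"
```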

Can anyone help me get rid of these errors?

thanks,

Quote:

Originally Posted by thirupathi (Post 5116751)

Hi,
I'm still getting the same errors. Red Hat suggested checking multicast, the NIC drivers, and the MTU on both the network switch side and the server side.
1. I have updated the NIC drivers.
2. I'm not using jumbo frames.

Red Hat suggested disabling "generic-receive-offload" on the NICs, but even after that change I'm still getting the same errors. Can anyone help me get rid of these errors?

So, Red Hat told you to disable a feature to overcome the problem...and you left it on, and are now asking how to solve the problem??? Why haven't you done what Red Hat told you to do?

Start by doing that, and see what happens.

I have disabled "generic-receive-offload" on the NICs as Red Hat suggested, but the errors still appear.

[root@nocidsdb01 ~]# ethtool -k eth2
Offload parameters for eth2:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: off

Feb 14 17:42:15 nocidsdb01 kernel: RPC: bad TCP reclen 0x47455420 (non-terminal)
Feb 14 17:42:15 nocidsdb01 kernel: RPC: bad TCP reclen 0x00620103 (large)
Feb 14 17:42:15 nocidsdb01 kernel: RPC: bad TCP reclen 0x4a524d49 (non-terminal)
Feb 14 17:42:24 nocidsdb01 kernel: RPC: bad TCP reclen 0x47455420 (non-terminal)
Feb 14 17:42:24 nocidsdb01 kernel: RPC: bad TCP reclen 0x00620103 (large)
Feb 14 17:42:24 nocidsdb01 kernel: RPC: bad TCP reclen 0x4a524d49 (non-terminal)
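Disabling GRO did not change anything. The reclen values decode to ASCII such as "GET " and "JRMI", so a reasonable next step is to see which host opens connections to these ports when the messages appear. This sketch assumes tcpdump is installed. It also assumes the ports of interest are first looked up locally with `netstat -tlnp | grep ccsd` and `rpcinfo -p`. The helper name `syn_filter` is hypothetical. PORT1 and PORT2 in the usage line are placeholders for the ports those commands report.

```shell
#!/bin/bash
# Build a tcpdump filter matching new TCP connections (SYN set, ACK clear)
# to any of the destination ports given as arguments.
syn_filter() {
    local p ports=""
    for p in "$@"; do
        ports="${ports:+$ports or }dst port $p"
    done
    echo "tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn and ($ports)"
}

# Usage (as root); -i any also covers eth0 as well as bond0:
# tcpdump -nn -i any "$(syn_filter PORT1 PORT2)"
```

Matching the source addresses against the timestamps of the "bad TCP reclen" lines should show whether a single scanner or monitoring host is responsible.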

Quote:

Originally Posted by thirupathi (Post 5117496)

Ok...so what did Red Hat say after you told them this? If you're paying for support from them, then you should be using it. I'm surprised they haven't had you run some more diagnostics...or have they? Did you call them back and tell them their suggested solution didn't work, and ask for someone who supports clustering specifically?

Hi,

I have sent sosreports to Red Hat and they are still checking with their experts. My case has been open with Red Hat for 50 days and they still cannot find the cause.
I have changed some settings as they suggested, but I'm still getting the error.

thks..

Quote:

Originally Posted by thirupathi (Post 5119071)

Hi,
I have sent sosreports to Red Hat and they are still checking with their experts. My case has been open with Red Hat for 50 days and they still cannot find the cause.
I have changed some settings as they suggested, but I'm still getting the error.

Sorry, but I have to doubt this. You are PAYING FOR support, and for Red Hat not to escalate the issue seems very odd. And if this is a production system, how has your employer not escalated this issue further, since leaving a production system having problems for almost two months would be a bad thing.

Also, there are some mentions of bugfixes in the RHEL customer portal (which you can access with your RHEL subscription; ask RHEL support for help) which address things that sound similar. You STILL don't say what version of RHEL5, but again we will tell you that only 5.9 is currently supported under extended support.