Troubleshooting Socket Issues

JockVSJock · 06-29-2015, 11:36 AM

I'm not sure where the issue is, but I have a RHEL 5 server running an Oracle database along with a Java-based application.

The end users try to run this Java-based application from their PCs and it takes a long time to connect. Sometimes it doesn't connect at all. All network traffic is going across the LAN. I've also been able to ping and traceroute from the server back to a PC and vice versa, so I know it's not iptables.

The DBA is looking into the database side and I'm looking at the OS side. The error message references an ORA message, but I still want to do my part and make sure it's not the OS.

I can see the network connections seem to be ok with the following command:

Code:

netstat -ap | grep ESTA
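Grepping for ESTA only shows established connections. A rough sketch like the one below would also show connections stuck in other states (SYN_RECV, TIME_WAIT, CLOSE_WAIT). The function name `count_states` is made up for illustration, and port 1521 is assumed because it is the Oracle listener default.

```shell
# Summarize TCP connection states for one local port.
# Reads `netstat -ant` output on stdin; the port is the first argument.
count_states() {
    awk -v port="$1" '
        # TCP rows: $4 is the local address, $6 is the connection state
        $1 ~ /^tcp/ && $4 ~ (":" port "$") { n[$6]++ }
        END { for (s in n) print s, n[s] }
    ' | sort
}

# Usage on the server (assumes the listener is on 1521):
#   netstat -ant | count_states 1521
```

A large number of connections in anything other than ESTABLISHED while users are waiting would be worth a closer look.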

I'm also looking at socket settings, such as the settings under /proc/sys/net/ipv4

Code:

[root@foo ipv4]# cat tcp_keepalive_intvl ; cat tcp_keepalive_probes ; cat tcp_keepalive_time
75
9
7200
[root@foo ipv4]#
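For reference, these three values combine into the time the kernel takes to give up on an idle connection whose peer has gone away: the idle time plus one interval per unanswered probe. A small sketch of that arithmetic (the function name `keepalive_detect_secs` is hypothetical):

```shell
# Seconds before the kernel declares an idle, unreachable peer dead.
# Arguments: tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes.
keepalive_detect_secs() {
    echo $(( $1 + $2 * $3 ))
}

keepalive_detect_secs 7200 75 9   # values from the output above; prints 7875

# On a live box the same values can be read straight from /proc:
#   cd /proc/sys/net/ipv4
#   keepalive_detect_secs $(cat tcp_keepalive_time) $(cat tcp_keepalive_intvl) $(cat tcp_keepalive_probes)
```

These are the stock kernel defaults, and keepalive only applies to idle sockets that have SO_KEEPALIVE set, so they probably are not related to slow connection setup.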

I've also looked at the following values under /proc/sys/net/core

Code:

[root@ameda4aisrx0223 core]# cat rmem_default ; cat rmem_max ; cat wmem_default ; cat wmem_max ; cat optmem_max
4194304
4194304
262144
1048576
10240
[root@ameda4aisrx0223 core]#

Is there anything else that I should look at to troubleshoot, or have I taken it as far as I can take it?

thanks

JockVSJock · 06-29-2015, 02:23 PM

I ran tcpdump against the interface (eth0) while traffic was being sent to port 1521 and noticed this:

Code:

192.168.50.8.50144 > destination: P, cksum 0xe0c1 (correct), 58632:58653(21) ack 692526 win 11 
13:03:10.801291 IP (tos 0x0, ttl 64, id 2774, offset 0, flags [DF], proto: TCP (6), length: 52) 
destination > 192.168.50.8.50144: P, cksum 0xe596 (incorrect (-> 0xef59), 
692526:692538 (12) ack 58653 win 218

I'm not sure what is going on with the incorrect checksum on the packet from the destination server to 192.168.50.8.

I looked at the settings for eth0 using ethtool and read a few blogs where either checksum offloading or TCP offloading is turned off. However, this is the first time I have seen this, so I'm not sure what the best course of action would be:

Code:

[root@foo core]# ethtool eth0

Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   1000baseT/Full
        Supports auto-negotiation: No
        Advertised link modes:  Not reported
        Advertised auto-negotiation: No
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Link detected: yes

 
[root@foo core]# ethtool -k eth0

Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: off

padeen · 06-30-2015, 09:18 PM

Isn't that error msg saying that the destination sent a corrupt packet? I may be laying a red herring, but that would tie in with the symptom of long login times while corrupt pkts are discarded until valid ones are received. Have you run the pings for a huge number of datagrams and from the Oracle machine to yours? OTOH ping datagrams may not necessarily show up as corrupt in any case.

I would be looking at the (possibly failing) nic card for replacement on the Oracle machine. Since they're so cheap, it's an easy and quick elimination of a possible contributor to the problem. (Of course, it could be your nic that is calculating the checksum incorrectly or it could be the cable corrupting the data. The joys of network troubleshooting...)

JockVSJock · 06-30-2015, 09:40 PM

Quote:

Originally Posted by padeen

I would be looking at the (possibly failing) nic card for replacement on the Oracle machine. Since they're so cheap, it's an easy and quick elimination of a possible contributor to the problem. (Of course, it could be your nic that is calculating the checksum incorrectly or it could be the cable corrupting the data. The joys of network troubleshooting...)

Crap, I forgot to mention that this is a VM in VMware vCenter, not a physical machine, which introduces a whole new level of complexity to the situation.

This VM shares a datastore on a SAN with a number of other VMs, and they don't seem to have any networking issues, or at least I don't see any issues with them.

I didn't think of the idea of trying to send bigger packets from the Oracle machine to the client. I would have to read up on this because I've never done this before.

padeen · 07-01-2015, 03:29 AM

No, I didn't mean bigger datagrams, I meant more. I was thinking along the lines of a degrading NIC that occasionally sends corrupt packets, in which case you would have to capture a lot of them to see this.
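One way to look at a lot of packets at once is to let tcpdump capture a few thousand and count how many are flagged. This is only a sketch: the helper name `count_bad_cksum` is made up, and the interface, port and packet count are taken from or assumed for this thread.

```shell
# Count lines flagged with a bad TCP checksum in verbose tcpdump output (stdin).
count_bad_cksum() {
    grep -c 'incorrect'
}

# Usage: capture 5000 packets to or from port 1521 and count the flagged ones
#   tcpdump -i eth0 -nn -vv -c 5000 tcp port 1521 2>/dev/null | count_bad_cksum
```

A handful of bad packets out of thousands would fit the failing hardware theory, while a flag on nearly every packet in one direction would point somewhere else.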

As to the VMware, I can't offer any help as I don't use it.

JockVSJock · 07-01-2015, 10:02 AM

This is what I did to fix the incorrect checksum error.

I changed the driver tied to the NIC in RHEL from Flexible to VMXNET3. After that we ran the test again, and watching the tcpdump traffic I no longer see the incorrect checksum error. The error we are getting now says: Socket red time out62000

We are still getting the socket error, but I'm starting to lean towards the issue being with the software and how it is trying to connect.
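To confirm the guest actually picked up the new adapter after a change like this, the driver bound to the interface can be read with `ethtool -i`. A small sketch (the helper name `nic_driver` is made up, and expecting `vmxnet3` as the driver name for a VMXNET3 adapter is an assumption):

```shell
# Extract the driver name from `ethtool -i` output (read from stdin).
nic_driver() {
    awk -F': ' '$1 == "driver" { print $2 }'
}

# Usage on the server:
#   ethtool -i eth0 | nic_driver
```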