TCP handshake fails, SYN/ACK ignored by system.
Hi,
We are experiencing an intermittent TCP handshake problem between two of our servers. 10.2.2.21 runs Apache on CentOS 5 and acts as a proxy for 10.2.2.30, which runs IIS on Windows 2003. Below is a tcpdump capture showing a normal TCP handshake, followed by one where the handshake fails. When this happens a request stalls, and the wait is always exactly 45 seconds.
All port 80 traffic from the internet is forwarded by a DNAT rule on our border router (CentOS 5 running a Shorewall/iptables firewall) to 10.2.2.21, and the Apache proxy then hands it over to IIS on 10.2.2.30.
The capture is identical on both machines, so the SYN/ACK is actually received by the network interface on the Linux server. For some reason the TCP stack doesn't respond with an ACK for quite some time and keeps retransmitting SYNs as if the SYN/ACK had never arrived.
Normal:
No. Time Source Destination Protocol Info
1298 1.995355 10.2.2.21 10.2.2.30 TCP 60447 > http [SYN] Seq=0 Len=0 MSS=1460 TSV=3170731083 TSER=0 WS=7
1299 1.995383 10.2.2.30 10.2.2.21 TCP http > 60447 [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460 WS=0 TSV=0 TSER=0
1300 1.995500 10.2.2.21 10.2.2.30 TCP 60447 > http [ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=3170731083 TSER=0
1301 1.995755 10.2.2.21 10.2.2.30 HTTP GET /home.asp HTTP/1.1
Problem:
No. Time Source Destination Protocol Info
2244 7.564111 10.2.2.21 10.2.2.30 TCP 60527 > http [SYN] Seq=0 Len=0 MSS=1460 TSV=3170736652 TSER=0 WS=7
2245 7.564145 10.2.2.30 10.2.2.21 TCP http > 60527 [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460 WS=0 TSV=0 TSER=0
2246 10.564549 10.2.2.21 10.2.2.30 TCP 60527 > http [SYN] Seq=0 Len=0 MSS=1460 TSV=3170739652 TSER=0 WS=7
2247 10.732139 10.2.2.30 10.2.2.21 TCP http > 60527 [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460 WS=0 TSV=0 TSER=0
2248 16.564328 10.2.2.21 10.2.2.30 TCP 60527 > http [SYN] Seq=0 Len=0 MSS=1460 TSV=3170745652 TSER=0 WS=7
2249 17.294295 10.2.2.30 10.2.2.21 TCP http > 60527 [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460 WS=0 TSV=0 TSER=0
2250 28.563889 10.2.2.21 10.2.2.30 TCP 60527 > http [SYN] Seq=0 Len=0 MSS=1460 TSV=3170757652 TSER=0 WS=7
2251 52.563006 10.2.2.21 10.2.2.30 TCP 60527 > http [SYN] Seq=0 Len=0 MSS=1460 TSV=3170781652 TSER=0 WS=7
2252 52.563040 10.2.2.30 10.2.2.21 TCP [TCP Previous segment lost] http > 60527 [SYN, ACK] Seq=11717498 Ack=1 Win=16384 Len=0 MSS=1460 WS=0 TSV=0 TSER=0
2253 52.563150 10.2.2.21 10.2.2.30 TCP 60527 > http [ACK] Seq=1 Ack=11717499 Win=5888 Len=0 TSV=3170781652 TSER=0
2254 52.563473 10.2.2.21 10.2.2.30 HTTP GET /userecncount.asp?id=585 HTTP/1.1
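To make the 45 second figure concrete, here is a small Python sketch that computes the gaps between the SYN retransmissions in the failing trace. The timestamps are copied from the capture above; the helper name `retransmit_gaps` is just something I made up for illustration.

```python
# Timestamps (in seconds) of the five SYN packets sent from port 60527,
# copied from packets 2244, 2246, 2248, 2250 and 2251 in the failing trace.
syn_times = [7.564111, 10.564549, 16.564328, 28.563889, 52.563006]


def retransmit_gaps(times):
    """Return the delay between each consecutive pair of timestamps."""
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


gaps = retransmit_gaps(syn_times)
total_wait = round(syn_times[-1] - syn_times[0], 3)

print("gaps between SYNs:", gaps)
print("first SYN to last SYN:", total_wait)
```

Rounded to whole seconds the gaps are 3, 6, 12 and 24, so the retransmission delay doubles each time and the total of 45 seconds matches the stall we see. The handshake only completes on the fifth SYN.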
Both servers are Dell PowerEdge 2950 quad-core machines with 3 GB of memory, and neither is overloaded by the processes it runs. The Linux server has 2 interfaces bonded in a failover configuration, and the Windows machine has 2 interfaces in a failover teaming configuration. The teaming and bonding are set up so that only one interface is live; there is no load balancing. As the capture is identical on both machines, I doubt the interfaces are the culprit.
The network bandwidth on the Linux machine is nowhere near saturated either.
Something that might be related: running 'netstat -antp' on the Linux machine shows that there are always around 2500 sockets in the TIME_WAIT state. This is because we have a lot of database connections going from this machine to a PostgreSQL server.
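For anyone who wants to check the same count without netstat, here is a rough Python sketch that tallies socket states from text in the /proc/net/tcp format, where the fourth column is the state in hex (01 is ESTABLISHED, 06 is TIME_WAIT, 0A is LISTEN). The `SAMPLE` rows are made up for illustration and are not from our server; in particular the remote address on port 5432 is a placeholder, and `count_states` is a hypothetical helper name.

```python
from collections import Counter

# Subset of the kernel's TCP state codes as they appear in /proc/net/tcp.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}

# Made-up sample in /proc/net/tcp layout (addresses are little-endian hex).
SAMPLE = """
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0050 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1001
   1: 1502020A:EC1F 1E02020A:0050 01 00000000:00000000 00:00000000 00000000    48        0 1002
   2: 1502020A:D431 2802020A:1538 06 00000000:00000000 03:00000B2F 00000000     0        0 0
   3: 1502020A:D432 2802020A:1538 06 00000000:00000000 03:00000B30 00000000     0        0 0
"""


def count_states(proc_net_tcp_text):
    """Count sockets per TCP state in /proc/net/tcp formatted text."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Data rows start with a slot number such as "0:"; skip everything else.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        state_code = fields[3]
        counts[TCP_STATES.get(state_code, state_code)] += 1
    return counts


print(count_states(SAMPLE))
```

On a live Linux machine the same function could be fed the real file, for example `count_states(open("/proc/net/tcp").read())`, which should roughly agree with the TIME_WAIT count from netstat for IPv4 sockets.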
As far as I know, the /proc/sys/fs/file-max value is 295328, so isn't that the maximum number of sockets the system can open? Surely 2500 sockets in TIME_WAIT shouldn't congest the TCP stack?
And surely an Apache configuration option wouldn't influence a low-level TCP handshake, as this happens at layer 4?
Thanks,
Last edited by xnomad; 04-23-2008 at 12:16 AM.