Should a TCP connection be able to survive a short (few-second) interruption?
I have a few machines configured in the following layout:
BOX1----[LAN]----BOX2----BOX3
Box3 can't be connected directly to the LAN, so all its traffic is routed via Box2. I have instances of a legacy application running on Box1 and Box3 that communicate over TCP sockets and, normally, this works just fine. The problem is that, every few hours, the physical connection between Box2 and Box3 may drop for a couple of seconds before being quickly restored.
Until recently, the application recovered fine when this happened. Nothing was done at the application level to facilitate the recovery. All each instance of the application saw was a few seconds of no data received, followed by a second of more data than normal, and then everything was back to normal. I'm no expert, but I assumed this was down to TCP's robustness.
My problems started when we tried to install this setup on a new set of machines. The physical configuration is exactly the same and the software hasn't changed. We did upgrade the operating system (from RHEL 5 to RHEL 6), and Box3 is actually running a much older network card for reasons beyond my control. Everything works fine, except that when the Box2-Box3 connection is interrupted, the TCP connection doesn't recover. Watching with netstat, we can see each port's Send-Q climbing indefinitely. We have to manually restart the application on Box1 and Box3 (the legacy application has no provision for resetting its sockets).
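In case a changed kernel default is involved, here is a rough sketch of how the TCP-related settings could be dumped on the old and new boxes for comparison. The function name `read_tcp_settings` and the choice of tunables are my own assumptions about what might be relevant, not something I know to be the cause, and it assumes a Python version with `with` statement support is available on the machine.

```python
import os

# Linux TCP tunables that influence retransmission and keepalive behaviour.
# Which of these (if any) matters here is an assumption, not a diagnosis.
TCP_SETTINGS = (
    "tcp_retries1",
    "tcp_retries2",
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)


def read_tcp_settings(base_dir="/proc/sys/net/ipv4"):
    """Return a dict of tunable name -> value string (None if unreadable)."""
    settings = {}
    for name in TCP_SETTINGS:
        path = os.path.join(base_dir, name)
        try:
            with open(path) as handle:
                settings[name] = handle.read().strip()
        except (IOError, OSError):
            # The file is missing or unreadable on this system.
            settings[name] = None
    return settings


if __name__ == "__main__":
    for name, value in sorted(read_tcp_settings().items()):
        print("%s = %s" % (name, value))
```

Running this on a RHEL 5 box and a RHEL 6 box and comparing the two outputs would at least show whether any of these defaults differ between the installations.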
So my question is: does anyone have any idea what could be causing the difference in behaviour? Was I wrong to assume that the older network card couldn't have any effect on a TCP socket? Is there a Red Hat configuration option I might be missing that could cause this? I'm hoping to eventually add an application-level heartbeat mechanism that reconnects when this happens, but for now I just want to figure out what's going on.
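For reference, the heartbeat I have in mind would be roughly along these lines. This is only a sketch of the timeout bookkeeping: the class name `HeartbeatMonitor` and the injectable `clock` parameter are hypothetical, and none of this exists in the legacy application yet.

```python
import time


class HeartbeatMonitor:
    """Tracks when data was last received and flags a stalled connection."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = clock()

    def record_activity(self):
        # Call whenever any data (or a heartbeat message) arrives from the peer.
        self._last_seen = self._clock()

    def is_stalled(self):
        # True once the peer has been silent for longer than the timeout.
        return self._clock() - self._last_seen > self.timeout_seconds
```

The receive loop would call `record_activity()` on every message, and a periodic check of `is_stalled()` would trigger closing and reopening the socket. The timeout would need to be longer than the few-second outages that TCP used to ride out on its own.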
Thanks for the help.