Network breakdowns involving TCP RST messages

downer · 05-15-2012, 07:24 AM

Hello everyone.

While this is not technically a Linux-specific issue, all the clients involved and the server are running CentOS 6 (kernel 2.6.32-220), so I hope I am justified in posting this here.

Here is the setup:
A server handling DHCP, DNS, NFS, NIS and proxy services for a network of 15 clients, all running CentOS 6. The clients access the internet through a Squid proxy on port 3128. They also access an HTTP server in our network through the proxy. The server has two network interfaces, one connecting to the web and our own HTTP server, and one connecting to the clients.

Here's what's happening:
At irregular intervals, the clients' internet connection breaks down for a short period (30 seconds give or take). All clients are affected simultaneously, and connections to the http server in our network also break down.

Here's what I did:
I ran Wireshark on both of the server's interfaces to see what, if anything, was happening during the breakdowns. I noticed that, during each breakdown, a high number of TCP RST messages of this form (from servers on the internet):

Code:

26234	51.937961	80.239.230.169	192.168.2.35	TCP	http > 39637 [RST] Seq=1 Win=0 Len=0

or from our own proxy:

Code:

24879	44.966533	192.168.168.16	192.168.2.22	TCP	ndl-aas > 57333 [RST] Seq=5280 Win=0 Len=0

occur.
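
To see which hosts send most of the resets, summary lines like the two above can be tallied by source address. The following is only a minimal sketch: it assumes the packet list was exported from Wireshark as tab-separated text with the columns number, time, source, destination, protocol and info, and the function names are made up for illustration.

```python
from collections import Counter


def parse_summary_line(line):
    """Split one tab-separated summary line into a dict, or return None.

    Assumed column order: No., Time, Source, Destination, Protocol, Info.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        return None
    number, time, src, dst, proto, info = fields[:6]
    try:
        return {
            "no": int(number),
            "time": float(time),
            "src": src,
            "dst": dst,
            "proto": proto,
            "info": info,
        }
    except ValueError:
        # Header rows or damaged lines end up here.
        return None


def count_resets_by_source(lines):
    """Count TCP segments whose info column shows the RST flag, per source IP."""
    counts = Counter()
    for line in lines:
        rec = parse_summary_line(line)
        # "[RST" matches both "[RST]" and "[RST, ACK]".
        if rec and rec["proto"] == "TCP" and "[RST" in rec["info"]:
            counts[rec["src"]] += 1
    return counts


sample = [
    "26234\t51.937961\t80.239.230.169\t192.168.2.35\tTCP\thttp > 39637 [RST] Seq=1 Win=0 Len=0",
    "24879\t44.966533\t192.168.168.16\t192.168.2.22\tTCP\tndl-aas > 57333 [RST] Seq=5280 Win=0 Len=0",
]
print(count_resets_by_source(sample).most_common())
```

If one source dominates the tally (for example the proxy address), that narrows down where to look first.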

I did some googling, and, from sources too numerous to list, constructed the following scenario:
Sometimes, when a client terminates a connection to a server, including our own proxy, that message never makes it to the server, either because somewhere on the way, a network device does not pass it on, or because it is already malformed coming from the client. So, from the information available to each of them, the server assumes that there is a connection, while the client assumes that there is not. After a while, the server determines that there is no traffic on the connection and ends it with a TCP reset message to the client. The client, assuming that the connection was already terminated, resets its current connection instead, resulting in an error for the user. Since servers flush inactive connections periodically, or perhaps because the improperly terminated connections happen around the same time, the entire network is affected at the same time.

Now, to my questions:
1. Is this scenario possible? Maybe a modified form of it? I pretty much turned my network upside down to find the problem, since it annoys the heck out of my users, so I'm grasping at straws to find an explanation.
2. If not, what else could cause this behaviour? From what I gather, servers don't send TCP reset messages without good reason, and IIRC older Wireshark logs (which I fail to find right now) have displayed no such messages (or at least not so many), nor are they prominent in current logs when the network is running fine.
3. What tools could I use to pinpoint the error? How do I determine what prompts the resets, and whether the resets cause the errors that keep pestering my users? I already borrowed an HP ProCurve switch with built-in monitoring to replace my Linksys switches, but I have yet to find a null-modem cable to program it, and since those cables seem to have gone out of style with disco, I'd like an opinion on whether to bother looking for one any more. The idea was to monitor traffic at the switch rather than the server, to see if the switch loses any packets.

Sorry, that was a lot of text, especially for a first post on this forum. I hope it made sense, but as you may imagine, I'm a little frustrated and confused right now and have a lot of upset users knocking on my door asking when the network will be back to normal.

Hoping for some help,
André

nikmit · 05-15-2012, 07:40 AM

I think it is more likely that the resets are a symptom of the problem rather than its cause. For an RST packet to reset a connection, the source and destination IP addresses and ports have to match, and so does the sequence number. A 'forgotten' RST packet will not have the correct sequence number.
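
As a rough illustration of that check, here is a sketch of the classic acceptance rule from RFC 793 for an established connection. Under that rule the sequence number does not have to be an exact match, it only has to fall inside the receiver's window; stacks that follow RFC 5961 are stricter and insist on an exact match. The `Connection` type and the function name are hypothetical, and real kernels do considerably more than this.

```python
from typing import NamedTuple

SEQ_MOD = 2 ** 32  # TCP sequence numbers wrap around at 32 bits


class Connection(NamedTuple):
    """Hypothetical, simplified view of one established connection."""
    local_ip: str
    local_port: int
    remote_ip: str
    remote_port: int
    rcv_nxt: int  # next sequence number expected from the peer
    rcv_wnd: int  # current receive window in bytes


def rst_is_acceptable(conn, src_ip, src_port, dst_ip, dst_port, seq):
    """Return True if an incoming RST would tear down `conn` (RFC 793 rule)."""
    same_connection = (
        (src_ip, src_port) == (conn.remote_ip, conn.remote_port)
        and (dst_ip, dst_port) == (conn.local_ip, conn.local_port)
    )
    if not same_connection:
        return False
    # Distance of the segment's sequence number from rcv_nxt, modulo 2**32.
    offset = (seq - conn.rcv_nxt) % SEQ_MOD
    if conn.rcv_wnd == 0:
        # With a closed window only the exact next sequence number is valid.
        return offset == 0
    return offset < conn.rcv_wnd


conn = Connection("192.168.2.35", 39637, "80.239.230.169", 80,
                  rcv_nxt=1000, rcv_wnd=500)
print(rst_is_acceptable(conn, "80.239.230.169", 80, "192.168.2.35", 39637, 1000))
print(rst_is_acceptable(conn, "80.239.230.169", 80, "192.168.2.35", 39637, 5000))
```

A stale RST from an old connection usually fails either the address and port comparison or the window check, so it is dropped instead of killing a live connection.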

The resets could be caused by timeouts, which in turn can have various causes.

Dump the traffic for a single host to a file over a prolonged period of time, so you can capture the entire blackout. Writing with -w stores the raw packets, so the file can be opened in Wireshark afterwards.
tcpdump -w /somedir/somefile host 1.2.3.4 and tcp and port 80
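
With a long capture in hand, the blackout periods can be located by bucketing the timestamps of the RST packets and looking for windows with unusually many of them. This is just a sketch: the function name, the 5-second bucket and the threshold of 20 are arbitrary assumptions, and the timestamps are expected as seconds relative to the start of the capture, however you extract them.

```python
from collections import Counter


def reset_bursts(timestamps, bucket_seconds=5.0, threshold=20):
    """Return (bucket_start_seconds, rst_count) for every busy time bucket.

    `timestamps` holds the capture-relative times, in seconds, of the RST
    packets. A bucket is reported when it contains at least `threshold` RSTs.
    """
    buckets = Counter(int(t // bucket_seconds) for t in timestamps)
    return [
        (index * bucket_seconds, count)
        for index, count in sorted(buckets.items())
        if count >= threshold
    ]


# Made-up data: a few stray resets, then 25 resets between 12.0 s and 14.4 s.
times = [0.1, 0.2, 0.3] + [12.0 + i * 0.1 for i in range(25)]
print(reset_bursts(times))
```

Comparing the reported windows with the times your users noticed the outage shows whether the resets line up with the breakdowns at all.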

nini09 · 05-15-2012, 02:20 PM

How heavy is the traffic when the issue occurs? If your server can't handle heavy traffic, a lot of RST packets will be generated on both the client and the server.