Hello everyone.
While this is not technically a Linux-specific issue, all the clients involved and the server are running CentOS6 (kernel 2.6.32-220), so I hope I am justified in posting this here.
Here is the setup:
A server handling DHCP, DNS, NFS, NIS and proxy services for a network of 15 clients, all running CentOS6. The clients access the internet through a Squid proxy on port 3128. They also access an HTTP server in our network through the proxy. The server has two network interfaces: one connecting to the internet and our own HTTP server, and one connecting to the clients.
Here's what's happening:
At irregular intervals, the clients' internet connection breaks down for a short period (30 seconds, give or take). All clients are affected simultaneously, and connections to the HTTP server in our network also break down.
Here's what I did:
I ran Wireshark on both interfaces of the server to see what, if anything, was happening during the breakdowns. During these periods I noticed a high number of TCP RST messages. Some come from servers on the internet and look like this:
Code:
26234 51.937961 80.239.230.169 192.168.2.35 TCP http > 39637 [RST] Seq=1 Win=0 Len=0
Others come from our own proxy:
Code:
24879 44.966533 192.168.168.16 192.168.2.22 TCP ndl-aas > 57333 [RST] Seq=5280 Win=0 Len=0
I did some googling, and, from sources too numerous to list, constructed the following scenario:
Sometimes, when a client terminates a connection to a server (including our own proxy), the message that closes the connection never makes it to the server, either because a network device somewhere on the way does not pass it on, or because it is already malformed when it leaves the client. So, going by the information available to each of them, the server assumes that there is a connection, while the client assumes that there is not. After a while, the server determines that there is no traffic on the connection and ends it with a TCP reset message to the client. The client, assuming that the connection was already terminated, resets its current connection instead, resulting in an error for the user. Since servers flush inactive connections periodically, or perhaps because the improperly terminated connections happen around the same time, the entire network is affected at the same time.
Now, to my questions:
1. Is this scenario possible? Maybe a modified form of it? I pretty much turned my network upside down to find the problem, since it annoys the heck out of my users, so I'm grasping at straws to find an explanation.
2. If not, what else could cause this behaviour? From what I gather, servers don't send TCP reset messages without good reason, and IIRC older Wireshark logs (which I can't find right now) showed no such messages (or at least not so many), nor are they prominent in current logs when the network is running fine.
3. What tools could I use to pinpoint the error? How do I determine what prompts the resets, and whether the resets cause the errors that keep pestering my users? I already borrowed an HP ProCurve switch with built-in monitoring to replace my Linksys switches, but I have yet to find a null modem cable to program it, and since those cables seem to have gone out of style with disco, I'd like an opinion on whether to bother looking for one any more. The idea was to monitor traffic at the switch rather than the server, to see if the switch loses any packets.
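To check whether the resets really cluster around the outages, one rough approach would be to export the Wireshark packet list as plain text and tally the RST lines by time window and by sender. The following is only a sketch under some assumptions: the export has the same column layout as the two lines quoted above (packet number, time in seconds since the start of the capture, source, destination, protocol, info), and the function name count_resets and the 10-second bucket size are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout of one exported line (same as the samples in this post):
#   <no> <time> <source> <destination> TCP <info ... [FLAGS] ...>
LINE_RE = re.compile(
    r"^\s*(\d+)\s+(\d+\.\d+)\s+(\S+)\s+(\S+)\s+TCP\s+(.*)$"
)
# Matches a flag list such as [RST] or [RST, ACK] in the info column.
RST_RE = re.compile(r"\[(?:[A-Z]+, )*RST(?:, [A-Z]+)*\]")


def count_resets(lines, bucket_seconds=10):
    """Count RST packets per time bucket and per source address.

    Returns two Counters: one keyed by the start of each time bucket
    (in seconds), one keyed by the sender's address.
    """
    per_bucket = Counter()
    per_source = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a TCP summary line in the assumed layout
        _, time_text, source, _, info = match.groups()
        if not RST_RE.search(info):
            continue  # TCP packet, but not a reset
        bucket = int(float(time_text) // bucket_seconds) * bucket_seconds
        per_bucket[bucket] += 1
        per_source[source] += 1
    return per_bucket, per_source


if __name__ == "__main__":
    sample = [
        "26234 51.937961 80.239.230.169 192.168.2.35 TCP http > 39637 [RST] Seq=1 Win=0 Len=0",
        "24879 44.966533 192.168.168.16 192.168.2.22 TCP ndl-aas > 57333 [RST] Seq=5280 Win=0 Len=0",
    ]
    buckets, sources = count_resets(sample)
    for start in sorted(buckets):
        print(f"{start:>6}s  {buckets[start]} resets")
    for address, count in sources.most_common():
        print(f"{address}  {count} resets")
```

If the busiest buckets line up with the times users report errors, that would at least suggest the resets and the outages are related, and the per-source counts would show whether the proxy or the outside servers send most of them. It would not say what triggers the resets in the first place.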
Sorry, that was a lot of text, especially for a first post on this forum. I hope it made sense, but as you may imagine, I'm a little frustrated and confused right now and have a lot of upset users knocking on my door asking when the network will be back to normal.
Hoping for some help,
André