I've posted this problem in the CentOS forum at www.centos.org, but I thought I would solicit input from the greater Linux community, who might have noted this problem and who don't commonly visit the CentOS forum.
We're using multi-CPU, multi-core servers from Aberdeen, which are basically repackaged Supermicro servers.
2.6.18-92.el5 #1 SMP Tue Jun 10 18:49:47 EDT 2008 i686 i686 i386 GNU/Linux
rpm -qa kernel\* | sort
This problem has been noted with UDP sockets. We're not sure if it also happens with TCP sockets.
Occasionally, when a non-blocking UDP socket is polled using the select() function with a zeroed timeval structure, the select() call stalls for just over 70 minutes instead of returning immediately. We wish to respond quickly when packets appear spontaneously on this socket, but the remote end very, very rarely transmits a packet spontaneously. It is common for no packet to arrive on this socket for many hours.
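To make the call pattern concrete, here is a minimal sketch of the kind of poll we mean. It is written in Python for brevity and is not taken from our actual code. `select.select()` with a timeout of 0 corresponds to select() with a zeroed timeval. The helper name `poll_once`, the loopback address, and the 1-second stall threshold are assumptions for illustration only.

```python
import select
import socket
import time

STALL_THRESHOLD_S = 1.0  # assumed threshold: a zero-timeout poll should never take this long


def poll_once(sock):
    """Do one zero-timeout select() and return (readable, elapsed_seconds)."""
    start = time.monotonic()
    # A timeout of 0 asks select() to poll and return immediately.
    readable, _, _ = select.select([sock], [], [], 0)
    elapsed = time.monotonic() - start
    return bool(readable), elapsed


# Non-blocking UDP socket on an ephemeral loopback port (illustrative only).
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))
sock.setblocking(False)

ready, elapsed = poll_once(sock)
if elapsed > STALL_THRESHOLD_S:
    print(f"select() stalled for {elapsed:.1f} s")
else:
    print(f"select() returned promptly, readable={ready}")
```

Timing each call like this is also how one could log a stall when it happens, since the elapsed time is measured around the select() call itself.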
We find it quite a coincidence that 0xFFFFFFFF microseconds equals approximately 71 minutes, 35 seconds. We hypothesize that the usec component of the zeroed timeval structure provided to select() is occasionally being decremented to 0xFFFFFFFF (or the equivalent in "jiffies") before the OS tests whether it is equal to zero. Thus, we incur a 71 minute, 35 second timeout. We poll this socket at quite a high rate (e.g. 50 Hz) and this problem might occur once or twice over 12 hours. It is apparently quite sensitive to precisely when the select() function is called in relation to whatever clocks drive the OS to decrement socket timeouts.
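The arithmetic behind that figure can be checked with a few lines of Python. This only illustrates the numbers; the 32-bit unsigned wraparound is our hypothesis about what happens, not something we have confirmed in the kernel source.

```python
USEC_PER_MINUTE = 60 * 1_000_000

# Hypothesis: a 32-bit unsigned microsecond field holding 0 is decremented by 1.
wrapped_usec = (0 - 1) & 0xFFFFFFFF

# Convert the wrapped value to minutes and seconds.
minutes, rem_usec = divmod(wrapped_usec, USEC_PER_MINUTE)
seconds = rem_usec / 1_000_000

print(hex(wrapped_usec))                  # 0xffffffff
print(f"{minutes} min {seconds:.3f} s")   # 71 min 34.967 s
```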
We have searched the Red Hat bug list, the CentOS forum, and this site and have not found any similar complaints about using select() with a zeroed timeout. Has anyone else observed this behavior? Is there a remedy that entails something other than avoiding zero timeouts or a watchdog on threads that might perform zero timeout select() calls? Our product also employs a library that may perform zero timeout select() calls, so we'd prefer an OS level solution. We didn't notice anything in the CentOS v5.3 release notes to indicate that such a problem has been recognized and addressed.
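For reference, the "avoid zero timeouts" workaround we have in mind would look roughly like the sketch below: skip select() entirely and attempt a non-blocking receive, treating "would block" as "nothing there". It is in Python for brevity, the helper name `try_recv` and the buffer size are made up for illustration, and it does nothing for the third-party library that makes its own select() calls.

```python
import socket


def try_recv(sock, bufsize=2048):
    """Return (data, addr) if a datagram is queued, else None. Never calls select()."""
    try:
        return sock.recvfrom(bufsize)
    except BlockingIOError:
        # Non-blocking socket with no datagram queued.
        return None


# Receiver and sender on loopback (illustrative addresses only).
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.setblocking(False)
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

print(try_recv(rx))  # nothing has been sent yet
```

Since the receive itself is non-blocking, no timeout value is handed to the OS at all, which is why this sidesteps the suspected timeval problem rather than fixing it.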
I am not an OS level programmer, so I don't have a good feel for whether this problem is due to a unique interaction between CentOS v5.2 and hardware peculiar to our Aberdeen servers. If it isn't peculiar to our hardware, I'd have thought there would already be plenty of posts about this issue online.
Despite the vast number of Linux installations, I suppose it's possible a problem such as this might go unnoticed for an extended period of time. It manifests very infrequently given the number of opportunities. And one might only recognize it happens if the socket they are polling using select() with a zeroed timeout only very, very rarely receives packet traffic. Otherwise, the select() would return due to the reception of that traffic.