Yes, I know, RHEL4 is ancient and unsupported. Still, it's what I have to work with.
I have a RHEL4 machine running the most recent 2.6.9 kernel. It mounts an NFSv3 filesystem and runs an application from that share. The application reads a number of large data files, each about 1 GB in size.
After a reboot, the machine runs as anyone would expect: my application reads in the required files and processes them in a timely fashion. However, at some random point days or weeks later, performance goes through the floor.
When the problem occurs (and beforehand), I can see that the kernel has plenty of data in its cache, with the cache usually at least twice the size of the data being read from NFS. However, iostat shows that NFS is constantly being hit for *small* reads while the application is running. To me, it appears the kernel has not cached the entire input data file.
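A rough sketch of the cache-size comparison described above, in Python. It assumes the "Cached" line of /proc/meminfo is the figure of interest, and the function names are made up for illustration. It is written for Python 3; RHEL4 ships a much older Python, so it would need adapting to run on the affected box. Note that "Cached" is a system-wide total and says nothing about which files occupy the cache, which is why a large value can coexist with the small NFS reads.

```python
import os
import sys


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only well-formed "Name:   12345 kB" style lines.
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def cache_covers(meminfo_text, data_bytes):
    """Return True if the global 'Cached' total (in kB) is at least data_bytes."""
    cached_kb = parse_meminfo(meminfo_text).get("Cached", 0)
    return cached_kb * 1024 >= data_bytes


if __name__ == "__main__":
    # Usage (hypothetical): python cachecheck.py /path/to/datafile [more files...]
    total = sum(os.path.getsize(p) for p in sys.argv[1:])
    with open("/proc/meminfo") as f:
        text = f.read()
    print("data bytes:", total)
    print("cache large enough in principle:", cache_covers(text, total))
```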
While my program was running, I verified that the input was not entirely cached by timing a read of the entire data set to /dev/null. The first read takes a long time (as not all the data is cached). Subsequent reads are quick, as expected, since all the data is then fully cached.
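For anyone wanting to repeat that check, here is a minimal sketch of the same idea in Python: read each file from start to finish twice, discard the data, and compare the elapsed times. This is not the exact command used for the measurement above. The function names and the 1 MiB chunk size are arbitrary choices for illustration, and it assumes Python 3, which would have to be installed separately or the code adapted on RHEL4.

```python
import sys
import time


def timed_read(path, chunk_size=1024 * 1024):
    """Read the whole file, discarding the data; return (bytes_read, seconds)."""
    total = 0
    start = time.monotonic()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.monotonic() - start


def compare_passes(path):
    """Read the file twice in a row and return both (bytes, seconds) results."""
    first = timed_read(path)
    second = timed_read(path)
    return first, second


if __name__ == "__main__":
    # Usage (hypothetical): python readtime.py /path/to/datafile [more files...]
    for path in sys.argv[1:]:
        (n1, t1), (n2, t2) = compare_passes(path)
        print("%s: pass 1 %d bytes in %.2fs, pass 2 %d bytes in %.2fs"
              % (path, n1, t1, n2, t2))
```

A second pass that is much faster than the first suggests that part of the file was not in the cache when the first pass started.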
For whatever reason, once the problem occurs, it appears that this old kernel is either flushing portions of the input data files from its cache or not caching them in the first place. Again, a reboot restores performance and proper caching of the input data files.
I note that more recent kernels (and NFS client/server versions) have additional options for controlling caching. There appears to be limited room to move with RHEL4, though.
Can anyone make any suggestions, or comment on specific bugs that may be causing this behavior?