Server becomes unresponsive
Hi,
I am facing a critical issue where the server becomes unresponsive, and sometimes I can't even log in over SSH. I am running a 64-bit 2.6.39.1 kernel on a multi-core, multi-processor machine. I never had this kind of problem when I was using the 2.6.31 kernel. Most of the time, the problem occurs on systems with an uptime of more than three to four months.
Below are a few observations from when the system becomes unresponsive:
• The load average climbs very high, at times to around 13000.
• Any newly created process or thread, including cron jobs, gets stuck and never exits.
• Normally I expect around 25-30 processes on the system, but in this state I see more than 1000.
• A few of the processes are in uninterruptible sleep (disk sleep), and a lot of processes show as runnable but are not actually running for some reason. A few CPUs are completely idle.
• All disk partitions are accessible. No NFS is involved.
• Processes can't be killed with kill -9 or pkill -9.
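In case it helps anyone compare numbers, here is a minimal sketch of how per-state process counts like the ones above can be collected. It assumes a Linux /proc filesystem and reads the state field from /proc/<pid>/stat; the function names parse_state and count_states are hypothetical ones I made up for illustration, and it counts processes only, not individual threads.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter state (R, S, D, Z, ...) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' rather than on whitespace.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    return rest[0]


def count_states(proc_root="/proc"):
    """Count processes by state. Assumes a Linux-style /proc layout."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                counts[parse_state(f.read())] += 1
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the line was unreadable
    return counts


if __name__ == "__main__":
    for state, n in sorted(count_states().items()):
        print(state, n)
```

A large count under D (uninterruptible sleep) alongside a large count under R would match the symptoms listed above.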
The strange thing is that everything recovers after I restart one particular user process. It looks as if some kind of deadlock inside the kernel is released once that process goes down.
Any help or direction on this issue is appreciated.