RT process stuck as runnable 'R', but never executes; migration thread high cputime
I have a hard-to-reproduce issue (seen a couple of times in a month) where my program appears to be hung: its cputime is not increasing, yet its threads are in the runnable state and never get to run. According to vmstat the CPUs are 99% idle. The load is 5, which equals the number of threads in my program that ps shows in the R state, and there are no processes on the system in the 'D' state. The other major oddity is that one of the migration threads has a cputime almost equal to the uptime of the system. Typically migration threads accumulate cputime on the order of seconds across hundreds of days of uptime, but in this case ps reports DAYS of cputime for the migration thread.
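To make the symptom above easier to capture the next time it happens, here is a minimal sketch that narrows a per-thread `ps` listing down to threads in the R state, along with the CPU each last ran on, its scheduling class, and its accumulated cputime. It assumes procps `ps`; the function name `runnable_threads` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): keep the header line and
# every thread whose state letter is R.
# Expects "PID TID STAT PSR CLS RTPRIO TIME COMMAND" columns on stdin.
runnable_threads() {
    # STAT is column 3; its first character is the state (R, S, D, ...).
    awk 'NR == 1 || substr($3, 1, 1) == "R"'
}

# On a live system (assumes procps ps):
#   ps -eLo pid,tid,stat,psr,cls,rtprio,time,comm | runnable_threads
```

The `ps` process itself will normally show up as R in such a listing, so one extra runnable entry is expected.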
The last time this happened, I went around saving as much /proc/X information as I could into logs for later reference before I had to reboot the box to get it running normally again (in this state, my program does not heed a kill -9). Does anyone have any idea what could cause this? I am not sure whether this is a scheduling bug or a bug in my program (the likelier case). I have a wealth of logs to look through if anyone can suggest something specific to look for. Here are some snippets of basic logs:
Code:
free:
Code:
ps afx -F
This is running on a 2-core box with Linux kernel 2.6.37 with the Gentoo patches. Thanks in advance for any ideas/suggestions! -John
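For anyone wanting to collect the same kind of /proc/X information described in the post, a rough sketch follows. It copies a few per-thread files for one PID into a directory. The function name `snapshot_proc` and its optional third argument (an alternate proc root, handy for dry runs) are assumptions for illustration. `sched` is only present on kernels built with CONFIG_SCHED_DEBUG, and `stack` needs CONFIG_STACKTRACE and root, so files that cannot be read are simply skipped.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post).
# Usage: snapshot_proc PID OUTDIR [PROCROOT]
snapshot_proc() {
    pid=$1
    out=$2
    root=${3:-/proc}
    for task in "$root/$pid"/task/*; do
        [ -d "$task" ] || continue
        tid=${task##*/}              # thread ID is the last path component
        mkdir -p "$out/$tid"
        for f in stat status sched wchan stack; do
            # Drop the output file again if the source could not be read.
            cat "$task/$f" > "$out/$tid/$f" 2>/dev/null || rm -f "$out/$tid/$f"
        done
    done
}

# Example: snapshot_proc 1234 /var/tmp/hang-$(date +%s)
```

Taking two snapshots a few seconds apart and comparing the `stat` files would show whether any thread's CPU counters are moving at all.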
How many cores? Is hyperthreading turned on? Let's see the output of this:
Code:
grep -iE "processor|core|sibling" /proc/cpuinfo
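As a rough guide to reading that output on x86: "siblings" is the number of logical CPUs per physical package and "cpu cores" is the number of physical cores per package, so siblings greater than cpu cores suggests hyperthreading is active. A small sketch of that comparison follows; the function name `ht_status` is made up, and it assumes the x86 /proc/cpuinfo field names.

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread): reads /proc/cpuinfo text
# on stdin and prints "hyperthreading: on", "off" or "unknown".
ht_status() {
    awk -F: '
        /^siblings/  { sib = $2 + 0 }     # logical CPUs per package
        /^cpu cores/ { cores = $2 + 0 }   # physical cores per package
        END {
            if (!sib || !cores)   print "hyperthreading: unknown"
            else if (sib > cores) print "hyperthreading: on"
            else                  print "hyperthreading: off"
        }'
}

# Example: ht_status < /proc/cpuinfo
```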
Code:
# grep -iE "processor|core|sibling" /proc/cpuinfo
Figured as much from your initial post. Have you tried disabling hyperthreading (as a test)? If you don't want to fiddle with the BIOS, just boot Linux with maxcpus=1.
Maybe also try "maxcpus=0": this also uses only one "core", but additionally disables the SMP code. It may not be a (good) long-term solution, but it may help isolate the problem.
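After rebooting with one of those parameters, it is worth confirming that it actually took effect. A minimal sketch, assuming sysfs is mounted and exposes /sys/devices/system/cpu/online; the function name `count_cpus` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread): count the CPUs named in
# a sysfs-style CPU list such as "0", "0-3" or "0,2-3".
count_cpus() {
    echo "$1" | tr ',' '\n' |
        awk -F- '{ n += (NF == 2 ? $2 - $1 + 1 : 1) } END { print n + 0 }'
}

# On the test boot:
#   cat /proc/cmdline                                    # was maxcpus= passed?
#   count_cpus "$(cat /sys/devices/system/cpu/online)"   # should print 1
```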
Well, I hadn't considered disabling hyperthreading. The trouble is that with such a long time between failures, I won't know whether it had any effect for over a month.
Hypothetically, if I did turn off hyperthreading and never saw the problem again, what would be the next thing to look at? I wouldn't be terribly surprised if that fixed the issue, but I'm not sure what to do past that, and I would rather not have my box run in this mode forever. Thanks, John