LinuxQuestions.org - RT process stuck as runnable 'R', but never executes; migration thread high cputime

I have a hard to reproduce (seen a couple times in a month) issue where my program seems like it is hung (not increasing in cputime), however, it is in the runnable state and never gets to run. The cpus are 99% idle according to vmstat and the load is 5 (which is equal to the number of threads in my program that are in the R state) according to ps, and there are no processes on the system in the 'D' state. The other major oddity is that one of the migration threads has a cputime usage almost equal to the uptime of the system. Typically migration threads have a cputime on the order of seconds across hundreds of days of uptime, but in this case the migration thread has DAYS of cputime according to ps.

The last time this happened, I went around saving as much /proc/X information that I could into logs for referring to later, before I had to reboot the box to get it running back to normal. (because in this state, a kill -9 is not heeded by my program)

Does anyone have any idea what could cause this? I am not sure if this is a scheduling bug or a bug in my program (the likelier case).
I have a wealth of logs to look through if anyone can suggest something specific to look for.

Here are some snippets of basic logs:

Code:

free:

            total      used      free    shared    buffers    cached

Mem:      1030200    871436    158764          0        60    513772

-/+ buffers/cache:    357604    672596

Swap:      1023932      23928    1000004



uptime

 14:27:31 up 18 days, 12:51,  3 users,  load average: 5.12, 5.06, 5.06



ps -C myProg -L -o pid,stime,rtprio,pcpu,stat,psr,nlwp,lwp,vsz,rss,time,cmd

  PID STIME RTPRIO %CPU STAT PSR NLWP  LWP    VSZ  RSS    TIME CMD

11964 Jul14    10  0.0 RLl    1  10 11964 163584 161344 00:00:20 myProg

11964 Jul14    10  0.0 SLl    0  10 12006 163584 161344 00:01:03 myProg 

11964 Jul14    10  0.0 RLl    1  10 12008 163584 161344 00:00:00 myProg

11964 Jul14    10  0.0 SLl    0  10 12009 163584 161344 00:00:00 myProg

11964 Jul14    10  0.0 SLl    0  10 12010 163584 161344 00:00:00 myProg

11964 Jul14    11 10.3 RLl    1  10 12012 163584 161344 09:44:00 myProg

11964 Jul14    11  5.8 RLl    1  10 12017 163584 161344 05:33:30 myProg

11964 Jul14    13  6.3 SLl    0  10 12018 163584 161344 06:01:16 myProg

11964 Jul14    12  0.0 RLl    1  10 12019 163584 161344 00:00:06 myProg

11964 Jul14    12  0.0 SLl    0  10 12020 163584 161344 00:00:25 myProg

Note that these are real time processes, and as such all memory is locked via mlockall(current | future) and the threads have various real time priorities.

Code:

ps afx -F    

UID        PID  PPID  C    SZ  RSS PSR STIME TTY      STAT  TIME CMD

root        2    0  0    0    0  0 Jun30 ?        S      0:00 [kthreadd]

root        3    2  0    0    0  0 Jun30 ?        S      0:10  \_ [ksoftirqd/0]

root        6    2  0    0    0  0 Jun30 ?        S      0:00  \_ [migration/0]

root        7    2 96    0    0  1 Jun30 ?        S    25696:02  \_ [migration/1]

root        8    2  0    0    0  1 Jun30 ?        S      0:00  \_ [kworker/1:0]

root        9    2  0    0    0  1 Jun30 ?        S      0:10  \_ [ksoftirqd/1]

…

root      6194  6155  0  409  208  1 Jun30 ?        S      0:00  \_ supervise myProg

root    11964  6194 22 40896 161344  1 Jul14 ?        RLl  1280:41  |  \_ /usr/local/bin/myProg





vmstat 1

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----

 r  b  swpd  free  buff  cache  si  so    bi    bo  in  cs us sy id wa

 5  0  23908 187332    60 488500    0    0    13    9    7  11 15  6 78  1

 5  0  23908 187332    60 488500    0    0    0    0  877  31  0  1 99  0

 5  0  23908 187332    60 488500    0    0    0    0  881  31  0  0 100  0

 5  0  23908 187332    60 488500    0    0    0    40  894  59  0  2 99  0

 5  0  23908 187332    60 488500    0    0    0    0  902  31  0  0 100  0

There was nothing in the /var/log/kern.log around this time.
This is running on a 2 core box with the Linux Kernel 2.6.37 with the Gentoo patches.

Thanks in advance for any ideas/suggestions!

-John