LinuxQuestions.org
Old 07-28-2011, 12:53 PM   #1
senorsmith
LQ Newbie
 
Registered: Jul 2011
Posts: 3

Rep: Reputation: Disabled
RT process stuck as runnable 'R', but never executes; migration thread high cputime


I have a hard-to-reproduce issue (seen a couple of times in a month) where my program appears hung: its cputime stops increasing, yet its threads sit in the runnable state and never get to run. The CPUs are 99% idle according to vmstat, the load is 5 (equal to the number of threads in my program that ps shows in the R state), and there are no processes on the system in the 'D' state. The other major oddity is that one of the migration threads has a cputime almost equal to the uptime of the system. Typically migration threads accumulate cputime on the order of seconds across hundreds of days of uptime, but in this case the migration thread has DAYS of cputime according to ps.
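
The comparison described above (runnable threads versus load) can be repeated with a small sketch like the one below. The function names `count_runnable` and `load_1min` are made up for illustration. It relies on the Linux load average counting runnable tasks plus tasks in uninterruptible sleep, so with nothing in the 'D' state the two numbers should roughly agree.

```shell
# Count threads whose STAT value starts with R (runnable).
# stdin: one STAT value per line, e.g. from `ps -C myProg -L -o stat=`
count_runnable() {
    awk '$1 ~ /^R/ { n++ } END { print n + 0 }'
}

# Print the 1-minute load average.
# stdin: the contents of /proc/loadavg (the first field is the 1-minute value)
load_1min() {
    awk '{ print $1 }'
}

# Example use on a live system (not executed here):
#   ps -C myProg -L -o stat= | count_runnable
#   load_1min < /proc/loadavg
```

If the runnable count matches the load while vmstat reports the CPUs idle, the threads are queued to run but are not being picked by the scheduler.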

The last time this happened, I saved as much /proc/X information as I could into logs for later reference before I had to reboot the box to get it back to normal (in this state, my program does not respond to kill -9).
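
For anyone wanting to capture similar data, here is a minimal sketch of a per-thread /proc snapshot. It is not the script used for the logs in this thread; the function name `snapshot_threads` and the output layout are assumptions. The `sched` and `stack` files only exist when the kernel was built with the matching debug options, and `stack` usually needs root, so unreadable files are simply skipped.

```shell
# Copy selected per-thread /proc files for one process into a directory.
#   $1 = PID, $2 = output directory, $3 = proc root (optional, defaults to /proc)
snapshot_threads() {
    pid=$1
    out=$2
    root=${3:-/proc}
    mkdir -p "$out" || return 1
    for task in "$root/$pid"/task/*; do
        [ -d "$task" ] || continue
        tid=${task##*/}                 # thread ID is the directory name
        for f in stat status sched wchan stack; do
            # Skip files that are missing or not readable on this kernel.
            if [ -r "$task/$f" ]; then
                cat "$task/$f" > "$out/$tid.$f" 2>/dev/null || true
            fi
        done
    done
    return 0
}

# Example use (not executed here):
#   snapshot_threads 11964 "/tmp/myProg-$(date +%s)"
```

Running it a few seconds apart gives two snapshots to diff, which shows whether any counters in `stat` or `sched` are still moving for the stuck threads.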

Does anyone have any idea what could cause this? I am not sure if this is a scheduling bug or a bug in my program (the likelier case).
I have a wealth of logs to look through if anyone can suggest something specific to look for.


Here are some snippets of basic logs:

Code:
free:
             total       used       free     shared    buffers     cached
Mem:       1030200     871436     158764          0         60     513772
-/+ buffers/cache:     357604     672596
Swap:      1023932      23928    1000004

uptime
 14:27:31 up 18 days, 12:51,  3 users,  load average: 5.12, 5.06, 5.06

ps -C myProg -L -o pid,stime,rtprio,pcpu,stat,psr,nlwp,lwp,vsz,rss,time,cmd
  PID STIME RTPRIO %CPU STAT PSR NLWP   LWP    VSZ   RSS     TIME CMD
11964 Jul14     10  0.0 RLl    1   10 11964 163584 161344 00:00:20 myProg
11964 Jul14     10  0.0 SLl    0   10 12006 163584 161344 00:01:03 myProg 
11964 Jul14     10  0.0 RLl    1   10 12008 163584 161344 00:00:00 myProg
11964 Jul14     10  0.0 SLl    0   10 12009 163584 161344 00:00:00 myProg
11964 Jul14     10  0.0 SLl    0   10 12010 163584 161344 00:00:00 myProg
11964 Jul14     11 10.3 RLl    1   10 12012 163584 161344 09:44:00 myProg
11964 Jul14     11  5.8 RLl    1   10 12017 163584 161344 05:33:30 myProg
11964 Jul14     13  6.3 SLl    0   10 12018 163584 161344 06:01:16 myProg
11964 Jul14     12  0.0 RLl    1   10 12019 163584 161344 00:00:06 myProg
11964 Jul14     12  0.0 SLl    0   10 12020 163584 161344 00:00:25 myProg
Note that these are real-time processes: all memory is locked via mlockall(MCL_CURRENT | MCL_FUTURE), and the threads have various real-time priorities.
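
Both properties can be checked from outside the program. The sketch below is only an illustration; `locked_kb` is a made-up helper name. It reads the `VmLck` field of /proc/PID/status, which reports locked memory in kB, and the usage comment shows the `cls` and `rtprio` columns of ps for the scheduling class and priority of each thread.

```shell
# Print the amount of locked memory in kB.
# stdin: the contents of /proc/<pid>/status
locked_kb() {
    awk '$1 == "VmLck:" { print $2 }'
}

# Example use (not executed here):
#   locked_kb < /proc/11964/status
#   ps -C myProg -L -o lwp,cls,rtprio     # cls is FF or RR for real-time threads
```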
Code:
ps afx -F    
UID        PID  PPID  C    SZ   RSS PSR STIME TTY      STAT   TIME CMD
root         2     0  0     0     0   0 Jun30 ?        S      0:00 [kthreadd]
root         3     2  0     0     0   0 Jun30 ?        S      0:10  \_ [ksoftirqd/0]
root         6     2  0     0     0   0 Jun30 ?        S      0:00  \_ [migration/0]
root         7     2 96     0     0   1 Jun30 ?        S    25696:02  \_ [migration/1]
root         8     2  0     0     0   1 Jun30 ?        S      0:00  \_ [kworker/1:0]
root         9     2  0     0     0   1 Jun30 ?        S      0:10  \_ [ksoftirqd/1]
…
root      6194  6155  0   409   208   1 Jun30 ?        S      0:00  \_ supervise myProg
root     11964  6194 22 40896 161344  1 Jul14 ?        RLl  1280:41  |   \_ /usr/local/bin/myProg


vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 5  0  23908 187332     60 488500    0    0    13     9    7   11 15  6 78  1
 5  0  23908 187332     60 488500    0    0     0     0  877   31  0  1 99  0
 5  0  23908 187332     60 488500    0    0     0     0  881   31  0  0 100  0
 5  0  23908 187332     60 488500    0    0     0    40  894   59  0  2 99  0
 5  0  23908 187332     60 488500    0    0     0     0  902   31  0  0 100  0
There was nothing in the /var/log/kern.log around this time.
This is running on a 2-core box with Linux kernel 2.6.37 with the Gentoo patches.
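
To put the migration thread's cputime next to the uptime, the arithmetic can be sketched as below. It assumes the TIME column of the BSD-style `ps afx` listing is minutes:seconds, and the helper names `bsd_time_to_seconds` and `percent_of` are made up for illustration. The inputs are the 25696:02 shown for [migration/1] and the uptime of 18 days, 12:51.

```shell
# Convert a BSD-style ps TIME value ("MINUTES:SECONDS") to seconds.
bsd_time_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}

# Print $1 as a percentage of $2, with one decimal place.
percent_of() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / b }'
}

cpu_s=$(bsd_time_to_seconds 25696:02)             # 1541762 seconds, about 17.8 days
up_s=$(( 18 * 86400 + 12 * 3600 + 51 * 60 ))      # 1601460 seconds of uptime
percent_of "$cpu_s" "$up_s"                       # prints 96.3
```

That ratio lines up with the 96 in the C column for [migration/1], so the thread has been accounted as running on CPU 1 for nearly the whole uptime.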

Thanks in advance for any ideas/suggestions!

-John
 
Old 07-28-2011, 07:40 PM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,103

Rep: Reputation: 4117
How many cores? Hyperthreading turned on? Let's see the output of this:
Code:
grep -iE "processor|core|sibling" /proc/cpuinfo
 
Old 07-29-2011, 09:51 AM   #3
senorsmith
LQ Newbie
 
Registered: Jul 2011
Posts: 3

Original Poster
Rep: Reputation: Disabled
Code:
# grep -iE "processor|core|sibling" /proc/cpuinfo
processor	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
processor	: 1
siblings	: 2
core id		: 0
cpu cores	: 1
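
Reading that output: `siblings` is the number of logical CPUs in the package and `cpu cores` is the number of physical cores, so 2 siblings on 1 core means hyperthreading is on. A small sketch of that check follows; `smt_summary` is a made-up name, and it only looks at the first package it finds.

```shell
# Report whether logical CPUs outnumber physical cores.
# stdin: the contents of /proc/cpuinfo
smt_summary() {
    awk -F: '
        /^siblings/  && s == "" { s = $2 + 0 }   # logical CPUs per package
        /^cpu cores/ && c == "" { c = $2 + 0 }   # physical cores per package
        END {
            if (s == "" || c == "") { print "unknown"; exit }
            if (s > c) print "smt: " s " logical CPUs on " c " core(s)"
            else       print "no smt: " s " logical CPUs on " c " core(s)"
        }'
}

# Example use (not executed here):
#   smt_summary < /proc/cpuinfo
```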
 
Old 07-29-2011, 06:12 PM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,103

Rep: Reputation: 4117
Figured as much from your initial post. Have you tried disabling hyperthreading (as a test)? If you don't want to fiddle with the BIOS, just boot Linux with maxcpus=1.
Maybe also try "maxcpus=0": this also uses only 1 "core", but also disables the SMP code.

May not be a (good) long-term solution, but may help isolate the problem.
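
After rebooting with one of those options, it is worth confirming that it took effect before starting the long wait for a failure. A minimal sketch, with made-up helper names `maxcpus_from_cmdline` and `online_cpus`:

```shell
# Print the maxcpus value from the kernel command line, or nothing if absent.
# stdin: the contents of /proc/cmdline
maxcpus_from_cmdline() {
    tr ' ' '\n' | sed -n 's/^maxcpus=//p'
}

# Count the logical CPUs the running kernel brought up.
# stdin: the contents of /proc/cpuinfo
online_cpus() {
    grep -c '^processor'
}

# Example use (not executed here):
#   maxcpus_from_cmdline < /proc/cmdline
#   online_cpus < /proc/cpuinfo
```

With maxcpus=1 the second command should report 1, and [migration/1] should no longer appear in the ps listing.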
 
Old 07-30-2011, 02:26 PM   #5
senorsmith
LQ Newbie
 
Registered: Jul 2011
Posts: 3

Original Poster
Rep: Reputation: Disabled
Well, I hadn't considered disabling hyperthreading. The trouble is that with such a long time between failures, I won't know whether it had any effect for over a month.
Hypothetically, if I did turn off hyperthreading and I never saw the problem again, what would be the next thing to look at?
I wouldn't be terribly surprised if that did fix the issue, but then I'm not sure what to do past that, and I would rather not have my box run in this mode forever.

Thanks,

John
 
  


All times are GMT -5.