Multithreaded process pausing but not deadlocking or crashing

writejus1 · 09-02-2005, 11:20 AM

Hi,

I am writing a heavily multithreaded Linux program (20-60 threads) on Fedora Core 2. I am using glibc version 2.3.3-27. In addition, I am using the boost (boost.org) libraries (version 1.32.0) for my threading and locking.

My problem is that the process will suddenly cease activity for random lengths of time (1 sec to minutes). However, it never crashes or produces incorrect results. Also, I do not think that it is deadlocking because it always resumes its activity.

I have done some profiling of the locks, and it shows very strange behavior. For instance, threads will block for long lengths of time (the length of the inactivity) while no thread is holding the corresponding lock more than fractions of a second. When I explored this further, it appears that thread A is blocking on a mutex while thread B holds it. I am using boost::recursive_mutex::scoped_lock objects for the locking. The weird thing is that thread B pauses at the very end of the lock's scope, as though the attempt to unlock the mutex is not waking thread A and descheduling thread B for a long time.

I created a test program that spawns 30 threads that just repeatedly lock these boost scoped locks and yield. This program, too, shows the same pauses (again without crashing or deadlocking), though less frequently (I suspect because the locking pattern is different from the one in my program).
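For anyone who wants to try reproducing this, here is a minimal sketch of that kind of test, not the actual program. It assumes a C++11 compiler and uses std::recursive_mutex with std::lock_guard as stand-ins for boost::recursive_mutex::scoped_lock. The names ContentionResult and run_contention_test are made up for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical result type: not part of Boost or the original test program.
struct ContentionResult {
    long long counter;      // total increments performed under the lock
    long long max_wait_us;  // longest time any thread waited to acquire the lock
};

// Spawns num_threads threads that each lock, increment, unlock and yield
// `iterations` times, recording the worst acquisition delay that was seen.
ContentionResult run_contention_test(int num_threads, int iterations) {
    std::recursive_mutex mtx;  // assumption: stands in for boost::recursive_mutex
    long long counter = 0;
    std::vector<long long> worst(num_threads, 0);  // one slot per thread, no sharing
    std::vector<std::thread> threads;

    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            for (int i = 0; i < iterations; ++i) {
                auto start = std::chrono::steady_clock::now();
                {
                    std::lock_guard<std::recursive_mutex> lock(mtx);
                    long long waited =
                        std::chrono::duration_cast<std::chrono::microseconds>(
                            std::chrono::steady_clock::now() - start).count();
                    worst[t] = std::max(worst[t], waited);
                    ++counter;
                }  // lock released here, at the end of the scope
                std::this_thread::yield();
            }
        });
    }
    for (auto& th : threads) th.join();
    return {counter, *std::max_element(worst.begin(), worst.end())};
}
```

Calling run_contention_test(30, 100000) and printing max_wait_us would show whether any single acquisition stalled for seconds rather than microseconds. The thread and iteration counts are arbitrary.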

As far as I can tell, the boost libraries don't do much more than provide wrappers for pthread functionality, so I'm not sure whether this issue is a boost problem, a kernel problem, or my problem.

I was wondering if anyone has experienced similar behavior on Linux, or in using these boost libraries? If anyone could offer some insight or guidance, it would be much appreciated. Thanks!

(Also, please let me know if there is a more appropriate forum for this issue).

Matt

jailbait · 09-02-2005, 03:54 PM

"threads will block for long lengths of time (the length of the inactivity) while no thread is holding the corresponding lock more than fractions of a second. When I explored this further, it appears that thread A is blocking on a mutex while thread B holds it. I am using boost::recursive_mutex::scoped_lock objects for the locking. The weird thing is that thread B pauses at the very end of the lock's scope, as though the attempt to unlock the mutex is not waking thread A and descheduling thread B for a long time. "

I interpret what you said to indicate that you have more than one mutex being contended for. If each thread is contending for several mutexes simultaneously you can get interlocking conditions. To ensure that you do not get interlocks which result in deadlocks you should follow one of the following two rules.

1. Any thread that locks a mutex locks every mutex that it needs, all at the same time. This guarantees that you have no deadlocks but it can be a performance killer.

2. All threads that lock multiple mutexes always do so in the same order. For example, if several threads lock 4 different mutexes (say a, b, k and j), they all lock them in the same order (a, j, b, k for example).

You can also have a mixture of rules 1 and 2. You could set the rule that all threads lock on a and then later they lock on b, j, and k simultaneously.
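The two rules could look roughly like the sketch below. This is only an illustration, assuming a C++11 compiler and standard library mutexes rather than the boost ones. Shared, update_all_at_once, update_in_fixed_order and run_ordering_demo are hypothetical names.

```cpp
#include <mutex>
#include <thread>

// Hypothetical shared state guarded by two mutexes.
struct Shared {
    std::mutex a, b;
    int guarded_by_a = 0;
    int guarded_by_b = 0;
};

// Rule 1: take every needed mutex in one step. std::lock acquires both
// without deadlocking, whatever order other threads use.
void update_all_at_once(Shared& s) {
    std::lock(s.a, s.b);
    std::lock_guard<std::mutex> la(s.a, std::adopt_lock);  // adopt: already locked
    std::lock_guard<std::mutex> lb(s.b, std::adopt_lock);
    ++s.guarded_by_a;
    ++s.guarded_by_b;
}

// Rule 2: every thread that needs both always takes a before b.
void update_in_fixed_order(Shared& s) {
    std::lock_guard<std::mutex> la(s.a);
    std::lock_guard<std::mutex> lb(s.b);
    ++s.guarded_by_a;
    ++s.guarded_by_b;
}

// Runs one thread per rule and returns the sum of both counters.
int run_ordering_demo(int iterations) {
    Shared s;
    std::thread t1([&s, iterations] {
        for (int i = 0; i < iterations; ++i) update_all_at_once(s);
    });
    std::thread t2([&s, iterations] {
        for (int i = 0; i < iterations; ++i) update_in_fixed_order(s);
    });
    t1.join();
    t2.join();
    return s.guarded_by_a + s.guarded_by_b;
}
```

With two threads each doing two increments per iteration, run_ordering_demo(1000) should return 4000 if no update is lost and nothing deadlocks.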

But deadlocks are not your problem. Inexplicably long waits are your problem. I suggest that you extend your analysis of lock combinations to multiple mutexes being locked by multiple threads. While you may not be violating my two anti-deadlock rules you may be holding mutexes locked longer than you need to.
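As a rough illustration of holding a mutex only as long as needed, the sketch below does the slow work before taking the lock. It assumes a C++11 compiler. Log, format_entry and append_narrow are hypothetical names, not anything from your program.

```cpp
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical shared log protected by one mutex.
struct Log {
    std::mutex mtx;
    std::vector<std::string> lines;
};

// Stand-in for work that does not touch shared data.
std::string format_entry(int id) {
    return "entry " + std::to_string(id);
}

// The formatting happens outside the lock; the mutex is held only for the
// push_back, so other threads wait as briefly as possible.
void append_narrow(Log& log, int id) {
    std::string line = format_entry(id);  // no lock held here
    std::lock_guard<std::mutex> lock(log.mtx);
    log.lines.push_back(std::move(line));
}
```

If format_entry were called inside the locked region instead, every other thread would be blocked for the whole formatting time as well.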

------------------------------------
Steve Stites

writejus1 · 09-15-2005, 01:54 PM

In case anyone is interested, Fedora Core 2 was the problem. We switched our OS to run a Rocks cluster, and everything works beautifully now. Quite strangely, Fedora Core 2 "forgets" about threads that want to run. If you 'ps' a seemingly stalled process using the threads option, it will "remind" the OS that the process wants to run.

smurff · 09-16-2005, 04:57 AM

I found a similar thing on RHEL 2.1: the POSIX threads had issues, and now, after upgrading to RHEL 3, everything works well.
Regards