[SOLVED] pthread_mutex_lock returning EDEADLK for a mutex of type PTHREAD_MUTEX_RECURSIVE

nikhil_no_1 · 05-28-2010, 04:04 AM

Hi,

I am getting the following assertion in my application:

pthread_mutex_lock.c:275: __pthread_mutex_lock: Assertion `(e) != 35 || (kind != PTHREAD_MUTEX_ERRORCHECK_NP && kind != PTHREAD_MUTEX_RECURSIVE_NP)' failed.

Now, all my mutexes are of type PTHREAD_MUTEX_RECURSIVE, and as per all the man pages/tutorials, EDEADLK is to be returned for mutexes of type PTHREAD_MUTEX_ERRORCHECK ONLY.
So I really should not be hitting this assertion.

Would some kinda weird memory corruption be causing this? Or is there something more to it that I am not aware of?

I am using Linux kernel 2.6.2, glibc 2.5 on PPC.

Thanks in advance.
Nikhil
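
For reference, the documented relock behaviour described in this post can be checked with a small sketch like the one below. It is not taken from the poster's application: `second_lock_result` is a hypothetical helper name, and the program only covers the RECURSIVE and ERRORCHECK types, because relocking a normal mutex from the owning thread would simply hang.

```c
#define _POSIX_C_SOURCE 200809L  /* expose PTHREAD_MUTEX_RECURSIVE and friends */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Lock a fresh mutex of the given type twice from the calling thread and
   return the result of the second pthread_mutex_lock() call.
   Returns -1 if the setup itself fails. (Hypothetical helper.) */
int second_lock_result(int type)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int rc;

    /* Any other type could deadlock on the second lock, so refuse it. */
    assert(type == PTHREAD_MUTEX_RECURSIVE || type == PTHREAD_MUTEX_ERRORCHECK);

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_settype(&attr, type) != 0 ||
        pthread_mutex_init(&mutex, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    if (pthread_mutex_lock(&mutex) != 0) {
        pthread_mutex_destroy(&mutex);
        return -1;
    }

    rc = pthread_mutex_lock(&mutex);   /* relock by the owning thread */

    /* Release once per successful lock so the mutex can be destroyed. */
    if (rc == 0)
        pthread_mutex_unlock(&mutex);
    pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);
    return rc;
}
```

On a conforming implementation, `second_lock_result(PTHREAD_MUTEX_RECURSIVE)` should return 0 and `second_lock_result(PTHREAD_MUTEX_ERRORCHECK)` should return EDEADLK.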

JohnGraham · 05-28-2010, 05:36 AM

Quote:

Originally Posted by nikhil_no_1

Now, all my mutexes are of type PTHREAD_MUTEX_RECURSIVE, and as per all the man pages/tutorials, EDEADLK is to be returned for mutexes of type PTHREAD_MUTEX_ERRORCHECK ONLY.
So I really should not be hitting this assertion.

How does that follow? The API tells you it won't return EDEADLK, it doesn't say anything about not hitting that assertion - the function isn't returning EDEADLK (indeed, it's not returning anything).

Quote:

Originally Posted by nikhil_no_1

Would some kinda weird memory corruption be causing this? Or is there something more to it that I am not aware of?

Well, memory corruption can cause pretty much anything...

Can you reliably reproduce this behaviour? If so, try and strip away the excess parts of the code until you (a) don't get the error or (b) have a very small, simple test-case you can post for us to really help you.

nikhil_no_1 · 05-28-2010, 06:13 AM

Quote:

Originally Posted by JohnGraham

How does that follow? The API tells you it won't return EDEADLK, it doesn't say anything about not hitting that assertion - the function isn't returning EDEADLK (indeed, it's not returning anything).

Yeah, but the fact that we are hitting the assertion means EDEADLK was returned, which shouldn't happen for my mutex.

Quote:

Originally Posted by JohnGraham

Well, memory corruption can cause pretty much anything...

That's what I want to hear: that this is the only explanation, coz there is no other explanation.

Quote:

Originally Posted by JohnGraham

Can you reliably reproduce this behaviour? If so, try and strip away the excess parts of the code until you (a) don't get the error or (b) have a very small, simple test-case you can post for us to really help you.

I know I haven't given much information, that's coz this is not a stand-alone application running on standard linux. It's a consumer device. There are a lot of things that happen hence I cannot give a simple test case for it. Even I am struggling to reproduce the issue on my setup. This device has limited debugging capabilities. I have a core file, but most of the information doesn't make sense.

What I want to know is whether there could be any other explanation, apart from memory corruption (the simplest one), for such behavior.

Thanks for your response John. Appreciate it.

JohnGraham · 05-28-2010, 06:34 AM

Quote:

Originally Posted by nikhil_no_1

Yeah, but the fact that we are hitting the assertion means EDEADLK was returned, which shouldn't happen for my mutex.

Is this apparent from the pthread_mutex_lock.c source code (which I don't have to hand)? Otherwise, I can't see how you can make that link - because the assertion seems to happen within the call to pthread_mutex_lock, it hasn't returned EDEADLK, since it hasn't returned anything - it's asserted and aborted before its time's up.

Quote:

Originally Posted by nikhil_no_1

What I want to know is whether there could be any other explanation, apart from memory corruption (the simplest one), for such behavior.

If you're sure EDEADLK is returned (or about to be returned), have you made sure that all the relevant calls to pthread_mutexattr_{init,settype} are (a) made correctly and (b) have error conditions spotted and dealt with appropriately? If such an error is logged, the logs may show some reason why the PTHREAD_MUTEX_RECURSIVE setting couldn't be used - can't think why, but that's computers for you I guess...
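
The kind of checked initialisation suggested here could look roughly like the sketch below. It is not the original poster's code: `init_recursive_mutex` is a made-up helper name, and logging to stderr is only an assumption about how such errors would be reported on the device.

```c
#define _POSIX_C_SOURCE 200809L  /* expose PTHREAD_MUTEX_RECURSIVE */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Initialise *mutex as a recursive mutex, checking every step.
   Returns 0 on success or the error number of the first failing call.
   (Hypothetical helper; the pthread_* functions return the error number
   directly rather than setting errno.) */
int init_recursive_mutex(pthread_mutex_t *mutex)
{
    pthread_mutexattr_t attr;
    int rc;

    assert(mutex != NULL);

    rc = pthread_mutexattr_init(&attr);
    if (rc != 0) {
        fprintf(stderr, "pthread_mutexattr_init: %s\n", strerror(rc));
        return rc;
    }

    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc != 0)
        fprintf(stderr, "pthread_mutexattr_settype: %s\n", strerror(rc));

    if (rc == 0) {
        rc = pthread_mutex_init(mutex, &attr);
        if (rc != 0)
            fprintf(stderr, "pthread_mutex_init: %s\n", strerror(rc));
    }

    /* The attribute object is no longer needed once the mutex exists. */
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

If any step fails, the caller gets a non-zero error number and a log line naming the call, which is the information the post above says would be useful.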

nikhil_no_1 · 05-28-2010, 08:10 AM

Quote:

Originally Posted by JohnGraham

Is this apparent from the pthread_mutex_lock.c source code (which I don't have to hand)? Otherwise, I can't see how you can make that link - because the assertion seems to happen within the call to pthread_mutex_lock, it hasn't returned EDEADLK, since it hasn't returned anything - it's asserted and aborted before its time's up.

I see what you are trying to say.
I'm attaching pthread_mutex_lock.c
This is the location of the assert.
258  oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
259                                                newval, 0);
260
261  if (oldval != 0)
262    {
263      /* The mutex is locked. The kernel will now take care of
264         everything. */
265      INTERNAL_SYSCALL_DECL (__err);
266      int e = INTERNAL_SYSCALL (futex, __err, 4, &mutex->__data.__lock,
267                                FUTEX_LOCK_PI, 1, 0);
268
269      if (INTERNAL_SYSCALL_ERROR_P (e, __err)
270          && (INTERNAL_SYSCALL_ERRNO (e, __err) == ESRCH
271              || INTERNAL_SYSCALL_ERRNO (e, __err) == EDEADLK))
272        {
273          assert (INTERNAL_SYSCALL_ERRNO (e, __err) != EDEADLK
274                  || (kind != PTHREAD_MUTEX_ERRORCHECK_NP
275                      && kind != PTHREAD_MUTEX_RECURSIVE_NP));
276          /* ESRCH can happen only for non-robust PI mutexes where
277             the owner of the lock died. */
278          assert (INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust);
279
280          /* Delay the thread indefinitely. */
281          while (1)
282            pause_not_cancel ();
283        }
284
285      oldval = mutex->__data.__lock;
286
287      assert (robust || (oldval & FUTEX_OWNER_DIED) == 0);
288    }

I got misled by this code here. I thought this is what should get executed for mutex of type PTHREAD_MUTEX_RECURSIVE.

239  if (kind == PTHREAD_MUTEX_RECURSIVE_NP)
240    {
241      THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
242
243      /* Just bump the counter. */
244      if (__builtin_expect (mutex->__data.__count + 1 == 0, 0))
245        /* Overflow of the counter. */
246        return EAGAIN;
247
248      ++mutex->__data.__count;
249
250      return 0;
251    }

However this is where the PTHREAD_MUTEX_RECURSIVE case gets handled right in the beginning.

46  switch (__builtin_expect (mutex->__data.__kind, PTHREAD_MUTEX_TIMED_NP))
47    {
48      /* Recursive mutex. */
49    case PTHREAD_MUTEX_RECURSIVE_NP:
50      /* Check whether we already hold the mutex. */
51      if (mutex->__data.__owner == id)
52        {
53          /* Just bump the counter. */
54          if (__builtin_expect (mutex->__data.__count + 1 == 0, 0))
55            /* Overflow of the counter. */
56            return EAGAIN;
57
58          ++mutex->__data.__count;
59
60          return 0;
61        }
62
63      /* We have to get the mutex. */
64      LLL_MUTEX_LOCK (mutex->__data.__lock);
65
66      assert (mutex->__data.__owner == 0);
67      mutex->__data.__count = 1;
68      break;

My mutex is set to:
pthread_mutexattr_settype(&mutexAttrib, PTHREAD_MUTEX_RECURSIVE);
(PTHREAD_MUTEX_RECURSIVE = PTHREAD_MUTEX_RECURSIVE_NP)

So is it correct to say that, since it did not go into case PTHREAD_MUTEX_RECURSIVE_NP, the mutex data structure was corrupted?

Quote:

Originally Posted by JohnGraham

If you're sure EDEADLK is returned (or about to be returned), have you made sure that all the relevant calls to pthread_mutexattr_{init,settype} are (a) made correctly and (b) have error conditions spotted and dealt with appropriately? If such an error is logged, the logs may show some reason why the PTHREAD_MUTEX_RECURSIVE setting couldn't be used - can't think why, but that's computers for you I guess...

That's a good suggestion. I will check that if I see it again.

Thanks again
Nikhil

JohnGraham · 05-28-2010, 10:23 AM

Quote:

Originally Posted by nikhil_no_1

So is it correct to say that, since it did not go into case PTHREAD_MUTEX_RECURSIVE_NP, the mutex data structure was corrupted?

It could have been corrupted, or like I said before, just failed to be initialised correctly for whatever reason.

You can double-check the mutexattr hasn't been changed or anything crazy by using pthread_mutexattr_gettype() after you've initialised the mutex using the attributes (and checked return values, of course).
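
A read-back check of this kind can be written with pthread_mutexattr_gettype(), the POSIX call for querying the type stored in an attribute object. The sketch below is only illustrative, and `attr_has_type` is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L  /* expose the mutex type constants */
#include <assert.h>
#include <pthread.h>

/* Return 1 if *attr currently requests expected_type, 0 if it requests a
   different type, or -1 if the type could not be read.
   (Hypothetical helper.) */
int attr_has_type(const pthread_mutexattr_t *attr, int expected_type)
{
    int actual_type;

    assert(attr != NULL);

    if (pthread_mutexattr_gettype(attr, &actual_type) != 0)
        return -1;
    return actual_type == expected_type;
}
```

Calling `attr_has_type(&mutexAttrib, PTHREAD_MUTEX_RECURSIVE)` right after pthread_mutex_init() would confirm that the attribute object still held the recursive setting when the mutex was created. It says nothing about what happens to the mutex afterwards.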

If that checks out, it's probably time to mail the developers - I can't see any way to extract the pthread_mutexattr_t or relevant information from a pthread_mutex_t, which would be useful to check at each lock to make sure it hasn't changed.

John G
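
Since the portable API offers no way to read the type back out of a pthread_mutex_t, one possible workaround is to keep a copy of the intended type next to the mutex and check it before every lock. The sketch below is an assumption-laden illustration, not something from the thread: `struct checked_mutex`, its functions and the `GUARD_WORD` value are all hypothetical, and the guard words only catch overwrites that run across the edges of the structure, not damage confined to the inside of the mutex.

```c
#define _POSIX_C_SOURCE 200809L  /* expose PTHREAD_MUTEX_RECURSIVE */
#include <assert.h>
#include <pthread.h>

#define GUARD_WORD 0x5AFE5AFEu   /* arbitrary marker value (assumption) */

/* Hypothetical wrapper: the mutex sits between two guard words and carries
   a copy of the type it was created with. */
struct checked_mutex {
    unsigned int head_guard;
    pthread_mutex_t mutex;
    int type;
    unsigned int tail_guard;
};

/* Create the mutex with the requested type and arm the guards.
   Returns 0 on success or the error number of the failing call. */
int checked_mutex_init(struct checked_mutex *cm, int type)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, type);
    if (rc == 0)
        rc = pthread_mutex_init(&cm->mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    if (rc == 0) {
        cm->head_guard = GUARD_WORD;
        cm->type = type;
        cm->tail_guard = GUARD_WORD;
    }
    return rc;
}

/* Return 1 if both guards and the recorded type look untouched, else 0. */
int checked_mutex_intact(const struct checked_mutex *cm, int expected_type)
{
    return cm->head_guard == GUARD_WORD
        && cm->tail_guard == GUARD_WORD
        && cm->type == expected_type;
}

/* Abort early, in our own code, if the wrapper looks damaged; then lock. */
int checked_mutex_lock(struct checked_mutex *cm, int expected_type)
{
    assert(checked_mutex_intact(cm, expected_type));
    return pthread_mutex_lock(&cm->mutex);
}
```

A failed check here would at least fire in application code with a usable backtrace, rather than inside glibc.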

ArthurSittler · 05-30-2010, 05:06 AM

Is it possible that a process dies while it is holding a lock on your mutex?
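
For context, this is how owner death shows up with a robust mutex on a current glibc. The sketch is an illustration under several assumptions: it uses a thread rather than a separate process for brevity, `lock_after_owner_death` and `lock_and_exit` are hypothetical names, and it relies on the POSIX.1-2008 names pthread_mutexattr_setrobust() and pthread_mutex_consistent(), which may not be available under those names on a glibc as old as the 2.5 mentioned earlier in the thread.

```c
#define _POSIX_C_SOURCE 200809L  /* expose the robust mutex API */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Thread body: take the mutex and terminate without releasing it. */
static void *lock_and_exit(void *arg)
{
    pthread_mutex_t *mutex = arg;

    assert(mutex != NULL);
    pthread_mutex_lock(mutex);
    return NULL;
}

/* Let a thread die while holding a robust mutex, then lock it from the
   caller. Returns the result of that lock attempt, or -1 on setup failure.
   (Hypothetical helper.) */
int lock_after_owner_death(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    pthread_t owner;
    int rc;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST) != 0 ||
        pthread_mutex_init(&mutex, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    if (pthread_create(&owner, NULL, lock_and_exit, &mutex) != 0) {
        pthread_mutex_destroy(&mutex);
        return -1;
    }
    pthread_join(owner, NULL);         /* the owner is now gone */

    rc = pthread_mutex_lock(&mutex);
    if (rc == EOWNERDEAD)
        pthread_mutex_consistent(&mutex);  /* mark the state as repaired */
    if (rc == 0 || rc == EOWNERDEAD)
        pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);
    return rc;
}
```

With a robust mutex the next locker is told about the dead owner through EOWNERDEAD instead of blocking forever. The glibc comment quoted earlier in the thread describes the non-robust PI case, where ESRCH is what signals that the owner died.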

nikhil_no_1 · 06-02-2010, 06:17 AM

It was most likely a case of the mutex data structure getting corrupted.
For lack of time I had to revert the change after which this issue surfaced (some thread priorities had been changed).
Now we are not seeing this. Later I will get valgrind to run on this system to really debug this issue.

Thanks John/Arthur for your replies.

Vidhuran · 07-05-2011, 03:57 AM

Nikhil,
I'm going through the same cycle that you went through: difficulty reproducing the problem and no luck finding its root cause.
Did you find out the root cause of the error in your case? That might help me too.
For now, I'm also thinking of reverting the changes that were made so that the error doesn't come up again.

Thanks
Vidhuran