LinuxQuestions.org
05-28-2010, 04:04 AM   #1
nikhil_no_1
LQ Newbie
 
Registered: Sep 2004
Posts: 6

Rep: Reputation: 0
pthread_mutex_lock returning EDEADLK for a mutex of type PTHREAD_MUTEX_RECURSIVE


Hi,

I am getting the following assertion failure in my application:

pthread_mutex_lock.c:275: __pthread_mutex_lock: Assertion `(e) != 35 || (kind != PTHREAD_MUTEX_ERRORCHECK_NP && kind != PTHREAD_MUTEX_RECURSIVE_NP)' failed.

Now, all my mutexes are of type PTHREAD_MUTEX_RECURSIVE, and according to all the man pages and tutorials, EDEADLK is returned only for mutexes of type PTHREAD_MUTEX_ERRORCHECK.
So I really should not be hitting this assertion.

Could some kind of weird memory corruption be causing this? Or is there something more to it that I am not aware of?

I am using Linux kernel 2.6.2 and glibc 2.5 on PPC.

Thanks in advance.
Nikhil
 
05-28-2010, 05:36 AM   #2
JohnGraham
Member
 
Registered: Oct 2009
Posts: 467

Rep: Reputation: 139
Quote:
Originally Posted by nikhil_no_1
Now, all my mutexes are of type PTHREAD_MUTEX_RECURSIVE, and according to all the man pages and tutorials, EDEADLK is returned only for mutexes of type PTHREAD_MUTEX_ERRORCHECK.
So I really should not be hitting this assertion.
How does that follow? The API tells you the function won't return EDEADLK; it doesn't say anything about not hitting that assertion. The function isn't returning EDEADLK (indeed, it's not returning anything).


Quote:
Originally Posted by nikhil_no_1
Could some kind of weird memory corruption be causing this? Or is there something more to it that I am not aware of?
Well, memory corruption can cause pretty much anything...


Can you reliably reproduce this behaviour? If so, try and strip away the excess parts of the code until you (a) don't get the error or (b) have a very small, simple test-case you can post for us to really help you.
 
05-28-2010, 06:13 AM   #3
nikhil_no_1
LQ Newbie
 
Registered: Sep 2004
Posts: 6

Original Poster
Rep: Reputation: 0

Quote:
Originally Posted by JohnGraham
How does that follow? The API tells you the function won't return EDEADLK; it doesn't say anything about not hitting that assertion. The function isn't returning EDEADLK (indeed, it's not returning anything).
Yes, but the fact that we are hitting the assertion means EDEADLK was returned, which shouldn't happen for my mutex type.


Quote:
Originally Posted by JohnGraham
Well, memory corruption can cause pretty much anything...
That's what I want to confirm: that memory corruption is the only explanation, because I can't see any other.

Quote:
Originally Posted by JohnGraham
Can you reliably reproduce this behaviour? If so, try and strip away the excess parts of the code until you (a) don't get the error or (b) have a very small, simple test-case you can post for us to really help you.
I know I haven't given much information. That's because this is not a stand-alone application running on standard Linux; it's a consumer device. A lot is going on in it, so I cannot give a simple test case. I am even struggling to reproduce the issue on my own setup. The device has limited debugging capabilities. I have a core file, but most of the information in it doesn't make sense.

What I want to know is whether there could be any explanation other than memory corruption (the simplest one) for such behavior.

Thanks for your response John. Appreciate it.
 
05-28-2010, 06:34 AM   #4
JohnGraham
Member
 
Registered: Oct 2009
Posts: 467

Rep: Reputation: 139
Quote:
Originally Posted by nikhil_no_1
Yes, but the fact that we are hitting the assertion means EDEADLK was returned, which shouldn't happen for my mutex type.
Is this apparent from the pthread_mutex_lock.c source code (which I don't have to hand)? Otherwise, I can't see how you can make that link. The assertion fires inside the call to pthread_mutex_lock, so the function hasn't returned EDEADLK; it hasn't returned anything, because it asserted and aborted before it got the chance.


Quote:
Originally Posted by nikhil_no_1
What I want to know is whether there could be any explanation other than memory corruption (the simplest one) for such behavior.
If you're sure EDEADLK is returned (or about to be returned), have you made sure that all the relevant calls to pthread_mutexattr_{init,settype} are (a) made correctly and (b) have error conditions spotted and dealt with appropriately? If such an error is logged, the logs may show some reason why the PTHREAD_MUTEX_RECURSIVE setting couldn't be used - can't think why, but that's computers for you I guess...
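
As a rough illustration of what "checked and dealt with" could look like, here is a minimal sketch. The helper name init_recursive_mutex is made up for this example and is not from the original application; it only uses the standard pthread_mutexattr_init, pthread_mutexattr_settype, pthread_mutex_init and pthread_mutexattr_destroy calls and reports the first failure it sees.

```c
#define _XOPEN_SOURCE 700   /* ask for the POSIX names such as PTHREAD_MUTEX_RECURSIVE */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the thread): initialise *m as a recursive
   mutex and report any failure instead of silently ignoring it.
   Returns 0 on success, or the error number of the first failing call. */
int init_recursive_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0) {
        fprintf(stderr, "pthread_mutexattr_init: %s\n", strerror(rc));
        return rc;
    }

    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc != 0) {
        fprintf(stderr, "pthread_mutexattr_settype: %s\n", strerror(rc));
    } else {
        rc = pthread_mutex_init(m, &attr);
        if (rc != 0)
            fprintf(stderr, "pthread_mutex_init: %s\n", strerror(rc));
    }

    /* The attribute object is no longer needed once the mutex exists. */
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

If every mutex in the application goes through one helper like this, a failed settype or init shows up in the logs at start-up rather than as odd behaviour at lock time.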
 
05-28-2010, 08:10 AM   #5
nikhil_no_1
LQ Newbie
 
Registered: Sep 2004
Posts: 6

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by JohnGraham
Is this apparent from the pthread_mutex_lock.c source code (which I don't have to hand)? Otherwise, I can't see how you can make that link. The assertion fires inside the call to pthread_mutex_lock, so the function hasn't returned EDEADLK; it hasn't returned anything, because it asserted and aborted before it got the chance.
I see what you are trying to say.
I'm attaching pthread_mutex_lock.c.
This is the location of the assert:
258   oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
259                                                 newval, 0);
260
261   if (oldval != 0)
262     {
263       /* The mutex is locked. The kernel will now take care of
264          everything. */
265       INTERNAL_SYSCALL_DECL (__err);
266       int e = INTERNAL_SYSCALL (futex, __err, 4, &mutex->__data.__lock,
267                                 FUTEX_LOCK_PI, 1, 0);
268
269       if (INTERNAL_SYSCALL_ERROR_P (e, __err)
270           && (INTERNAL_SYSCALL_ERRNO (e, __err) == ESRCH
271               || INTERNAL_SYSCALL_ERRNO (e, __err) == EDEADLK))
272         {
273           assert (INTERNAL_SYSCALL_ERRNO (e, __err) != EDEADLK
274                   || (kind != PTHREAD_MUTEX_ERRORCHECK_NP
275                       && kind != PTHREAD_MUTEX_RECURSIVE_NP));
276           /* ESRCH can happen only for non-robust PI mutexes where
277              the owner of the lock died. */
278           assert (INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust);
279
280           /* Delay the thread indefinitely. */
281           while (1)
282             pause_not_cancel ();
283         }
284
285       oldval = mutex->__data.__lock;
286
287       assert (robust || (oldval & FUTEX_OWNER_DIED) == 0);
288     }
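
Taken purely as a boolean expression, the condition asserted at lines 273-275 can be tried out in isolation. The sketch below is only an illustration: the function name deadlk_assert_holds is made up, and its two parameters stand in for the futex error number and for whatever value the real function has in `kind` at that point (how `kind` is derived is not shown in the excerpt above). It uses the glibc `_NP` constant names exactly as they appear in the assertion.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the asserted expression: returns non-zero when
   the assertion would hold, and 0 when it would fail and abort. */
int deadlk_assert_holds(int err, int kind)
{
    return err != EDEADLK
           || (kind != PTHREAD_MUTEX_ERRORCHECK_NP
               && kind != PTHREAD_MUTEX_RECURSIVE_NP);
}
```

The expression is false only when the error is EDEADLK and, at the same time, `kind` equals one of the two listed types. For any other error, or any other value of `kind`, it is true.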


I was misled by the following code. I thought this was what would be executed for a mutex of type PTHREAD_MUTEX_RECURSIVE:

239   if (kind == PTHREAD_MUTEX_RECURSIVE_NP)
240     {
241       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
242
243       /* Just bump the counter. */
244       if (__builtin_expect (mutex->__data.__count + 1 == 0, 0))
245         /* Overflow of the counter. */
246         return EAGAIN;
247
248       ++mutex->__data.__count;
249
250       return 0;
251     }

However, this is where the PTHREAD_MUTEX_RECURSIVE case gets handled, right at the beginning of the function:

46   switch (__builtin_expect (mutex->__data.__kind, PTHREAD_MUTEX_TIMED_NP))
47     {
48       /* Recursive mutex. */
49     case PTHREAD_MUTEX_RECURSIVE_NP:
50       /* Check whether we already hold the mutex. */
51       if (mutex->__data.__owner == id)
52         {
53           /* Just bump the counter. */
54           if (__builtin_expect (mutex->__data.__count + 1 == 0, 0))
55             /* Overflow of the counter. */
56             return EAGAIN;
57
58           ++mutex->__data.__count;
59
60           return 0;
61         }
62
63       /* We have to get the mutex. */
64       LLL_MUTEX_LOCK (mutex->__data.__lock);
65
66       assert (mutex->__data.__owner == 0);
67       mutex->__data.__count = 1;
68       break;

My mutex is set to:
pthread_mutexattr_settype(&mutexAttrib, PTHREAD_MUTEX_RECURSIVE);
(PTHREAD_MUTEX_RECURSIVE = PTHREAD_MUTEX_RECURSIVE_NP)

So is it correct to say that, since execution did not take the PTHREAD_MUTEX_RECURSIVE_NP case, the mutex data structure must have been corrupted?


Quote:
Originally Posted by JohnGraham
If you're sure EDEADLK is returned (or about to be returned), have you made sure that all the relevant calls to pthread_mutexattr_{init,settype} are (a) made correctly and (b) have error conditions spotted and dealt with appropriately? If such an error is logged, the logs may show some reason why the PTHREAD_MUTEX_RECURSIVE setting couldn't be used - can't think why, but that's computers for you I guess...
That's a good suggestion. I will check that if I see it again.

Thanks again
Nikhil
Attached Files
File Type: txt pthread_mutex_lock.txt (12.0 KB, 81 views)
 
05-28-2010, 10:23 AM   #6
JohnGraham
Member
 
Registered: Oct 2009
Posts: 467

Rep: Reputation: 139
Quote:
Originally Posted by nikhil_no_1
So is it correct to say that, since execution did not take the PTHREAD_MUTEX_RECURSIVE_NP case, the mutex data structure must have been corrupted?
It could have been corrupted or, as I said before, it may simply not have been initialised correctly for whatever reason.

You can double-check that the mutexattr hasn't been changed or anything crazy by calling pthread_mutexattr_gettype() after you've initialised the mutex using the attributes (and checked return values, of course).
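
A minimal sketch of that check, assuming the attribute object is still around after pthread_mutex_init. The helper name attr_is_recursive is made up for illustration; the only library call it relies on is the standard pthread_mutexattr_gettype.

```c
#define _XOPEN_SOURCE 700   /* ask for the POSIX names such as PTHREAD_MUTEX_RECURSIVE */
#include <pthread.h>

/* Hypothetical helper: returns 1 if the attribute object asks for a recursive
   mutex, 0 if it asks for some other type, and -1 if the query itself fails. */
int attr_is_recursive(const pthread_mutexattr_t *attr)
{
    int type = 0;

    if (pthread_mutexattr_gettype(attr, &type) != 0)
        return -1;

    return type == PTHREAD_MUTEX_RECURSIVE ? 1 : 0;
}
```

Note that this inspects the attribute object only, not the mutex that was created from it, so it cannot detect a mutex that was overwritten after initialisation.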

If that checks out, it's probably time to mail the developers - I can't see any way to extract the pthread_mutexattr_t or relevant information from a pthread_mutex_t, which would be useful to check at each lock to make sure it hasn't changed.

John G
 
05-30-2010, 05:06 AM   #7
ArthurSittler
Member
 
Registered: Jul 2008
Distribution: Slackware
Posts: 124

Rep: Reputation: 31
Where are you manipulating your mutex?

Is it possible that a process dies while it is holding a lock on your mutex?
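
For what it's worth, here is a small sketch of what a dead owner looks like from the next locker's side when the mutex is created as robust. It is an assumption-laden illustration, not the original application's code: the names robust_mtx, lock_and_exit and lock_after_owner_died are made up, it uses a thread rather than a process for simplicity, and it relies on pthread_mutexattr_setrobust and pthread_mutex_consistent, which newer glibc versions provide (older ones such as glibc 2.5 only have the _np-suffixed variants).

```c
#define _XOPEN_SOURCE 700   /* robust mutex API from POSIX.1-2008 */
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t robust_mtx;   /* hypothetical mutex for this sketch */

/* Thread body: take the lock and terminate without releasing it. */
static void *lock_and_exit(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&robust_mtx);
    return NULL;
}

/* Hypothetical demo: returns what pthread_mutex_lock reports after the
   previous owner has died, or -1 if the setup itself fails. */
int lock_after_owner_died(void)
{
    pthread_mutexattr_t attr;
    pthread_t t;
    int rc;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST) != 0
        || pthread_mutex_init(&robust_mtx, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    if (pthread_create(&t, NULL, lock_and_exit, NULL) != 0)
        return -1;
    pthread_join(t, NULL);           /* the owner is gone from here on */

    rc = pthread_mutex_lock(&robust_mtx);
    if (rc == EOWNERDEAD) {
        /* We now own the lock; mark its state as repaired and release it. */
        pthread_mutex_consistent(&robust_mtx);
        pthread_mutex_unlock(&robust_mtx);
    } else if (rc == 0) {
        pthread_mutex_unlock(&robust_mtx);
    }
    pthread_mutex_destroy(&robust_mtx);
    return rc;
}
```

With a robust mutex the dead owner is reported to the next locker as EOWNERDEAD. The comment at lines 276-277 of the excerpt above describes the non-robust PI case, where the futex call fails with ESRCH instead.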
 
06-02-2010, 06:17 AM   #8
nikhil_no_1
LQ Newbie
 
Registered: Sep 2004
Posts: 6

Original Poster
Rep: Reputation: 0
It was most likely a case of the mutex data structure getting corrupted.
Because of a shortage of time, I had to revert the change after which this issue surfaced (some thread priorities had been changed).
We are no longer seeing the problem. Later I will get valgrind running on this system to debug the issue properly.

Thanks John/Arthur for your replies.
 
07-05-2011, 03:57 AM   #9
Vidhuran
LQ Newbie
 
Registered: Jun 2011
Posts: 1

Rep: Reputation: Disabled
Nikhil,
I'm going through the same cycle that you went through: the problem is difficult to reproduce, and I am unable to find its root cause.
Did you find the root cause of the error in your case? That might help me too.
For now, I'm also thinking of reverting the changes that were made so that the error doesn't come back.

Thanks
Vidhuran
 
  

