Stack trace shows a function called itself when there is no recursion. How?
I'm hoping that kernel engineers can help me with a puzzling issue I am encountering.
I have a multi-threaded program that is in a "hung" state. While debugging it, I found that one thread shows the same function twice in a row in its call stack; for example, in frames 9 and 10. My program is quite simple and does not involve any recursion. I believe this is causing my program to hang, but why is this happening? Is it possible for the call stack of a thread to get corrupted somehow by other threads? Or heap corruption, maybe? What else can cause this?
Greatly appreciate any comments or help. Thanks!
Don't guess. Don't speculate.
You've got a debugger - use it. Just single-step through the code, and observe what happens!
Debugger misunderstands the call stack because the debugger isn't perfect.
Debugger misunderstands the call stack because something corrupted the stack.
The code executed something like the recursion you see as a result of some corrupted memory.
In all cases, there is likely some corrupted memory involved (in the actual failure and probably in the strange stack display).
You may have other reasons (such as the pattern of non-reproducibility of the failure) for deciding the bug is more likely cross-thread.
If you know assembly language, you can generally look at the asm instructions around the point of failure and at the start of the function, look at the register values at the point of failure, and look at the stack yourself, and from all that make a good estimate of exactly what was corrupted. Then you usually need to debug again from the start to try to catch the memory corruption in the act.
I don't really know how you chase such bugs if you don't know assembler.
In theory, the run-time tools for catching writes beyond the end of arrays and similar bugs should catch a fair fraction of the original bugs leading to such memory clobbers. In practice, my coworkers who use such tools to find such bugs usually fail to find them and need me to apply assembler expertise and debugging experience to a more manual approach to solving failures like this.
An example of a cross-thread bug fitting this symptom would be passing a pointer to a local object from a function creating a new thread to that thread, then returning from the function so that the object's storage is no longer valid. If the target thread writes to the object, the original thread might crash with a messed-up stack.
That is just one example, since you seem to not understand the sort of thing that might corrupt a stack. There are many possibilities and the above overly specific example is not given as a likely theory.
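To make the example above concrete, here is a minimal sketch in C using POSIX threads. The names `struct job`, `worker`, `start_job` and `finish_job` are hypothetical and are not taken from the program being discussed. The buggy pattern is shown only inside a comment, because actually running it is undefined behavior; the compiled code shows one common way to avoid it, which is to give the thread heap storage that outlives the creating function.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical job type, for illustration only. */
struct job {
    int input;
    int result;
};

/* Thread body: writes through the pointer it was handed. */
static void *worker(void *arg)
{
    struct job *j = arg;
    j->result = j->input * 2;
    return NULL;
}

/*
 * BUGGY pattern (comment only, do not use):
 *
 *     int start_job_buggy(pthread_t *tid) {
 *         struct job j = { 21, 0 };      // lives in this stack frame
 *         return pthread_create(tid, NULL, worker, &j);
 *     }   // frame is gone here; worker may write into whatever
 *         // the creating thread puts on its stack next
 */

/* Safer pattern: the job lives on the heap until finish_job() frees it. */
struct job *start_job(pthread_t *tid, int input)
{
    struct job *j = malloc(sizeof *j);
    if (j == NULL)
        return NULL;
    j->input = input;
    j->result = 0;
    if (pthread_create(tid, NULL, worker, j) != 0) {
        free(j);
        return NULL;
    }
    return j;
}

/* Wait for the worker, read its result, then release the job. */
int finish_job(pthread_t tid, struct job *j)
{
    int rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc;
    int r = j->result;
    free(j);
    return r;
}
```

A caller would declare a `pthread_t`, call `start_job(&tid, 21)`, and later pass both to `finish_job`. Joining before freeing is what guarantees the worker is done with the memory.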
Thanks for the replies. It is very helpful. But I can elaborate more on the problem and please let me know what you think.
The problem I am presenting here was actually seen by someone else. He just gave me the stack traces of a problem that is very hard to reproduce. He has seen it only a few times, and only after his program has run for a very long time. So, I did not have the chance to trace the problem using a debugger. But he did give me the traces of all threads during the "hung" state. One of the threads shows two functions called twice with no recursion involved, while the main thread shows that a signal handler interrupted a malloc() call and that the signal handler called sem_wait(), in which the main thread is now sleeping. The interesting thing is that the function that shows up twice in the other thread is related to free() and is waiting on a _ll_mutex_lock (I might have misspelled that, because I don't have the trace details right now). I know it is bad practice to call sem_wait() in a signal handler (I have emphasized this to the author of the code), and I really think that the interrupted malloc() may be causing this puzzling stack trace. In one of the replies, I see that memory corruption could be a cause. Given the details of this case, what do you think may really be happening here? Can the interrupted malloc() call result in memory corruption and then stack corruption?
Thanks so much for the insight!
Now you're making it sound like a simple deadlock. So I think you are correct that the problem is doing things in a signal handler that are not safe to do there.
No memory corruption is necessary for the deadlock. For deadlock, you just need two resources, such as:
thread A owns resource X
thread B owns resource Y and is waiting for resource X
thread A gets a signal and the signal handler waits for resource Y
I don't know the details of how malloc interacts with multithreading and signals, so I'm not certain, but from your description, I would expect resource X in the deadlock to be something used internally by malloc to control multi-threaded access to its memory management data structures.