best methodology to debug a Linux daemon written in C++

aryan1 · 12-27-2009, 01:56 AM

Hi All,

I have a bunch of C++ applications that run as daemons on Linux.

Furthermore, since these daemons rely on each other, they use IPC (inter-process communication) to communicate with each other.

These daemons are compiled with the -g -O0 compiler flags and started in the right order by a shell script.

These daemons seem to be buggy and "sometimes" crash.

Another possible cause of the crashes is corrupted input data.

To find out about the real cause, I tried both gdb and Valgrind to debug it.

I used --leak-check=yes and --log-file=foo.txt options with Valgrind.

However, Valgrind did not report any invalid reads/writes or uninitialized-variable errors that might have caused the crash. The FAQ on Valgrind's official website says this is inherent in the way Valgrind works and cannot be changed: it cannot exactly replicate the native execution environment.

What are the best options to use with Valgrind or gdb to debug a Linux daemon and find out exactly what is happening in it at the time of the crash?

Should I attach Valgrind or gdb to an already running process, or is it OK to start the daemon under Valgrind from the shell script?

Thanks.

GooseYArd · 12-27-2009, 07:18 AM

Quote:

Originally Posted by aryan1

Hi All,

Should I attach Valgrind or gdb to an already running process, or is it OK to start the daemon under Valgrind from the shell script?

Thanks.

I would probably save valgrind until after you've identified where the crash is happening.

Are the daemons set up to produce a core file when they crash? Rather than attaching gdb to all the pids, it's probably easiest just to make sure that they dump core when they crash, and then load the core into gdb afterwards (gdb <program> <corefile>, or gdb -c <corefile>). You'll be able to use bt to get a backtrace at the time of the crash, without actually having to start up another instance of the daemon.

Do your daemons run setuid as some other user? If so, it's a little tricky to get the kernel to dump core for you. It requires a slight modification to the code: there's a syscall called prctl (use the option PR_SET_DUMPABLE with value 1) that gets the process to dump core.
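
The prctl call mentioned above could look roughly like the following. This is only a minimal sketch for Linux; the helper name make_process_dumpable is made up for illustration and is not taken from the original poster's code.

```cpp
#include <sys/prctl.h>   // prctl, PR_SET_DUMPABLE, PR_GET_DUMPABLE (Linux-specific)
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper (name is an assumption): mark this process as dumpable.
// The kernel clears the "dumpable" flag when a process changes its effective
// UID/GID, so this should be called AFTER any privilege change.
bool make_process_dumpable()
{
    if (prctl(PR_SET_DUMPABLE, 1UL, 0UL, 0UL, 0UL) == -1) {
        std::fprintf(stderr, "prctl(PR_SET_DUMPABLE): %s\n", std::strerror(errno));
        return false;
    }
    // Read the flag back to confirm that it is now set.
    return prctl(PR_GET_DUMPABLE, 0UL, 0UL, 0UL, 0UL) == 1;
}
```

A daemon would call make_process_dumpable() once during startup, after it has switched users. The core-file size limit still has to be non-zero for a core to be written.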

Mara · 12-27-2009, 11:37 AM

If it doesn't crash under valgrind, try under gdb. When it crashes, you can use the 'bt' command and see which path it used. If it crashes only sometimes, however, it would be good to try to get a scenario when it crashes more often. For instance, try to provide incorrect input data.

aryan1 · 12-27-2009, 11:53 AM

Quote:

Originally Posted by Mara

If it doesn't crash under valgrind, try under gdb. When it crashes, you can use the 'bt' command and see which path it used.

Actually, I already tried it under gdb.

However, gdb seems not to be able to catch the operation which causes the application to crash; it says "program exited normally" instead, and "bt" produces no results.

Quote:

If it crashes only sometimes, however, it would be good to try to get a scenario when it crashes more often. For instance, try to provide incorrect input data.

I am not sure whether the real cause of the bug is incorrect input data; that's what I am trying to find out. Since the amount of test data is big, it is really difficult to come up with an incorrect data pattern.

Maybe the best way is to get the application to generate a core dump, and debug the core dump with either gdb or Valgrind.

What do you think ?

aryan1 · 12-27-2009, 12:00 PM

Quote:

Originally Posted by GooseYArd

Are the daemons set up to produce a core file when they crash?

Actually, I have never set up any application to produce a core dump. I did come across the "ulimit -c unlimited" command, which allows an application to produce core dumps. Still, the daemon did not produce any core dump.

Quote:

Do your daemons run setuid as some other user?

My daemon first runs fork(), after which setsid() and chroot("/") are called. As far as I know, these are the usual steps for creating a daemon.

Do you think that my daemon needs the modification that you described ?

Mara · 12-27-2009, 12:10 PM

Quote:

Originally Posted by aryan1

However, gdb seems not to be able to catch the operation which causes the application to crash; it says "program exited normally" instead, and "bt" produces no results.

That's very important! Are you sure it really crashes instead of just exiting? gdb's behavior suggests a plain exit() or something similar. Where to look next depends on the source code. Does it use assert() or a similar method? If so, is it possible to turn on some debug output to show from which point and why it actually exits? Or you could override assert() and friends and add some debug output. If not, you may try on_exit() and print the state of any variables you suspect may be wrong.
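
As a rough illustration of the on_exit() idea, here is a minimal sketch. on_exit() is a glibc extension; the function names report_exit and install_exit_reporter are hypothetical, and writing to stderr is an assumption (a daemon that has closed stderr would need to log somewhere else, for example via syslog).

```cpp
#include <stdlib.h>   // on_exit (glibc extension, not standard C++)
#include <unistd.h>   // getpid
#include <cstdio>

// Hypothetical handler: called by libc during exit() with the exit status.
// 'arg' is whatever pointer was passed to on_exit(); here, a name to print.
static void report_exit(int status, void* arg)
{
    const char* name = static_cast<const char*>(arg);
    std::fprintf(stderr, "[%s pid=%ld] exiting via exit(), status=%d\n",
                 name, static_cast<long>(getpid()), status);
    std::fflush(stderr);
}

// Hypothetical helper: register the handler. 'daemon_name' must stay valid
// until the process exits (a string literal is fine).
bool install_exit_reporter(const char* daemon_name)
{
    return on_exit(report_exit, const_cast<char*>(daemon_name)) == 0;
}
```

Calling install_exit_reporter("daemon2") early in main() would print a line whenever the process terminates through exit() or a return from main(). The handler is not run if the process calls _exit(), calls abort(), or is killed by a signal, so a missing line is itself a hint about how the process died.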

aryan1 · 12-27-2009, 12:41 PM

Quote:

Originally Posted by Mara

That's very important! Are you sure it really crashes instead of just exiting?

Yes, I am sure. This daemon runs in an infinite loop. gdb says "program exited normally". However, when I run it using my shell script, it crashes.

Quote:

Does it use assert() or a similar method? If so, is it possible to turn on some debug output to show from which point and why it actually exits?

I use syslog to log some important steps in the application. However, in my experience so far, syslog is not ideal for tracking down a bug: I send the log message to the syslog daemon, and I do not know in detail how it handles these messages.

Let me give you more info on how I start my daemons.

Normally, I start them in order in a shell script as follows:

./daemon1
./daemon2
./daemon3

Starting the daemons this way causes a crash in daemon2. (daemon1, 2 and 3 use IPC to communicate with each other.)

For debugging, I modify the above script into the following:

./daemon1
gdb daemon2 (or valgrind --leak-check=yes --log-file=test --trace-children=yes daemon2)
./daemon3

Neither gdb nor valgrind identifies any possibly bad operation that may cause a crash.

Somehow, the second way of starting the daemons creates an execution environment that is different from the native execution environment.

Mara · 12-28-2009, 01:26 PM

If it looks this way, try to enable core dumps by using
ulimit -c unlimited
Maybe the core file will show something.
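
If the shell's ulimit setting does not reach the daemon, the same limit can be raised from inside the process. The following is a minimal sketch using the POSIX getrlimit/setrlimit calls; the helper name enable_core_dumps is an assumption made for this example.

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_CORE
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper (name is an assumption): raise the soft core-file size
// limit to the hard limit, which is the most an unprivileged process may do.
bool enable_core_dumps()
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) == -1) {
        std::fprintf(stderr, "getrlimit(RLIMIT_CORE): %s\n", std::strerror(errno));
        return false;
    }
    lim.rlim_cur = lim.rlim_max;   // soft limit := hard limit
    if (setrlimit(RLIMIT_CORE, &lim) == -1) {
        std::fprintf(stderr, "setrlimit(RLIMIT_CORE): %s\n", std::strerror(errno));
        return false;
    }
    return true;
}
```

Two caveats: if the hard limit is already 0, the call succeeds but no core will be written, and by default the core file goes into the process's current working directory, which must be writable by the daemon.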

GooseYArd · 12-28-2009, 01:48 PM

Have you tried setting a breakpoint in exit()? Unless the program is exiting via a syscall to exit, you should be able to break in libc exit() and get a backtrace there.

gdb 7.0 also has some insane capabilities you might like: you can step backward (reverse debugging), or use "catch syscall" if the program is exiting via the exit syscall.

aryan1 · 12-29-2009, 12:51 AM

Quote:

Originally Posted by GooseYArd

Have you tried setting a breakpoint in exit()? Unless the program is exiting via a syscall to exit, you should be able to break in libc exit() and get a backtrace there.

Based on the diagnostic messages the application prints out, I can say that it does not exit via a syscall to exit.

How can I set a breakpoint in libc exit() ? How is libc exit() different from a syscall to exit ?

GooseYArd · 12-29-2009, 07:36 AM

b exit

will do the trick.
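
On the difference between libc exit() and the exit syscall: exit() runs the registered atexit/on_exit handlers and flushes stdio before asking the kernel to terminate the process, while _exit() goes to the kernel directly, so a breakpoint on exit is never hit in that case. The sketch below demonstrates this with a forked child; the names at_exit_marker and handler_runs are made up for the example.

```cpp
#include <sys/types.h>
#include <sys/wait.h>   // waitpid
#include <unistd.h>     // fork, pipe, read, write, close, _exit
#include <cstdlib>      // std::atexit, std::exit

static int g_report_fd = -1;   // write end of the pipe, set in the child only

// Runs only when the process terminates through libc exit().
static void at_exit_marker()
{
    const char mark = 'X';
    ssize_t ignored = write(g_report_fd, &mark, 1);
    (void)ignored;
}

// Forks a child that registers an atexit handler and then terminates via
// exit() (use_libc_exit == true) or _exit() (use_libc_exit == false).
// Returns true if the handler ran in the child; false otherwise or on error.
bool handler_runs(bool use_libc_exit)
{
    int fds[2];
    if (pipe(fds) == -1) return false;

    pid_t pid = fork();
    if (pid == -1) { close(fds[0]); close(fds[1]); return false; }

    if (pid == 0) {                          // child
        close(fds[0]);
        g_report_fd = fds[1];
        std::atexit(at_exit_marker);
        if (use_libc_exit) std::exit(0);     // runs handlers, flushes stdio
        _exit(0);                            // terminates without running handlers
    }

    close(fds[1]);                           // parent keeps only the read end
    char buf = 0;
    ssize_t n = read(fds[0], &buf, 1);       // returns 0 (EOF) if nothing was written
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return n == 1 && buf == 'X';
}
```

Here handler_runs(true) reports that the handler ran and handler_runs(false) reports that it did not. For the second case, gdb's "catch syscall" (for example on exit_group) is the tool that would stop the process.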

wje_lq · 12-29-2009, 09:34 AM

Please do not be angry at this suggestion.

If you have a hunch that a daemon might be crashing because of invalid input data, and there's too much data to comb through, maybe you've already considered the possibility of sketching out, on paper, just what you require of the data for it to be valid, show those requirements to a local friend, and have him look over the parts of your code that check for data validity.

If not, then it's something to think about. Could save a lot of time.

ta0kira · 12-30-2009, 05:01 AM

I personally think multi-process systems integrated across IPC are a nightmare to debug with debuggers. I've had my best luck with good-old fprintf to piece together the chain of events and data states resulting in a crash. One thing leads to another and eventually I get to the bottom of it. This is an effective method, even with hundreds of source files across 10 or 20 libraries and as many programs, so long as you know your sources well. Knowing where the crash is is only half of it.
Kevin Barry
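
A minimal sketch of the fprintf approach described above, assuming the trace output goes to stderr (a daemon that has closed stderr would pass its own log FILE* instead). The names trace_to and TRACE are hypothetical.

```cpp
#include <unistd.h>   // getpid
#include <cstdarg>    // va_list, va_start, va_end
#include <cstdio>

// Hypothetical helper: write one trace line tagged with the pid and the
// source location, then flush so the line survives a crash right after it.
void trace_to(std::FILE* out, const char* file, int line, const char* fmt, ...)
{
    std::fprintf(out, "[pid %ld] %s:%d: ", static_cast<long>(getpid()), file, line);
    va_list args;
    va_start(args, fmt);
    std::vfprintf(out, fmt, args);
    va_end(args);
    std::fputc('\n', out);
    std::fflush(out);
}

// Convenience macro (an assumption of this sketch): fills in the call site.
#define TRACE(...) trace_to(stderr, __FILE__, __LINE__, __VA_ARGS__)
```

For example, TRACE("received message, length=%d", len) placed on both sides of each IPC call lets the per-process logs be lined up by pid to reconstruct the chain of events that led to the crash.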