Using ElectricFence ??

johnsfine · 08-05-2014, 10:42 AM

I'm trying to use ElectricFence-2.2.2-28.el6.x86_64 to diagnose a memory clobber in a giant multi-threaded program.

Without ElectricFence, about a third of the time, the program seg faults after completion of its work, during destructors of long term objects. It appears that the malloc control structures between allocations have been clobbered, so consolidating free space crashes. But the addresses are different every time, so I can't backtrack to the cause.

Using ElectricFence with default settings, or with EF_PROTECT_BELOW=1, causes the program to complete (slowly) without any faults every time.

So I tried EF_PROTECT_FREE=1.
As soon as the program switches to the serious multi-threaded section (after quite a lot of initial processing) it aborts with:
ElectricFence Exiting: mprotect() failed: Cannot allocate memory

I switched to a machine with 256GB of RAM and 128GB of swap. No swap space was ever used. The failure was the same. I understand EF_PROTECT_FREE=1 would tend to make a 32-bit executable exhaust its address space. But this is 64-bit, so it isn't exhausting address space. It sure isn't exhausting physical RAM. Is EF_PROTECT_FREE=1 broken for multi-threaded allocations?

Does ElectricFence actually work? This is my first usage of it. I find it hard to believe the memory clobber I'm chasing is one that would not be found by the documented behavior of ElectricFence's default settings. Yet the fault is totally hidden (no seg fault during the destructors) instead of being revealed.

Any suggestions?

ntubski · 08-05-2014, 11:41 AM

Haven't used it myself, but according to Wikipedia:

Quote:

eFence[dead link] – source code – not thread safe

That page also mentions a fork, DUMA. Its README has some references to thread local storage, so presumably it is thread safe.

That README also mentions max_map_count:

Quote:

https://www.kernel.org/doc/Documentation/sysctl/vm.txt

This file contains the maximum number of memory map areas a process
may have. Memory map areas are used as a side-effect of calling
malloc, directly by mmap and mprotect, and also when loading shared
libraries.

While most applications need less than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation.

The default value is 65536.

Possibly you are exhausting this limit.

johnsfine · 08-05-2014, 12:36 PM

Quote:

Originally Posted by ntubski

Possibly you are exhausting this limit.

Thank you. That answered one question. I fixed the failure in EF_PROTECT_FREE=1 with (as root)

Code:

echo 999999 > /proc/sys/vm/max_map_count

So that setting now acts like the other two. It completely hides the memory clobber symptom.

Valgrind gave similar results. It produces hundreds of reports of memory issues that I believe are false (none of them are incorrect writes, so even if they aren't false, none explain the seg fault), and so far (testing is slow) no seg faults have occurred when the program is run inside Valgrind.

sundialsvcs · 08-05-2014, 01:11 PM

You have a timing-hole problem in addition to an allocation problem. Consider doing something like setting a mutex or somesuch around the destructor logic to force this cleanup process to occur in series instead of racing.

What's probably happening is that something is being freed without immediately setting that pointer to NULL, resulting in a subsequent reference to a stale pointer. When the pointer is still valid, nothing happens; when it is now-freed storage, a segfault occurs. (But because the pointer is stale, it could have been used to irrevocably clobber something else, so that the root cause of the problem can no longer be found.)

Examine the destructors carefully, including all the inheritance chains relating to destructors, and ensure that, wherever anything is disposed of, the very n-e-x-t statement sets that pointer to NULL, whether it seems that you "need to" or not.
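
In C++ the "free, then null on the very next statement" rule can be folded into one call so the second step cannot be forgotten. This is only a sketch: destroy_and_null and Resource are hypothetical names, not anything from the program under discussion.

```cpp
// Hypothetical payload type, used only to make the sketch compile.
struct Resource { int value = 42; };

// Delete the object and null the caller's pointer in one step, so no
// statement can run between the free and the reset.
template <typename T>
void destroy_and_null(T*& p) {
    delete p;      // deleting a null pointer is a no-op, so repeat calls are safe
    p = nullptr;
}
```

Note that this only resets the one pointer passed in. Any other copies of the address held elsewhere are still stale. Where ownership is clear, std::unique_ptr gives the same effect through reset().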

I have never had particularly good success with memory-fence tools like this because, at best, they can only discover "there's a smoking crater here." They can't predict why that crater exists nor how the memory got to be that way. But it's a dead-certainty that a stale pointer issue will be the root cause.

If you know of any point in the past where the app was reliable, and you have a complete version-control history of it, you might be able to look at the change-logs, scanning that delta'd source code for destructors.

johnsfine · 08-05-2014, 02:43 PM

Quote:

Originally Posted by sundialsvcs

Consider doing something like setting a mutex or somesuch around the destructor logic to force this cleanup process to occur in series instead of racing.

Actually the destructor that fails is in an entirely single-threaded part of the program, as was the constructor of the object whose destructor fails. All of that code (construction and destruction of the long term objects) has been used so many times that there is no possibility the fault is connected to the symptom.

Some random write in an unrelated (and almost certainly multi-threaded) section of the program is trashing something that is then symptom free until that destructor.

Quote:

What's probably happening is that something is being freed without immediately setting that pointer to NULL,

The EF_PROTECT_FREE=1 option in ElectricFence should completely detect all reads or writes using stale pointers. So I think that possible cause of the bug has been ruled out.
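
For readers unfamiliar with the idea, here is a minimal sketch of the general protect-on-free technique, assuming Linux/POSIX. It is not ElectricFence's actual code, and all function names are invented for illustration. Each allocation gets its own page, and "freeing" revokes access with mprotect() instead of returning the page, so a later access through a stale pointer faults immediately.

```cpp
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <csignal>

// Allocate one whole page per request so it can be protected individually.
void* page_alloc() {
    long page = sysconf(_SC_PAGESIZE);
    void* p = mmap(nullptr, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// "Free" by revoking all access instead of unmapping. The address stays
// reserved, so a stale pointer faults rather than hitting reused memory.
bool page_free_protected(void* p) {
    long page = sysconf(_SC_PAGESIZE);
    return mprotect(p, page, PROT_NONE) == 0;
}

// Perform a stale write in a child process and report whether the child
// was killed by SIGSEGV.
bool stale_write_faults() {
    char* p = static_cast<char*>(page_alloc());
    if (!p) return false;
    p[0] = 'x';                          // valid while allocated
    if (!page_free_protected(p)) return false;
    pid_t child = fork();
    if (child < 0) return false;
    if (child == 0) {
        volatile char* stale = p;
        stale[0] = 'y';                  // use after free: expected to fault
        _exit(0);                        // reached only if the write went through
    }
    int status = 0;
    if (waitpid(child, &status, 0) < 0) return false;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

Because freed pages are never handed back in this scheme, the process keeps accumulating mappings, which fits the max_map_count failure seen earlier.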

sundialsvcs · 08-05-2014, 08:08 PM

John, my too-well seasoned "take" on this sort of situation is that "there is almost-certainly nothing wrong with" the particular piece of code that you're looking at. It is, as you well know, "absolutely certain" that "the bug is not here."

Nope ... your assessment is (unfortunately) entirely correct: something, somewhere, is "absolutely trashing memory," well before the point of failure. It could be "anywhere, anytime." But I'll bet my bottom dollar that: (a) it will be related to freeing things, and (b) if this application ever was stable, it will be related to a change made since that time.

Let me now relate to you my "$10,000 (hard dollars) tale of woe." It looked like this:

Code:

a.free;
b.free;
a := nil;
b := nil;

(in Delphi ... in code that I had purchased ...) and the solution looked like this:

Code:

a.free;
a := nil;
b.free;
b := nil;

The root cause of this error was many source-files away, in code that was "thoroughly tested." (Really!) I know perfectly well that you do, of course, see the well-hidden problem . . . as did I . . . as did the vendor that I could not in any good conscience sue . . . too late, $$much$$ too late.
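
The post leaves the actual cause unstated. One plausible reading, sketched here in C++ with hypothetical names (A, B, g_a, g_b), is that cleanup code triggered by freeing b consults a, and in the original order a still holds a freed address at that moment. The sketch only compares the pointer and never dereferences it, so both orders can be run safely.

```cpp
struct A { int value = 1; };

A* g_a = nullptr;            // hypothetical shared pointer, as in the tale
bool g_b_saw_a = false;      // records what B's destructor observed

struct B {
    ~B() {
        // Cleanup code far from the call site checks the other pointer.
        // Real code would go on to dereference g_a if it is non-null,
        // whether or not the object behind it has already been freed.
        g_b_saw_a = (g_a != nullptr);
    }
};

B* g_b = nullptr;

// Fixed order: each pointer is nulled immediately after its object is freed.
void teardown_fixed() {
    delete g_a;
    g_a = nullptr;
    delete g_b;              // B's destructor sees g_a == nullptr and skips it
    g_b = nullptr;
}

// Original order: g_a still holds a freed address while B's destructor runs.
void teardown_original() {
    delete g_a;
    delete g_b;              // B's destructor sees a non-null, already-freed g_a
    g_a = nullptr;
    g_b = nullptr;
}
```

With teardown_original() the destructor believes a is still usable. In real code that belief turns into a write through a stale pointer, and the damage shows up somewhere unrelated.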

That is what you are dealing with now. And I would be delighted to be able to promise that some sort of electric tool will help you find it.

(Sometimes, computer-programming for-a-living s-u-c-k-s ...)