LinuxQuestions.org
Old 08-05-2014, 10:42 AM   #1
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1197
Using ElectricFence ??


I'm trying to use ElectricFence-2.2.2-28.el6.x86_64 to diagnose a memory clobber in a giant multi-threaded program.

Without ElectricFence, about a third of the time, the program seg faults after completion of its work, during destructors of long term objects. It appears that the malloc control structures between allocations have been clobbered, so consolidating free space crashes. But the addresses are different every time, so I can't backtrack to the cause.

Using ElectricFence with default settings, or with EF_PROTECT_BELOW=1, causes the program to complete (slowly) without any faults every time.

So I tried with EF_PROTECT_FREE=1.
As soon as the program switches to the serious multi-threaded section (after quite a lot of initial processing) it aborts with:
ElectricFence Exiting: mprotect() failed: Cannot allocate memory

I switched to a machine with 256GB of RAM and 128GB of swap. No swap space was ever used. The failure was the same. I understand EF_PROTECT_FREE=1 would tend to make a 32-bit executable exhaust its address space. But this is 64-bit, so it isn't exhausting address space. It sure isn't exhausting physical RAM. Is EF_PROTECT_FREE=1 broken for multi-threaded allocations?

Does ElectricFence actually work? This is my first usage of it. I find it hard to believe the memory clobber I'm chasing is one that would not be found by the documented behavior of ElectricFence's default settings. Yet the fault is totally hidden (no seg fault during the destructors) instead of being revealed.


Any suggestions?

Last edited by johnsfine; 08-05-2014 at 10:43 AM.
 
Old 08-05-2014, 11:41 AM   #2
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,780

Rep: Reputation: 2081
Haven't used it myself, but according to Wikipedia:
Quote:
eFence[dead link] – source code – not thread safe
That page also mentions a fork, DUMA. Its README has some references to thread local storage, so presumably it is thread safe.

That README also mentions max_map_count:
Quote:
https://www.kernel.org/doc/Documentation/sysctl/vm.txt

This file contains the maximum number of memory map areas a process
may have. Memory map areas are used as a side-effect of calling
malloc, directly by mmap and mprotect, and also when loading shared
libraries.

While most applications need less than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation.

The default value is 65536.
Possibly you are exhausting this limit.
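
To see how a malloc debugger can run into that limit, here is a minimal C++ sketch, assuming Linux with /proc mounted. It is not ElectricFence's code; the function names are made up for illustration. It counts the process's map areas from /proc/self/maps, reads the limit, and shows that protecting every second page of one mapping splits it into roughly one area per page.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <sys/mman.h>
#include <unistd.h>

// Number of memory map areas this process has right now: one line per
// area in /proc/self/maps. Returns -1 if the file cannot be read.
long count_maps() {
    std::ifstream maps("/proc/self/maps");
    if (!maps) return -1;
    long n = 0;
    std::string line;
    while (std::getline(maps, line)) ++n;
    return n;
}

// Current value of vm.max_map_count, or -1 if it cannot be read.
long max_map_count() {
    std::ifstream f("/proc/sys/vm/max_map_count");
    long limit = -1;
    if (!(f >> limit)) return -1;
    return limit;
}

// Hypothetical experiment: map `pages` pages in one call, then mprotect()
// every second page to PROT_NONE. Returns how many extra map areas the
// mprotect() calls created, or -1 on error.
long areas_added_by_alternating_protection(std::size_t pages) {
    assert(pages >= 2);
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t len = pages * page;
    void* base = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return -1;

    const long before = count_maps();
    char* p = static_cast<char*>(base);
    for (std::size_t i = 1; i < pages; i += 2) {
        // Each change of protection forces the kernel to split the area.
        if (mprotect(p + i * page, page, PROT_NONE) != 0) {
            munmap(base, len);
            return -1;
        }
    }
    const long after = count_maps();
    munmap(base, len);
    if (before < 0 || after < 0) return -1;
    return after - before;
}
```

Comparing count_maps() against max_map_count() just before the abort would confirm or rule out this explanation for the mprotect() failure.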
 
Old 08-05-2014, 12:36 PM   #3
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Original Poster
Rep: Reputation: 1197
Quote:
Originally Posted by ntubski View Post
Possibly you are exhausting this limit.
Thank you. That answered one question. I fixed the failure in EF_PROTECT_FREE=1 with the following (as root):
Code:
echo 999999 > /proc/sys/vm/max_map_count
So that setting now acts like the other two. It completely hides the memory clobber symptom.

Valgrind gave similar results. It produces hundreds of what look like false reports of memory issues (none of them are incorrect writes, so even if they aren't false, none explains the seg fault), and so far (testing is slow) no seg faults occur when the program is run inside Valgrind.

Last edited by johnsfine; 08-05-2014 at 12:54 PM.
 
Old 08-05-2014, 01:11 PM   #4
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3941
You have a timing-hole problem in addition to an allocation problem. Consider doing something like setting a mutex or somesuch around the destructor logic to force this cleanup process to occur in series instead of racing.

What's probably happening is that something is being freed without immediately setting that pointer to NULL, resulting in a subsequent reference to a stale pointer. When the pointer is still valid, nothing happens; when it is now-freed storage, a segfault occurs. (But because the pointer is stale, it could have been used to irrevocably clobber something else, so that the root cause of the problem can no longer be found.)

Examine the destructors carefully, including all the inheritance chains relating to destructors, and ensure that, wherever anything is disposed of, the n-e-x-t statement sets that pointer to NULL whether it seems that you "need to" or not.
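
One way to make that habit hard to forget, sketched in C++ on the assumption that the objects are plain heap objects released with delete. The helper and the Widget type are hypothetical, not from the program under discussion. This variant clears the pointer just before destroying the object rather than in the next statement, so even the destructor cannot see a stale value.

```cpp
#include <cassert>

// Hypothetical helper, not from the thread: clear the caller's pointer and
// destroy the object in one call, so no statement can slip in between.
template <typename T>
void destroy_and_null(T*& ptr) {
    T* doomed = ptr;   // keep the address only in a local
    ptr = nullptr;     // the caller's pointer is already null...
    delete doomed;     // ...while the destructor runs; deleting null is a no-op
}

struct Widget { int value = 42; };  // hypothetical stand-in type

// Usage: a repeated destroy is harmless because the pointer is null.
bool double_destroy_is_harmless() {
    Widget* w = new Widget;
    destroy_and_null(w);
    assert(w == nullptr);
    destroy_and_null(w);   // no double free
    return w == nullptr;
}
```

This only protects the one pointer passed in; any other copies of the address stay stale, which is why the copies still have to be hunted down by hand.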

I have never had particularly good success with memory-fence tools like this because, at best, they can only discover "there's a smoking crater here." They can't predict why that crater exists nor how the memory got to be that way. But it's a dead-certainty that a stale pointer issue will be the root cause.

If you know of any point in the past where the app was reliable, and you have a complete version-control history of it, you might be able to look at the change-logs, scanning that delta'd source code for destructors.
 
Old 08-05-2014, 02:43 PM   #5
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Original Poster
Rep: Reputation: 1197
Quote:
Originally Posted by sundialsvcs View Post
Consider doing something like setting a mutex or somesuch around the destructor logic to force this cleanup process to occur in series instead of racing.
Actually the destructor that fails is in an entirely single-threaded part of the program, as was the constructor of the object whose destructor fails. All of that code (construction and destruction of the long-term objects) has been used so many times that there is no possibility the fault lies in the code where the symptom appears.

Some random write in an unrelated (and almost certainly multi-threaded) section of the program is trashing something that is then symptom free until that destructor.

Quote:
What's probably happening is that something is being freed without immediately setting that pointer to NULL,
The EF_PROTECT_FREE=1 option in ElectricFence should completely detect all reads or writes using stale pointers. So I think that possible cause of the bug has been ruled out.
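
The general idea behind that kind of detection can be sketched in C++ as follows, assuming Linux/POSIX. This is not ElectricFence's actual code, and the function name is made up. A "freed" page stays mapped but gets PROT_NONE, so any later access through a stale pointer faults at once. The stale access is done in a forked child so the caller survives to report the result.

```cpp
#include <cassert>
#include <csignal>
#include <cstddef>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns true if reading a page that was "freed" by setting it to
// PROT_NONE kills the reading process with SIGSEGV.
bool stale_access_faults() {
    const long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);
    const std::size_t page = static_cast<std::size_t>(page_size);

    pid_t child = fork();
    if (child < 0) return false;
    if (child == 0) {
        std::signal(SIGSEGV, SIG_DFL);  // make sure the fault is fatal here
        void* mem = mmap(nullptr, page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED) _exit(2);
        volatile char* p = static_cast<volatile char*>(mem);
        p[0] = 'x';                                   // fine while "allocated"
        if (mprotect(mem, page, PROT_NONE) != 0) _exit(3);  // the "free"
        char c = p[0];                                // stale read: should fault
        (void)c;
        _exit(0);                                     // reached only if no fault
    }
    int status = 0;
    if (waitpid(child, &status, 0) != child) return false;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

Because each freed block keeps its own protected pages, this approach costs map areas rather than RAM, which fits the max_map_count failure seen earlier.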
 
Old 08-05-2014, 08:08 PM   #6
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3941
John, my too-well seasoned "take" on this sort of situation is that "there is almost-certainly nothing wrong with" the particular piece of code that you're looking at. It is, as you well know, "absolutely certain" that "the bug is not here."

Nope ... your assessment is (unfortunately) entirely correct: something, somewhere, is "absolutely trashing memory," well before the point of failure. It could be "anywhere, anytime." But I'll bet my bottom dollar that: (a) it will be related to freeing things, and (b) if this application ever was stable, it will be related to a change made since that time.

Let me now relate to you my "$10,000 (hard dollars) tale of woe." It looked like this:
Code:
a.free;
b.free;
a := nil;
b := nil;
(in Delphi ... in code that I had purchased ...) and the solution looked like this:
Code:
a.free;
a := nil;
b.free;
b := nil;
The root cause of this error was many source-files away, in code that was "thoroughly tested." (Really!) I know perfectly well that you do, of course, see the well-hidden problem . . . as did I . . . as did the vendor that I could not in any good conscience sue . . . too late, $$much$$ too late.
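
The post does not spell out the hidden problem, so the following C++ sketch is only one guess at how the ordering can matter: if b's destructor looks at a, then in the first ordering it runs while a still holds the address of freed storage. All names here are hypothetical, and the sketch only records that the stale pointer was visible instead of dereferencing it.

```cpp
#include <cassert>

struct A { int payload = 7; };

A* g_a = nullptr;               // hypothetical shared pointer to an A
bool g_b_saw_stale_a = false;   // set if ~B found g_a non-null

struct B {
    ~B() {
        // Real code might dereference g_a here (a use-after-free);
        // the sketch only notes that it would have been tempted to.
        if (g_a != nullptr) g_b_saw_stale_a = true;
    }
};

B* g_b = nullptr;

// First ordering: free both, then null both.
bool free_both_then_null() {
    g_a = new A;
    g_b = new B;
    g_b_saw_stale_a = false;
    delete g_a;
    delete g_b;        // ~B runs while g_a still holds the freed address
    g_a = nullptr;
    g_b = nullptr;
    return g_b_saw_stale_a;
}

// Second ordering: null each pointer right after freeing it.
bool free_and_null_each() {
    g_a = new A;
    g_b = new B;
    g_b_saw_stale_a = false;
    delete g_a;
    g_a = nullptr;     // ~B will now see a null pointer
    delete g_b;
    g_b = nullptr;
    return g_b_saw_stale_a;
}

// Small self-check of the two orderings.
void check_orderings() {
    assert(free_both_then_null());
    assert(!free_and_null_each());
}
```

In the first ordering the destructor sees a pointer that looks valid, which is the kind of access that can trash memory long before anything crashes.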

That is what you are dealing with now. And I would be delighted if I could promise that any sort of electric tool will help you find it.

(Sometimes, computer-programming for-a-living s-u-c-k-s ...)
 
  


All times are GMT -5. The time now is 11:29 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration