GCC update to another version

selfprogrammed · 03-20-2024, 06:18 AM

I am now up to the 49th version of the modified code, and still cannot identify exactly what is happening. It could still be a compiler issue, or possibly a strange coding error. I have been over that code so many times that I think I would have found a coding error by now.

It is still consistent in where it faults and how it faults. I have added more instrumentation to check possible faults.
The one deallocation fault is still there, and is still happening around 2 times per 2 hours (it will not occur sooner), with an error message of "corrupted or double deallocation" when trying to free some particular vectors. These vectors are subject to a swap operation, from another thread, protected by a Lock.

I have detected that another vector segfault occurred after several layers of instrumentation had verified the object repeatedly. Examination revealed an object with random data. It appears that the "this" ptr was corrupted in the middle of the function, so I have added instrumentation to detect that.
Of course, the latest runs do not detect anything yet.
That is why I must still consider that this may be a compiler fault.

I have obtained a copy of gcc 12.3. I just have to figure out how to install it without compromising the existing GCC package.

kgha · 03-20-2024, 06:22 AM

Quote:

Originally Posted by selfprogrammed

I just have to figure out how to install it without compromising the existing GCC package.

See https://gcc.gnu.org/faq.html#multiple

selfprogrammed · 03-22-2024, 06:04 AM

I know this is going on and on and is becoming an exercise in discovering what the voxelands programmers did.
I did a test of the memory allocation.
I have replaced several of the std::vector uses with a derived version with some instrumentation in it.
I added to that a test array allocation of a small array of bytes.

The program now faults on those allocations and deallocations, in about 2 seconds, with the same kind of messages I was getting before.
There is very little that I can see that could go wrong with this allocation and deallocation. I even NULL the ptr after delete.
The difference between this and the actual vector is that I deallocate and reallocate with every length change, so as to find allocation problems much sooner.
The size of the byte array is the same as the length of the vector, about 16 bytes.

This program creates a structure that has new data for the database. It uses a thread to do the actual update, and that thread does the deallocation of the update data, which was originally created in the client program.

In this environment (Linux), are stack allocations and deallocations thread safe, and can memory be deallocated in a different thread than the one that allocated it?
My experience is with different hardware, and it could be either valid or absolutely NOT.
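
For heap memory obtained with new or malloc (which is what the byte array and std::vector use by default), Linux with glibc and libstdc++ allows a block to be allocated in one thread and freed in another, and the allocator itself is thread safe. What it does not protect against is two threads freeing or touching the same block without coordination. A minimal sketch of a clean hand-off follows; UpdateData and apply_update_in_worker are made-up names for illustration, not voxelands code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the update data built by the client thread.
struct UpdateData {
    std::vector<unsigned char> bytes;
};

// Builds the data in the calling thread, hands sole ownership to a worker
// thread, and lets the worker free it. Returns the byte count the worker saw.
std::size_t apply_update_in_worker(std::size_t n)
{
    auto data = std::make_unique<UpdateData>();
    data->bytes.assign(n, 0xAB);

    std::size_t seen = 0;
    std::thread worker([owned = std::move(data), &seen]() mutable {
        seen = owned->bytes.size();
        owned.reset();  // freed here, in a different thread than the new
    });
    worker.join();      // join() makes the worker's write to 'seen' visible

    assert(!data);      // the creating thread no longer holds the pointer
    return seen;
}
```

The point of moving a std::unique_ptr into the worker is that only one thread can hold the pointer at any time, so a double deallocation cannot come from the hand-off itself.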

selfprogrammed · 03-28-2024, 08:41 AM

I have the gcc 12.3 compiler installed. I had it compile itself 3 times, just to make sure that the last two builds were the same. It took around 5 to 6 hours of computer time. After all that, I am taking it as a confirmation that my machine does not have a physical fault.

Have not had the chance to try it out as I have found a whole new problem with the code.
It threw up another error that I had not seen before, and so I had to investigate. Now I am stuck trying to deal with it.

In another file, the program has an array of "Mesh" blocks, and is trying to use "memcpy" to copy part of the array to another place.
These contain the std::vector that is such a problem. Those std::vector have some internal allocation data structure that is faulting in the destructor.
The compiler is putting out a "warning" message about the memcpy, and says to use something else for this structure.
That is not easy to do.

From my analysis of the stack at the fault (I get about 4 or 5 of these to analyze every day), it would be entirely consistent that it had been copied using memcpy from some other source.
The internal ptrs of my debugging copies are wrong for the current instance of the data.

I expect that in some previous compiler version that an array of such classes could be copied using memcpy, and now they cannot.
With all this behind-the-back allocation and stl secret data, they are just making everything more and more fragile.
It is no wonder that it looks like the compiler is part of the problem. In a way, it is the stl implementation that comes with the compiler.

BrunoLafleur · 03-28-2024, 09:02 AM

Quote:

Originally Posted by selfprogrammed

I have the gcc 12.3 compiler installed. I had it compile itself 3 times, just to make sure that the last two builds were the same. It took around 5 to 6 hours of computer time. After all that, I am taking it as a confirmation that my machine does not have a physical fault.

Have not had the chance to try it out as I have found a whole new problem with the code.
It threw up another error that I had not seen before, and so I had to investigate. Now I am stuck trying to deal with it.

In another file, the program has an array of "Mesh" blocks, and is trying to use "memcpy" to copy part of the array to another place.
These contain the std::vector that is such a problem. Those std::vector have some internal allocation data structure that is faulting in the destructor.
The compiler is putting out a "warning" message about the memcpy, and says to use something else for this structure.
That is not easy to do.

From my analysis of the stack at the fault (I get about 4 or 5 of these to analyze every day), it would be entirely consistent that it had been copied using memcpy from some other source.
The internal ptrs of my debugging copies are wrong for the current instance of the data.

I expect that in some previous compiler version that an array of such classes could be copied using memcpy, and now they cannot.
With all this behind-the-back allocation and stl secret data, they are just making everything more and more fragile.
It is no wonder that it looks like the compiler is part of the problem. In a way, it is the stl implementation that comes with the compiler.

Yes, it is probably a good catch. But in C++, memcpy has never been a good way (even with very, very old compilers) to copy objects, because if objects have class members and/or pointers, only the pointers are copied and not the data themselves. Duplicating pointers is bad practice (because of aliasing, and because we could deallocate via one copy and forget about the other copy; threading adds some more complexity). Each pointer has its own semantics, which depend on the class that uses it. So for copying we must rely on constructors and destructors, even for arrays of objects.

And you are probably right in saying the memcpy comes from a version where there was no STL. But it is not the fault of the STL but of whoever did the conversion to the STL: the fix was too quick. When porting to the STL, it is necessary to rethink the code, or to rewrite it without the STL (that is possible because we don't always need the complexity and genericity of the STL). More specialized and simpler code is often enough.
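
To make the memcpy point concrete, here is a small sketch, assuming a mesh record that owns a std::vector. MeshBlock and copy_blocks are hypothetical names, not the real voxelands types. A byte copy would duplicate the vector's internal pointers, so two objects would later free the same buffer; an element-wise copy gives each destination its own buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a mesh record that owns heap memory.
struct MeshBlock {
    int id = 0;
    std::vector<int> indices;
};

// A type holding a std::vector is not trivially copyable, which is the
// property memcpy needs. This check fails the build if that ever changes.
static_assert(!std::is_trivially_copyable<MeshBlock>::value,
              "MeshBlock must not be copied with memcpy");

// Copies 'count' elements with copy assignment, so every destination
// element ends up with its own separately allocated buffer.
void copy_blocks(const MeshBlock* src, MeshBlock* dst, std::size_t count)
{
    assert(count == 0 || (src != nullptr && dst != nullptr));
    std::copy(src, src + count, dst);
}
```

std::copy is used here only as one standard way to invoke the copy assignment operator per element; a plain loop would do the same.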

selfprogrammed · 04-05-2024, 03:34 PM

Compiled with the gcc 12.3 compiler. Of course the behavior changed, but I have still seen at least one of the same deallocation faults.

I am not sure about the memcpy vs. class problem. Because this program uses slight variations of the same naming, it is easy to mistake one class for another.
The compiler was complaining about using memcpy on this array of class instances.
This particular class content was entirely integer codings. It was referring to the mesh block only by id number (index).
It looks like the type of class that could be copied using memcpy, if the compiler was not doing something sneaky.
It did have a constructor, and a copy constructor, which it did not get to use where memcpy was used. That could be what is annoying the compiler.

This project has had too many maintainers, and was forked several times.
But, I think it was always C++ and is so thick with STL, that I think that it was always stl based.
What I was saying is that the program is from a previous version of gcc, where the stl implementation and compiler checking were such that memcpy might have worked without problems. I think they would have verified that at the time that code got written.
Due to the sneaky extra fields saved by the new implementations, there have been usages that have become broken. It is usually some line in the compiler release notes saying that stl containers can no longer do such and such. Hardly anybody ever catches this in their code from reading the compiler release notes.
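
Whether memcpy is allowed for a class can be checked at compile time with std::is_trivially_copyable. A hand-written copy constructor is enough to make an all-integer class non-trivial, which is most likely what the warning is about (in GCC 8 and later it is probably -Wclass-memaccess, though that is an assumption, since the exact message is not quoted here). The two structs below are hypothetical examples, not the voxelands class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical all-integer record with a hand-written copy constructor.
struct BlockRefUser {
    int id;
    int flags;
    BlockRefUser() : id(0), flags(0) {}
    BlockRefUser(const BlockRefUser& o) : id(o.id), flags(o.flags) {}
};

// Same fields, but the copy operations are left to the compiler.
struct BlockRefPlain {
    int id;
    int flags;
};

static_assert(!std::is_trivially_copyable<BlockRefUser>::value,
              "a user-provided copy constructor makes the type non-trivial");
static_assert(std::is_trivially_copyable<BlockRefPlain>::value,
              "plain integer fields may be copied byte by byte");

// memcpy is only well defined for the trivially copyable variant.
void copy_plain(const BlockRefPlain* src, BlockRefPlain* dst, std::size_t n)
{
    if (n == 0) {
        return;
    }
    assert(src != nullptr && dst != nullptr);
    std::memcpy(dst, src, n * sizeof(BlockRefPlain));
}
```

If the hand-written copy constructor only copies the fields, removing it (or declaring it "= default") would make the class trivially copyable again; otherwise the array should be copied element by element.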

selfprogrammed · 05-23-2024, 08:56 PM

I am updating this because it has gotten weirder. I am still hoping that someone will test voxelands on their machine and report whether it behaves as I describe or not.

On the current modification, e57, I have added some modified canary checking to allow me to identify the source of the problem. I am more convinced that there is some wild write somewhere, but in a program this large, all I can see are the targets getting hit.

The latest is a segfault on mesh.

Code:

   if( mesh != NULL ) {
       mesh->drop();  // segfaults here, gdb showed that mesh = 0
   }

How it managed to reach that stmt with a NULL, I have not discovered.
It happened a second time, too. That exact same stmt, but this time mesh was 0x23, and the rest of the "this" structure was random junk, including the canaries.
I could not see where that happened, or what had happened.
It must have been called originally with a valid this ptr, because there were checks on valid canaries at the start of that function.

- this ptr could have been trashed.
- the this structure got overwritten with junk
- the stack was hit, and all the function temps are trashed
- hardware CPU fault
- memory fault

Note that other programs are not showing such weird behavior.
I ran a memory test for most of the night a couple months ago.
It hit the exact same line in the same function twice now, something that generally does not happen with hardware faults.

Overall, the program is working much better than before. I can now run it for hours without a segfault (but with significant database oddities).
As I am only adding canaries and debugging code, I cannot see what is fixing anything.
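
For readers unfamiliar with the technique, the canary checking mentioned above can be sketched roughly as follows. GuardedMesh and its marker values are invented for illustration; the actual instrumentation in the modified voxelands code is not shown in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical guarded object: fixed marker values sit before and after the
// payload, so a stray write into the object is likely to disturb them.
struct GuardedMesh {
    static constexpr std::uint32_t kHead = 0xC0FFEE01u;
    static constexpr std::uint32_t kTail = 0xC0FFEE02u;

    std::uint32_t head = kHead;
    void* mesh = nullptr;        // payload being watched
    std::uint32_t tail = kTail;

    // True while both markers still hold their expected values.
    bool intact() const { return head == kHead && tail == kTail; }

    // Aborts in debug builds when a marker has been overwritten.
    void check() const { assert(intact()); }
};
```

A check at function entry only proves the object was intact at that moment. Repeating check() immediately before the faulting statement narrows the window in which the overwrite happened. Note also that adding such fields changes the object layout, which by itself can move the target of a wild write.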

Changing to the new compiler has not fixed the problem, just new variations on weirdness.
I think the weirdness is somehow related to compiler issues, but what issues and exactly how has not been discovered.
It may be that these compiler instances are not entirely thread safe.
The notable difference from most other programs is that voxelands starts threads to do the database update, while the client program is still accessing the database.
It does use a lock, which brings up the question if the compiler implementation has changed in some way to make the locks fail to protect.

I can see database issues during the run, such as parts of the world not being drawn, and then reappearing.
Also, sometimes I find a block within one of my constructs that has changed type, and I know that I did not do that.

This may just be a badly written program, but I work on similar programs, and if the compiler is failing in some way, I need to know how and what has to be done to avoid the failure.

pan64 · 05-24-2024, 01:53 AM

Quote:

Originally Posted by selfprogrammed

The latest is a segfault on mesh.

Code:

   if( mesh != NULL ) {
       mesh->drop();  // segfaults here, gdb showed that mesh = 0
   }

How it managed to reach that stmt with a NULL, I have not discovered.

And obviously we can't help; posting a single line of code is pointless. This is not the exact location of the problem, but the place where program execution could not continue any more because of an earlier failure.

Quote:

Originally Posted by selfprogrammed

It happened a second time, too. That exact same stmt, but this time mesh was 0x23, and the rest of the "this" structure was random junk, including the canaries.
I could not see where that happened, or what had happened.

Yes, this kind of error cannot be easily detected, because it is just a side effect of some other problem.

Quote:

Originally Posted by selfprogrammed

It must have been called originally with a valid this ptr, because there were checks on valid canaries at the start of that function.

That is irrelevant. It is possible that the this ptr was valid originally, but that value has since been lost.

Quote:

Originally Posted by selfprogrammed

Code:

- this ptr could have been trashed.                          # YES
- the this structure got overwritten with junk.              # YES
- the stack was hit, and all the function temps are trashed  # probably
- hardware CPU fault                                         # NO
- memory fault                                               # NO

Quote:

Originally Posted by selfprogrammed

Note: that other programs are not showing such weird behavior.

It is a typical coding issue; other programs execute other code.

Quote:

Originally Posted by selfprogrammed

Changing to the new compiler has not fixed the problem, just new variations on weirdness.

Obviously coding errors cannot be fixed by replacing the compiler.

Quote:

Originally Posted by selfprogrammed

I think the weirdness is somehow related to compiler issues, but what issues and exactly how has not been discovered.
It may be that these compiler instances are not entirely thread safe.

That is definitely wrong. You can't prove it is a compiler error (in a reproducible way), and it is more or less impossible. An unreliable compiler would break the whole Linux world (since this compiler is used everywhere).

Quote:

Originally Posted by selfprogrammed

The notable difference from most other programs is that voxelands starts threads to do the database update, while the client program is still accessing the database.
It does use a lock, which brings up the question if the compiler implementation has changed in some way to make the locks fail to protect.

And again, it is wrong, it is not a problem with a lock (as long as you cannot prove it).

Quote:

Originally Posted by selfprogrammed

This may just be a badly written program, but I work on similar programs, and if the compiler is failing in some way, I need to know how and what has to be done to avoid the failure.

It is definitely a badly written program. I wrote a lot of programs already (too), some of them were better, others were just badly written non-working trials.

In your case, it is still an illegal memory write: your code overwrites an area of memory that belongs to other parts of your code or to other variables. That's all. Valgrind is definitely a good tool to catch such writes.
You need to identify where and when this pointer (memory area) was overwritten. It is definitely hard, because it is not intentional.

henca · 05-24-2024, 02:21 PM

Quote:

Originally Posted by selfprogrammed

The latest is a segfault on mesh.

Code:

   if( mesh != NULL ) {
       mesh->drop();  // segfaults here, gdb showed that mesh = 0
   }

How it managed to reach that stmt with a NULL, I have not discovered.

Does your program have multiple threads? If some other thread sets that mesh variable to NULL between the if-line and the call to the drop function you might need to protect the mesh variable with a mutex or some other kind of locking mechanism.

Otherwise, what does that drop function do? Is the drop function capable of altering the mesh variable?

regards Henrik
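
A minimal sketch of that kind of protection, assuming the pointer lives in some owning object that all threads go through. Mesh, MeshHolder and release_mesh are hypothetical names, not taken from voxelands or Irrlicht. The important part is that the test and the call sit under the same lock.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a reference-counted mesh.
struct Mesh {
    int refs = 1;
    void drop()
    {
        assert(refs > 0);
        --refs;
    }
};

// Hypothetical owner of the shared pointer.
struct MeshHolder {
    std::mutex lock;
    Mesh* mesh = nullptr;

    // Test and use happen under one lock, so no other thread that takes
    // the same lock can clear 'mesh' between the check and the call.
    bool release_mesh()
    {
        std::lock_guard<std::mutex> guard(lock);
        if (mesh == nullptr) {
            return false;
        }
        mesh->drop();
        mesh = nullptr;
        return true;
    }
};
```

This only helps if every thread that reads or writes the mesh pointer takes the same mutex; a lock used by one side only protects nothing.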

BrunoLafleur · 05-24-2024, 03:32 PM

Quote:

Originally Posted by henca

Does your program have multiple threads? If some other thread sets that mesh variable to NULL between the if-line and the call to the drop function you might need to protect the mesh variable with a mutex or some other kind of locking mechanism.

Otherwise, what does that drop function do? Is the drop function capable of altering the mesh variable?

regards Henrik

Yes, drop is a reference-counted delete of the object pointed to. So when the count reaches 0, the mesh is really freed.

I have tried to run valgrind on a voxelands binary compiled with -g. It works and prints the usual log, but is very slow as usual. I got some messages, but nothing like an illegal read or write. But it probably cannot simulate enough time inside the world to catch the bug. It could also be a lack of guarding between threads: one thread deallocates and another reads the deallocated area.

There are a lot of drop calls in the code. They come from the Irrlicht library, which voxelands uses. The Irrlicht headers warn about some misuses of the drop method.

pan64 · 05-25-2024, 03:13 AM

valgrind has an option to do those checks, see here: https://stackoverflow.com/questions/...e-in-c-program
without that it won't do that.
https://courses.cs.vt.edu/cs3214/fal...pc-manual.html
And obviously we have other checkers, like clang.
Installing, configuring and running one or more slow checker(s) is still much faster than to find it without them. Especially in case of external code. Oh yes, and more reliable.

henca · 05-25-2024, 06:46 AM

By default, if no --tool is given to valgrind, valgrind will search for memory errors like uninitialized memory or use after free.

--tool=helgrind might be useful to search for race conditions.

There are quicker ways than valgrind to debug both memory errors and race conditions. Clang has already been mentioned, but I would also like to mention the sanitizer options in gcc. In your case, the address sanitizer and the thread sanitizer might be useful. The drawback of those sanitizers is that your program will have to be recompiled and you can only compile with one chosen sanitizer each time.
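
As a rough illustration of how the sanitizers are used, the small program below can be built with either one. The flags -fsanitize=thread and -fsanitize=address are real gcc options; the file name demo.cpp, the RACY macro and the function name are made up for this sketch.

```cpp
// Build with one sanitizer at a time, for example:
//   g++ -g -O1 -fsanitize=thread  -pthread demo.cpp   (data races)
//   g++ -g -O1 -fsanitize=address -pthread demo.cpp   (heap misuse)
// Adding -DRACY removes the lock, so the thread sanitizer has a race to report.
#include <cassert>
#include <mutex>
#include <thread>

// Two threads each add 'per_thread' to a shared counter.
long count_with_two_threads(long per_thread)
{
    assert(per_thread >= 0);
    long counter = 0;
    std::mutex lock;
    auto work = [&] {
        for (long i = 0; i < per_thread; ++i) {
#ifndef RACY
            std::lock_guard<std::mutex> guard(lock);
#endif
            ++counter;
        }
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    return counter;
}
```

Without -DRACY the function is correctly locked and always returns twice its argument. With -DRACY the unsynchronized increments are a data race, which is the kind of defect the thread sanitizer is designed to point at with a stack trace for each conflicting access.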

regards Henrik

BrunoLafleur · 05-25-2024, 08:18 AM

Quote:

Originally Posted by pan64

valgrind has an option to do those checks, see here: https://stackoverflow.com/questions/...e-in-c-program
without that it won't do that.
https://courses.cs.vt.edu/cs3214/fal...pc-manual.html
And obviously we have other checkers, like clang.
Installing, configuring and running one or more slow checker(s) is still much faster than to find it without them. Especially in case of external code. Oh yes, and more reliable.

Valgrind is configured correctly. But as said in this thread, the bug takes time to manifest during play. Because valgrind is very slow, the bug cannot manifest easily and it is very unlikely to be seen this way. From what I see above and when running it, it could be a thread problem: some mutex or atomic variables are not used where they should be. It is not so easy to diagnose. But a segfault in a reference-counted delete method could be a symptom. Note that the compiler is not responsible for that. It is up to the programmer to take care of how the program accesses shared variables in thread-safe software.

Here I will track down the drop method (I don't remember where it is defined) and verify whether it is thread safe. If not, I could protect the counter.
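
Protecting the counter could look roughly like this. It is a sketch of the general idea using std::atomic, with a made-up class name; it is not Irrlicht's actual reference-counting class, whose definition I have not checked here.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical thread-safe variant of a grab()/drop() reference counter.
// It is not Irrlicht's class; it only shows the counter being protected.
class SafeRefCounted {
public:
    virtual ~SafeRefCounted() = default;

    // Takes one more reference.
    void grab() const { refs_.fetch_add(1, std::memory_order_relaxed); }

    // Gives up one reference. Returns true when this call released the
    // last reference and deleted the object.
    bool drop() const
    {
        int before = refs_.fetch_sub(1, std::memory_order_acq_rel);
        assert(before > 0);
        if (before == 1) {
            delete this;
            return true;
        }
        return false;
    }

    int count() const { return refs_.load(std::memory_order_acquire); }

private:
    mutable std::atomic<int> refs_{1};
};
```

An atomic counter only guarantees that exactly one drop() sees the count reach zero. It does not protect a shared pointer variable that one thread tests while another clears it; that still needs a mutex around the test and the call.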

pan64 · 05-25-2024, 09:48 AM

It will definitely take time, but as I said, it can still be much faster to catch it with valgrind. The slowness means (unfortunately) that this is a rare case, an extremely difficult situation to figure out.
For multithreaded programs you have to use multithreaded libraries or take care of all the possible thread conflicts yourself.
In a debugger you can watch a variable and catch every read/write access to it; that will probably help.

henca · 05-25-2024, 07:07 PM

One more nice debugger is rr: https://slackbuilds.org/repository/15.0/development/rr/

With its chaos mode it is rather likely to trigger race conditions, and once one is triggered you can use a gui to step not only forward, but also backward in the recorded execution of the program. You really want a gui for rr; I have used https://www.gdbgui.com/ myself, but that gui does not have a slackbuild at slackbuilds.org.

regards Henrik