LinuxQuestions.org
Old 05-27-2024, 04:43 PM   #61
selfprogrammed
Member
 
Registered: Jan 2010
Location: Minnesota, USA
Distribution: Slackware 13.37, 14.2, 15.0
Posts: 641

Original Poster
Rep: Reputation: 156

The point of that particular piece of code:
Code:
   if( mesh != NULL ) {
       mesh->drop();  // segfaults here, gdb showed that mesh = 0
   }
1. The segfault occurred before calling drop.
2. There is no reason to believe that there was some usage of mesh after drop was called and still within that line. Such is not evident in the source code.
3. That drop may destroy much structure within mesh, but it should not be able to touch the "this" pointer itself. I wish I could say that it cannot, but I can only say that I do not know of any way that the drop function, no matter what it does, could possibly damage the source of its "this" ptr. That this ptr would at that time be in a register, or passed on the stack as a parameter, and even having drop trying to modify "this" should not touch it.
4. The test for NULL came immediately before the usage that segfaulted. What could possibly have happened between the test and the usage? There are very few possibilities, and the most obvious is that another thread hit the stack variables.
5. The most likely issue is some compiler implementation issue with code written for a much older version.

---
I have instrumented a derived std::vector with canaries and tracing. This is almost as good as what Valgrind would do, but it cannot keep freed blocks from being reused. I might be able to simulate that too, but not detect it. I have markers on the canaries for detecting use of a block after it has been freed. They do not trigger. Upon the segfault, I sometimes see that the whole structure is trash data, but the explicit checks in the code for that do not get a chance to catch it.
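A minimal sketch of the canary idea, for readers following along. It is not the instrumentation described above: it wraps a std::vector instead of deriving from it, and every name in it (GuardedVector, kCanaryLive, kCanaryDead, corrupt_front_for_test) is hypothetical. Reading the guards of an object after its destructor has run is formally undefined, so the "dead" marker is a debugging aid only.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical guard patterns; any two distinctive values would do.
constexpr std::uint32_t kCanaryLive = 0xC0FFEE11u;  // object is alive
constexpr std::uint32_t kCanaryDead = 0xDEADBEEFu;  // destructor has run

template <typename T>
class GuardedVector {
public:
    GuardedVector() = default;
    ~GuardedVector() {
        // Mark both guards as dead. The members are volatile so that the
        // compiler does not discard these stores as dead writes.
        front_ = kCanaryDead;
        back_ = kCanaryDead;
    }
    // True while both guard words still hold the "live" pattern.
    bool intact() const { return front_ == kCanaryLive && back_ == kCanaryLive; }
    std::vector<T>& vec() { return data_; }
    // Test hook only: simulates a stray write landing on the front guard.
    void corrupt_front_for_test() { front_ = 0; }

private:
    volatile std::uint32_t front_ = kCanaryLive;  // guard before the vector
    std::vector<T> data_;
    volatile std::uint32_t back_ = kCanaryLive;   // guard after the vector
};
```

Calling intact() at the top of each function that touches the container narrows down when the guards were overwritten, but it only sees writes that land on the guard words themselves.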

It must be that the fatal consequences of "whatever" are rather immediate. It either corrupts the database in a non-detectable way, or it hits the stack with immediate segfault.
This would mean that the trigger should be somewhere in that function, close by to where it is segfaulting.
The problem is that there is too much hidden code, from irrlicht, and from the compiler implementing those std::vector.

---
It is not possible or even reasonable to eliminate memcpy in C++ programs.
I have found it used within the std::vector implementation, and that is all classes and structs.
What is not clearly differentiated in the language (or this version of the language) is exactly how to determine if it is safe to use. The compiler seems to know, or believe, something, that it uses to give warnings (that seem to be overly aggressive in many cases).


---
I would like to try valgrind again, but it is a huge investment in time running it for around 3 hours hoping for a fault that might reveal some information. More likely it will run out of memory again.
If it really needs more SWAP, then it is not practical to run it and play the game at the same time. If it is swapping, it would take days to encounter the first fault, and I would have to be there playing it to stimulate the conditions.
I think that I am maxed out on memory on this machine, but I would have to pull it apart to check the memory slots themselves (not as easy as you may be imagining because it is quite buried under layers of other equipment due to limited space here).

---
Another reason for hoping that someone else runs voxelands is to test against a different software installation, and to see whether the problem is related to anything other than the voxelands source code itself.
Such a test could eliminate huge amounts of consideration and probably the need for this to be a linux-questions consideration.

---
Thank you for the info on "rr" debugger. Do not know when I could get time to install that and learn how to use it. It is another consideration.
I may be able to do something similar by putting sleep stmts with random times into the voxelands code for those threads.
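A minimal sketch of such a random-sleep helper, assuming C++11 threads are available in the build. The name race_jitter and the microsecond range are hypothetical; nothing like it exists in voxelands.

```cpp
#include <chrono>
#include <random>
#include <thread>

// Hypothetical helper: sleep for a random time in [0, max_us] microseconds
// and return the delay that was chosen.
inline std::chrono::microseconds race_jitter(unsigned max_us) {
    // One generator per thread, so the helper itself adds no shared state.
    thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<unsigned> dist(0, max_us);
    const std::chrono::microseconds delay(dist(rng));
    std::this_thread::sleep_for(delay);
    return delay;
}
```

Calls such as race_jitter(500) would go just before and after the places where threads touch shared data, to widen any window in which they can interleave. This changes timing only; it does not record and replay a run the way rr does.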

---
The most significant info I have right now is that the fault consistently manifests in one particular function. There have been a couple of other faults too, but mostly it is that particular function.
Altering the function may make the bug move again.
My input must look like random data to it, so it is hitting this function in spite of input randomness.

---
A new fault that I have been seeing the last few weeks is "bad alloc".
This occurs when trying to allocate a std::vector, usually in that same mesh function.
I am assuming that this is due to the fault hitting the heap and corrupting it.
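A small probe can help tell heap corruption apart from a garbage element count, since a std::vector asked for an absurd size also fails with an exception. The sketch below is an assumption about how one might check that, not code from voxelands, and try_allocate is a hypothetical name. Which of the two exceptions is thrown for an oversized count depends on the standard library version.

```cpp
#include <cstddef>
#include <new>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical probe: try to build a vector of `count` bytes and report
// which failure, if any, occurred.
inline std::string try_allocate(std::size_t count) {
    try {
        std::vector<char> v(count);
        return "ok";
    } catch (const std::length_error&) {
        return "length_error";  // count exceeded vector::max_size()
    } catch (const std::bad_alloc&) {
        return "bad_alloc";     // the allocation could not be satisfied
    }
}
```

Logging the requested size at the point where the exception is caught shows whether the count itself was already trash.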

---
It is bothering me that this problem may be due to a wild write, and that wild write is managing to hit the stack, the heap, and the database, but it is not managing to write using a ptr that would segfault immediately, and thus point directly at the problem code.
Why is that? The bad ptr must not be completely random. So what is it that can hit both heap and stack? Possibly a variable that points to a stack location after it has been reused.
The possible values of the reuse are not completely random.

---
Please note that there have been bugs found in these compilers in the past. They are found by someone digging and digging to find a cause of something strange, and eventually being able to prove it is something that the compiler does under a specific stimulation. It always requires a specific stimulation, and that is why I cannot let go of this voxelands, as it may be the specific stimulation that provokes the compiler implementation to fail.
Whether it gets fixed or just written up as a known issue and a warning to DO-NOT-DO-THIS would be something to be determined much later.

Last edited by selfprogrammed; 05-27-2024 at 05:25 PM.
 
Old 05-28-2024, 01:41 AM   #62
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,109

Rep: Reputation: 7367
Again, if you can prove and show the bug in the compiler, I will believe it; otherwise the problem is definitely not with the compiler, but with the code (despite the fact that compilers can be buggy too).
Yes, there is a memory problem in your code: somewhere you overwrite something in the heap or on the stack, or you use an uninitialized area.
It is happening somewhere else, not at that call to drop, so repeating that line will not help with anything. It is just the location where the code tried to use the corrupted memory and died.
If the code dies before the call (drop) then it is completely irrelevant, it is the variable mesh itself which is corrupted.
Using memcpy is not the standard way in c++, although there can be cases when it is used. You don't need to eliminate it from anywhere, but you need not use it in your code.

Hoping, explaining, guessing and ruminating on these possibilities will not help to solve anything.
Did you try clang? Do you have compiler warnings?
A relatively simple way to improve the situation is to introduce some (large enough) dummy variables (next to mesh) which will force the compiler to generate different code and relocate variables. In such cases the memory corruption may occur at a different address, causing a different issue, or if you are lucky it will alter that dummy variable. But I wouldn't say it is a correct solution.
 
Old 05-30-2024, 05:00 PM   #63
selfprogrammed
Member
 
Registered: Jan 2010
Location: Minnesota, USA
Distribution: Slackware 13.37, 14.2, 15.0
Posts: 641

Original Poster
Rep: Reputation: 156
To pan64: This thread is not about convincing you or anyone else that the compiler is buggy. If you are convinced that there is no chance that the compiler we all rely upon could be doing anything that contributes to this problem, then you can safely stop following this thread.
You are repeating things that already have been tried.


Since I now feel that I need to justify why this thread continues, there will be a slight side-track.
Sorry to those who are not interested.

---
I have encountered buggy compilers before.
We had one back in 1984'ish (I believe that it was a VAX PASCAL compiler) that in one particular function would not execute the way the code was written. However, if DEBUG was enabled, it would generate code that being slightly different would execute perfectly correctly. Try to debug THAT.
There were discussions about the compiler being fixed, and when our work-around was no longer needed.

I have been through all the years of not having compilers, and having to hand code, of just having assemblers, and of having compilers that really needed work.

Some of the symptoms are familiar, especially the way every touch of the problem code area just makes the bug wriggle away to be hidden somewhere else.
What I write here is a summary of the interesting parts of this debugging. If this was just a debugging issue with thread safe code and otherwise bad code, I would not be keeping this thread alive.
If normal appearing code cannot be trusted to execute reasonably, if it has gotchas that we do not know about, then it becomes a safety and saneness issue for everyone using these compilers.

---
I wish the CLANG test had worked.
If anyone wants to contribute something, figure out how to get that slackbuild and CMAKE to accept CLANG as the compiler, and generate working code. I got code that would not execute.
The problem is not CLANG, it is the slackbuild system, and the problem is likely buried in the CMAKE somehow.
It is also not the issue I was trying to debug, and I know about getting side-tracked with things like this.

---
In the latest, I added some tests to detect that m_mesh pointer segfault discussed earlier, to see just how early I could detect a bad m_mesh pointer.
I added a function to detect bad pointers, and several uses of it to the problem function, watching m_mesh.
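As an illustration of what such a check can and cannot do, here is a hedged sketch of a pointer plausibility test. It is not the function used above; the name pointer_looks_plausible and the 4096-byte threshold (an assumed unmapped first page) are hypothetical. It rejects only obviously bad values, so a dangling pointer to freed memory still passes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical heuristic: returns false only for values that cannot be a
// valid object address under the stated assumptions.
inline bool pointer_looks_plausible(const void* p, std::size_t alignment) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    if (addr == 0) return false;     // null
    if (addr < 4096) return false;   // assumed unmapped first page
    if (alignment != 0 && addr % alignment != 0) return false;  // misaligned
    return true;
}
```

A call such as pointer_looks_plausible(m_mesh, alignof(void*)) placed at several points in a function shows roughly where a pointer first goes bad, though adding the calls also moves the generated code around.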
Now the problem has moved and I do not get that segfault anymore (over 2 sessions of debugging).
All the weird database issues, and the erratic display, along with a couple of "bad alloc" faults, remain.

Please note that if a wild write was causing this fault, and was reliably hitting the same function (as seen), then such a simple change would not just cause it to go away.
Adding such instrumentation checks just moved some code around. It should not make the wild write go away, it ought to write close by and in a similar fashion. If it was hitting the m_mesh variable on the stack, it should still be hitting the stack someplace. The code causing the wild write was not changed by the debugging stmts. Perhaps the stack location of m_mesh moved, but some other var would be hit instead, so I should see that now.

So why did the symptom change so radically? Please do not give me a long list of how-to-debug advice, which I have known for longer than most of you. I know about that, and it is not sufficient to reject the suspicious nature of this.

I know that the voxelands code is doing something wrong. But it is something that is not easy to recognize, and needs to be documented as a gotcha. It may be something fixable. The compiler people seem to think that they can fix all kinds of loose coding practices, and after it is all figured out this is likely to be of the same nature. Up until then, I have a badly behaving program that defies debugging it.
That in itself is a compiler design issue. Like trying to debug APL. It becomes an unusable tool that cannot be trusted outside safe usages (like no-threading).

This reminds me of another compiler problem I had encountered, that had an added capability enabled by a compiler feature, that used a register, and the compiler optimization did not know about it, but it was only used where it did not do harm. And then several years later, after expansions of the usage, and several compiler upgrades, it is now a problem and is most difficult to track down. We got lucky on that one due to being able to force it to a locality, and there being a visible entity whose functioning could be questioned and explored.

There appears to be a correlation between adding debug to the code, and the appearance of the faults.
These debug stmts are not changes to the actual game code. It is responding radically to debugging code additions that affect code generation, optimization, and memory locations.
The bug is too sensitive to code generation and code location.

It is possible, also, that there is a ptr, to a ptr, to a random location, that flails wildly with the slightest touch.
But that is also surviving other debugging issues, only to wriggle away the instant a debugging stmt gets too close to detecting it early.
It is unusual for even a ptr to a ptr kind of random ptr generation to behave like this, unless the debugging stmts are right on top of some critical memory reference.

It also must be remembered, that this fault only manifests after considerable play. It is not just ptr to a random location. It does not ever segfault on a bad ptr where I could clearly identify the wild write. Never does the wild write generate a segfault of its own directly, which after this much exercise it ought to have done.
It is stable and generates the same indirect segfaults, and "bad alloc" over several debugging instances. It is stable in its behavior, up until debugging gets close to something sensitive.
It is too sensitive to putting a few debugging stmts between some other stmts, and this makes me suspicious.
So this thread continues.

---
Nay-saying is not helping. If you are like that, then please stop reading this thread.
There is much experience here and someone may recognize a symptom and may be able to help greatly.

I am hoping for some useful help that adds some independent observations of the code behavior under other environments.
Does anyone know what is up with CLANG and CMAKE, and that slackbuild system, and how they could generate a non-functional package?
There may be some installation issue on my system with that. Except, that I have used CLANG on other projects and compiles.
I have verified that CLANG does generate executable code for this other game (which is not CMAKE and not slackbuild).

Enough for today.
Thank you for your attention.

Last edited by selfprogrammed; 05-30-2024 at 05:27 PM.
 
1 members found this post helpful.
Old 05-30-2024, 11:08 PM   #64
EdGr
Senior Member
 
Registered: Dec 2010
Location: California, USA
Distribution: I run my own OS
Posts: 1,005

Rep: Reputation: 476
I spotted a basic error in SharedPtr and SharedBuffer in utility.h:

Code:
private:
        void drop()
        {
                assert((*refcount) > 0);
                (*refcount)--;

                if(*refcount == 0)
                {
                        if(data)
                                delete[] data;
                        delete refcount;
                }
        }
        T *data;
        unsigned int m_size;
        unsigned int *refcount;
The operation is not atomic! Accesses to refcount need to be done inside a mutex.
Ed
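One common alternative to wrapping the counter in a mutex is to make the counter itself atomic. The sketch below is a stand-in, not the SharedPtr or SharedBuffer code from utility.h; RefCounted and its destroyed counter are hypothetical names used only to make the behaviour observable.

```cpp
#include <atomic>

// Hypothetical reference-counted object with an atomic counter.
class RefCounted {
public:
    explicit RefCounted(std::atomic<int>* destroyed) : destroyed_(destroyed) {}

    void grab() { count_.fetch_add(1, std::memory_order_relaxed); }

    // Returns true if this call released the last reference.
    bool drop() {
        // fetch_sub returns the previous value, so exactly one caller sees 1
        // and performs the delete, even when several threads drop at once.
        if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete this;
            return true;
        }
        return false;
    }

private:
    ~RefCounted() { destroyed_->fetch_add(1); }  // heap-only: use drop()
    std::atomic<unsigned> count_{1};             // the creator holds one reference
    std::atomic<int>* destroyed_;                // counts destructor runs
};
```

The decrement and the test for zero happen in one indivisible step here, which is the part the quoted drop() gets wrong. An atomic counter protects only the count; the data behind the pointer still needs its own synchronization.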
 
2 members found this post helpful.
Old 05-31-2024, 07:18 AM   #65
BrunoLafleur
Member
 
Registered: Apr 2020
Location: France
Distribution: Slackware
Posts: 428

Rep: Reputation: 388
Quote:
Originally Posted by EdGr
I spotted a basic error in SharedPtr and SharedBuffer in utility.h:

Code:
private:
        void drop()
        {
                assert((*refcount) > 0);
                (*refcount)--;

                if(*refcount == 0)
                {
                        if(data)
                                delete[] data;
                        delete refcount;
                }
        }
        T *data;
        unsigned int m_size;
        unsigned int *refcount;
The operation is not atomic! Accesses to refcount need to be done inside a mutex.
Ed
Yes, I have changed that, and also in libirrlicht, but it seems it is not the only problem. I have done some tests and corrected some of the segfaults I got, but there are others.

I don't know if they are the same bugs the OP is seeing, but they seem correlated. I continue to investigate.

Last edited by BrunoLafleur; 05-31-2024 at 03:02 PM.
 
Old 05-31-2024, 03:01 PM   #66
BrunoLafleur
Member
 
Registered: Apr 2020
Location: France
Distribution: Slackware
Posts: 428

Rep: Reputation: 388
Quote:
Originally Posted by selfprogrammed
The point of that particular piece of code:
Code:
   if( mesh != NULL ) {
       mesh->drop();  // segfaults here, gdb showed that mesh = 0
   }
1. The segfault occurred before calling drop.
2. There is no reason to believe that there was some usage of mesh after drop was called and still within that line. Such is not evident in the source code.
3. That drop may destroy much structure within mesh, but it should not be able to touch the "this" pointer itself. I wish I could say that it cannot, but I can only say that I do not know of any way that the drop function, no matter what it does, could possibly damage the source of its "this" ptr. That this ptr would at that time be in a register, or passed on the stack as a parameter, and even having drop trying to modify "this" should not touch it.
This drop method is from /usr/include/irrlicht/IReferenceCounted.h and deletes mesh itself (delete this if the refcount reaches 0). So yes, it does touch "this".

So if mesh has been deleted via drop elsewhere, mesh keeps its value but is no longer valid. I think a method like drop is bad practice, because in a big piece of software it is difficult to remember where objects have been destroyed in some other portion of the code.
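A simplified model of a drop() that ends in delete this may make the point concrete. This is a sketch, not Irrlicht's IReferenceCounted; Counted, live, and drop_and_clear are hypothetical names. After the last drop(), every copy of the pointer still holds the old address, and nothing in the pointer value shows that the object is gone.

```cpp
// Hypothetical model of a reference-counted object that deletes itself.
class Counted {
public:
    static int live;   // number of objects currently alive
    Counted() { ++live; }
    void grab() { ++refs_; }
    // Returns true if this call destroyed the object.
    bool drop() {
        if (--refs_ == 0) { delete this; return true; }
        return false;
    }

private:
    ~Counted() { --live; }   // heap-only: destruction goes through drop()
    int refs_ = 1;           // the creator holds one reference
};
int Counted::live = 0;

// Hypothetical helper: give up the caller's reference and clear the caller's
// pointer, so a later NULL test on that pointer is meaningful. Copies of the
// pointer held elsewhere are not affected.
template <typename T>
void drop_and_clear(T*& p) {
    if (p) { p->drop(); p = nullptr; }
}
```

Clearing the pointer at the point of the drop does not fix a double drop made through another copy, but it turns one class of stale-pointer use into a clean NULL check.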
 
Old 06-01-2024, 03:59 PM   #67
BrunoLafleur
Member
 
Registered: Apr 2020
Location: France
Distribution: Slackware
Posts: 428

Rep: Reputation: 388
I have put a patch here : https://github.com/BrunoLafleur/pbsl.../sbo/voxelands

It does some cleanup and fixes some bugs. There may be others.
 
2 members found this post helpful.
Old 06-04-2024, 12:48 AM   #68
selfprogrammed
Member
 
Registered: Jan 2010
Location: Minnesota, USA
Distribution: Slackware 13.37, 14.2, 15.0
Posts: 641

Original Poster
Rep: Reputation: 156
Thank you for the attention to this.
Most important: You ran voxelands and got some of the same segfaults ? That would eliminate all customizations unique to my machine.


I got a copy of the patch. It is lengthy. I can apply it blindly or I could apply parts of it. I have not decided yet, as too many patches to code under debug makes that debugging change radically and I would not know what I patched that changed anything (because the patch is too big).
I have not processed it yet, so the following is before seeing your actual patches.

Some arguments:
1. That drop() is used within more than one structure.
These are known:
SMesh
SMeshBuffer

It is probably up to the irrlicht user to use a mutex, if one is needed.

2. That drop() call in generate_mesh() is done within a mutex that is indicated by the mesh container.
See the function MapBlockMesh::generate_mesh .
Each mesh container should have an independent mutex. That mutex would be guarding that particular segment of the mesh data.
Should we be looking for a drop call that is NOT within that mesh container mutex?

I found several drop() calls:

MapBlockMesh::~MapBlockMesh()
-- Unguarded by any mutex call.
-- The possibility of any other thread accessing the structure while it is destructed is a far worse problem. This destructor should not be called while the structure is exposed to any other thread accessing it.
Exclusion must be ensured by the caller. Mutex is not needed, and might cause lock-up due to the caller having already locked it.
-- An arbitrary Mutex locked at this place is of no use, it must be the specific mutex that other users of the data are locking.

MapBlockMesh::generate_mesh()
-- the function that has all the segfaults
-- Unguarded by any mutex call.
-- But this usage should only drop a reference, as it is immediately after adding another reference.
-- modifies the buf after buf->drop(). This might be a race with another thread. I have a check on this, but it has not detected such.

MapBlockMesh::generate_mesh()
-- the function that has all the segfaults
-- This usage is guarded by a conditional mutex. It is possible that a mutex was not assigned.
-- I have not yet found the mutex that is used. It is not part of that mesh structure.
-- modifies the buf after buf->drop(). This might be a race with another thread. I have a check on this, but it has not detected such.

createNodeBoxMesh()
-- Unguarded by any mutex call.
-- But this usage should only drop a reference, as it is immediately after adding another reference.

createNodeBoxMesh()
-- Unguarded by any mutex call.
-- But this usage should only drop a reference, as it is immediately after handing the mesh to another new structure.

createModelMesh()
-- drop of a local temp file structure, not shared

extrudeARGB()
-- Unguarded by any mutex call.
-- But this usage should only drop a reference, as it is immediately after handing the mesh to another new structure.
-- handoff is done twice, buf and mesh.

createExtrudedMesh()
-- drop of a local temp structure

generateTextureFromMesh
-- Note in code about other segfaults due to drop().
-- commented out drop() calls
-- drop of a local temp structure, after is used for drawing

ExtrudedSpriteSceneNode::~ExtrudedSpriteSceneNode()
-- Unguarded by any mutex call.

ExtrudedSpriteSceneNode::setSprite()
-- Unguarded by any mutex call.
-- drop after handoff to another structure using SetMesh()

ExtrudedSpriteSceneNode::setCube()
-- Unguarded by any mutex call.

ExtrudedSpriteSceneNode::setNodeBox()
-- Unguarded by any mutex call.

ExtrudedSpriteSceneNode::setArm()
-- Unguarded by any mutex call.

At this point I stopped looking for more....

--
It is sloppy practice to use the same drop() function where it should not ever delete the structure (because it is handing the reference off to another structure).
There should be an explicit hand-off, that does not risk entanglements with the auto-delete of the reference counting.
The slightest mistake by the drop() user will cause the kind of problem that is evident.
It is not the mutex that is the problem.
The possible mistaken auto-delete of the structure, and having it overwritten by something else matches the symptoms better.
--
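One way to express an explicit hand-off in modern C++ is to move a std::unique_ptr into the new owner, so that no reference count is decremented and nothing can be deleted by accident during the transfer. This is a sketch of the idea only; Irrlicht's SMesh and SMeshBuffer use grab()/drop() instead, and the Buffer, Mesh, and adopt names below are hypothetical.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for a mesh buffer and the mesh that owns it.
struct Buffer { int vertices = 0; };

struct Mesh {
    std::vector<std::unique_ptr<Buffer>> buffers;

    // Explicit hand-off: after this call the mesh is the sole owner, and the
    // caller's unique_ptr is left empty.
    void adopt(std::unique_ptr<Buffer> buf) { buffers.push_back(std::move(buf)); }
};
```

The caller writes mesh.adopt(std::move(buf)), which makes the transfer visible at the call site and leaves buf null afterwards, so a later use of the old handle fails a NULL test instead of touching a freed object.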


2b. If that mutex ptr in the mesh container was NULL, that would make the drop() unguarded.
Cannot tell yet how likely that is. I will probably have to instrument that too.
I do have a test on some of the unguarded drop(), to detect if they actually delete the mesh.
They have not detected anything yet. Because the generate_mesh now refuses to segfault at that place anymore, I cannot check correlation with a segfault instance.

3. Even if drop() deletes the entire structure, I cannot see how it can modify the "mesh" pointer of the caller.
That "mesh" pointer would be passed to the drop() member function as a parameter, probably a register. That would be a copy of "mesh".
So how did the usage of mesh->drop() segfault ?
How did "mesh" get modified to NULL, before drop() was called ?
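If the pointer being tested is a member that another thread can also write, the value can change between the NULL test and the call, because the two are separate reads. That is only one possible explanation, and the sketch below is an assumption about how to rule it out, not code from voxelands; Holder, take, and release_mesh are hypothetical names.

```cpp
#include <mutex>

// Hypothetical stand-in for a mesh; drop() just counts calls here.
struct Mesh {
    int drops = 0;
    void drop() { ++drops; }
};

struct Holder {
    std::mutex mtx;
    Mesh* m_mesh = nullptr;   // assumed to be shared with other threads

    // Detach the pointer under the lock and return the private copy.
    Mesh* take() {
        std::lock_guard<std::mutex> lock(mtx);
        Mesh* local = m_mesh;
        m_mesh = nullptr;
        return local;
    }

    // Test and use the local copy, which no other thread can change.
    void release_mesh() {
        Mesh* local = take();
        if (local != nullptr) local->drop();
    }
};
```

This only helps if every other reader and writer of m_mesh takes the same mutex. If the segfault persists with such a pattern in place, a race on the pointer itself becomes a less likely cause.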

I think we have not yet found the wild-write itself.

Last edited by selfprogrammed; 06-04-2024 at 01:06 AM.
 
1 members found this post helpful.
Old 06-04-2024, 04:55 AM   #69
BrunoLafleur
Member
 
Registered: Apr 2020
Location: France
Distribution: Slackware
Posts: 428

Rep: Reputation: 388
Quote:
Originally Posted by selfprogrammed
Thank you for the attention to this.
Most important: You ran voxelands and got some of the same segfaults ? That would eliminate all customizations unique to my machine.
======================

Yes, I get some. They are not always exactly the same, but they are not far off.
=======================

I got a copy of the patch. It is lengthy. I can apply it blindly or I could apply parts of it. I have not decided yet, as too many patches to code under debug makes that debugging change radically and I would not know what I patched that changed anything (because the patch is too big).
I have not processed it yet, so the following is before seeing your actual patches.
==========

I have made another patch, at the same place. I had corrected some things where it segfaults, mainly in server.cpp.
In the new one, prompted by another segfault with the previous patch, I found a missing mutex in server.cpp.
Last evening it ran without error, but maybe I did not run it long enough (2 hours of play).

=====================

Some arguments:
1. That drop() is used within more than one structure.
These are known:
SMesh
SMeshBuffer

It is probably up to the irrlicht user to use a mutex, if one is needed.
=================================

I prefer an atomic counter and atomic functions, to be sure not to conflict with other mutexes. But here the counter is probably not the problem, as you say below.
====================

2. That drop() call in generate_mesh() is done within a mutex that is indicated by the mesh container.
See the function MapBlockMesh::generate_mesh .
Each mesh container should have an independent mutex. That mutex would be guarding that particular segment of the mesh data.
Should we be looking for a drop call that is NOT within that mesh container mutex?

I found several drop() calls:

MapBlockMesh::~MapBlockMesh()
-- Unguarded by any mutex call.
-- The possibility of any other thread accessing the structure while it is destructed is a far worse problem. This destructor should not be called while the structure is exposed to any other thread accessing it.
Exclusion must be ensured by the caller. Mutex is not needed, and might cause lock-up due to the caller having already locked it.
-- An arbitrary Mutex locked at this place is of no use, it must be the specific mutex that other users of the data are locking.
===============
Yes
================

MapBlockMesh::generate_mesh()
-- the function that has all the segfaults
-- Unguarded by any mutex call.
-- But this usage should only drop a reference, as it is immediately after adding another reference.
-- modifies the buf after buf->drop(). This might be a race with another thread. I have a check on this, but it has not detected such.

MapBlockMesh::generate_mesh()
-- the function that has all the segfaults
-- This usage is guarded by a conditional mutex. It is possible that a mutex was not assigned.
-- I have not yet found the mutex that is used. It is not part of that mesh structure.
-- modifies the buf after buf->drop(). This might be a race with another thread. I have a check on this, but it has not detected such.

createNodeBoxMesh()
-- Unguarded by any mutex call.
-- But this usage should only drop a reference, as it is immediately after adding another reference.

createNodeBoxMesh()
-- Unguarded by any mutex call.
-- But this usage should only drop a reference, as it is immediately after handing the mesh to another new structure.

createModelMesh()
-- drop of a local temp file structure, not shared

extrudeARGB()
-- Unguarded by any mutex call.
-- But this usage should only drop a reference, as it is immediately after handing the mesh to another new structure.
-- handoff is done twice, buf and mesh.

createExtrudedMesh()
-- drop of a local temp structure

generateTextureFromMesh
-- Note in code about other segfaults due to drop().
-- commented out drop() calls
-- drop of a local temp structure, after is used for drawing

ExtrudedSpriteSceneNode::~ExtrudedSpriteSceneNode()
-- Unguarded by any mutex call.

ExtrudedSpriteSceneNode::setSprite()
-- Unguarded by any mutex call.
-- drop after handoff to another structure using SetMesh()

ExtrudedSpriteSceneNode::setCube()
-- Unguarded by any mutex call.

ExtrudedSpriteSceneNode::setNodeBox()
-- Unguarded by any mutex call.

ExtrudedSpriteSceneNode::setArm()
-- Unguarded by any mutex call.

At this point I stopped looking for more....

--
It is sloppy practice to use the same drop() function where it should not ever delete the structure (because it is handing the reference off to another structure).
==================
I also consider it bad. The voxelands code is not easy to follow especially for the thread / mutex things.
===================

There should be an explicit hand-off, that does not risk entanglements with the auto-delete of the reference counting.
The slightest mistake by the drop() user will cause the kind of problem that is evident.
================
The delete operator in C++ is meant for destroying objects from the outside, through a pointer. The drop method destroys objects from inside themselves, so it is difficult to follow and leads to errors which are not easy to track.
====================

It is not the mutex that is the problem.
====================
Not the drop one, but ones like those discussed above, such as the one I found in server.cpp.
I don't know if I have found all the unprotected thread areas, but they are a major cause of the voxelands failures, which are very random.

============================

The possible mistaken auto-delete of the structure, and having it overwritten by something else matches the symptoms better.
==============================
Overwritten probably in another thread. Within the same thread it is easier to track, and I didn't find one.
===============================

--


2b. If that mutex ptr in the mesh container was NULL, that would make the drop() unguarded.
Cannot tell yet how likely that is. I will probably have to instrument that too.
================
Yes but I didn't see that case.
============================

I do have a test on some of the unguarded drop(), to detect if they actually delete the mesh.
They have not detected anything yet. Because the generate_mesh now refuses to segfault at that place anymore, I cannot check correlation with a segfault instance.

3. Even if drop() deletes the entire structure, I cannot see how it can modify the "mesh" pointer of the caller.
That "mesh" pointer would be passed to the drop() member function as a parameter, probably a register. That would be a copy of "mesh".
So how did the usage of mesh->drop() segfault ?
========================================
Because that mesh pointer has a value but is no longer valid: it points to a deleted object.
But the only case I can think of is that another thread has deleted it.

=============================

How did "mesh" get modified to NULL, before drop() was called ?

I think we have not yet found the wild-write itself.
==================
In my patch I also re-read the item from the list in server.cpp at a place where it segfaulted. The list can change, and an old pointer to an item in it can become invalid.
===================
My comments are inline in the quote above.
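The re-read of a list item mentioned in the comments can be sketched as follows: rather than keeping a pointer to an entry across calls, look the entry up again under the container's mutex each time it is needed. The real list in server.cpp is not shown in this thread, so Registry, Client, and touch are hypothetical names.

```cpp
#include <map>
#include <mutex>

// Hypothetical entry type; hits counts successful uses.
struct Client { int hits = 0; };

class Registry {
public:
    void add(int id) {
        std::lock_guard<std::mutex> l(mtx_);
        clients_[id];   // creates a default entry if absent
    }
    void remove(int id) {
        std::lock_guard<std::mutex> l(mtx_);
        clients_.erase(id);
    }
    // Look the entry up again under the lock on every use, instead of
    // holding a Client* across calls. Returns false if the entry is gone.
    bool touch(int id) {
        std::lock_guard<std::mutex> l(mtx_);
        auto it = clients_.find(id);
        if (it == clients_.end()) return false;
        ++it->second.hits;
        return true;
    }

private:
    std::mutex mtx_;
    std::map<int, Client> clients_;
};
```

The cost is one extra lookup per use. In exchange, an entry removed by another thread shows up as a false return instead of a stale pointer.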

Last edited by BrunoLafleur; 06-04-2024 at 09:48 AM.
 
Old 06-04-2024, 05:35 AM   #70
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,109

Rep: Reputation: 7367
Just one comment to this: add comments with different colors, that helps a lot.
And another comment: some classes are only used in a single thread (server for example) therefore it won't conflict with other threads, there is no race condition or other tricky situations.
 
Old 06-04-2024, 05:52 AM   #71
EdGr
Senior Member
 
Registered: Dec 2010
Location: California, USA
Distribution: I run my own OS
Posts: 1,005

Rep: Reputation: 476
The program needs a re-think and possibly a re-do.

Allocation and deallocation are nearly always done in the serial section. Reference counting and multi-threading do not go together.
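To illustrate one reason why: a plain "--count" inside drop() is a read-modify-write, so two threads doing it at once can both miss zero or both reach it. Below is a minimal sketch of the least a shared counter would need, using std::atomic. AtomicRefCounted and g_freed are illustrative names only, not code from voxelands or Irrlicht.
Code:
```cpp
#include <atomic>

static std::atomic<int> g_freed{0};   // illustration only: counts destructor runs

class AtomicRefCounted {
public:
    AtomicRefCounted() : m_count(1) {}
    virtual ~AtomicRefCounted() { g_freed.fetch_add(1); }

    void grab() { m_count.fetch_add(1, std::memory_order_relaxed); }

    // fetch_sub returns the value before the decrement, so exactly one
    // caller sees 1 and becomes responsible for the delete.
    bool drop() {
        if (m_count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete this;
            return true;
        }
        return false;
    }

private:
    std::atomic<int> m_count;
};
```
This protects only the count. A thread that holds a raw pointer without having called grab() can still use the object after the last drop(), and the object's own data needs separate locking.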

I don't know how much effort you want to spend. This is not a simple bug. A complete rewrite may be less work.
Ed
 
Old 06-04-2024, 06:14 AM   #72
BrunoLafleur
Member
 
Registered: Apr 2020
Location: France
Distribution: Slackware
Posts: 428

Rep: Reputation: 388
Quote:
Originally Posted by pan64 View Post
Just one comment on this: put your inline comments in a different color; that helps a lot.
And another comment: some classes are only used in a single thread (server, for example), so they cannot conflict with other threads; there is no race condition or other tricky situation there.
The same server (a pointer to a server object) is used from several threads above it. That usage is protected by mutexes in voxelands, but at least one was missing, and I may have just found another. The code is big but can be reviewed.
 
Old 06-04-2024, 09:16 AM   #73
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,109

Rep: Reputation: 7367
I just started compiling voxelands, and the compiler reports warnings like this:
Quote:
/home/pan/voxelands/voxelands-next/src/utility.h:559:36: warning: pointer used after 'void operator delete(void*, std::size_t)' [-Wuse-after-free]
There is no need to look further: just fix all of these warnings and errors first. But you have probably fixed them already.
 
Old 06-04-2024, 12:34 PM   #74
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,109

Rep: Reputation: 7367
It took me half a day to download voxelands and irrlicht, build gcc13, install CodeChecker in docker, and finally run it.
The result is, hm, disappointing.
Code:
----==== Severity Statistics ====----
----------------------------
Severity | Number of reports
----------------------------
MEDIUM   |               125
HIGH     |                83
LOW      |                26
----------------------------
Code:
----==== Checker Statistics ====----
------------------------------------------------------------------
Checker name                        | Severity | Number of reports
------------------------------------------------------------------
optin.cplusplus.UninitializedObject | MEDIUM   |                14
cplusplus.NewDelete                 | HIGH     |                 7
optin.cplusplus.VirtualCall         | MEDIUM   |               102
core.uninitialized.UndefReturn      | HIGH     |                18
deadcode.DeadStores                 | LOW      |                26
cplusplus.StringChecker             | HIGH     |                 1
core.CallAndMessage                 | HIGH     |                 4
cplusplus.NewDeleteLeaks            | HIGH     |                15
unix.Malloc                         | MEDIUM   |                 4
security.FloatLoopCounter           | MEDIUM   |                 1
core.NullDereference                | HIGH     |                21
core.NonNullParamChecker            | HIGH     |                 4
alpha.security.cert.env.InvalidPtr  | MEDIUM   |                 4
core.UndefinedBinaryOperatorResult  | HIGH     |                 7
core.uninitialized.Assign           | HIGH     |                 4
core.StackAddressEscape             | HIGH     |                 2
------------------------------------------------------------------
CodeChecker was still unconfigured; only the default settings were used, so this can be improved.
The result contains 234 reports. You have probably fixed a few already.
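For reference, the biggest bucket above, optin.cplusplus.VirtualCall, flags virtual functions that are called while an object is still being constructed or destroyed. A small made-up example of the pattern it complains about (Base and Derived are not from voxelands):
Code:
```cpp
#include <string>

struct Base {
    std::string seen;
    Base() { seen = name(); }          // virtual call during construction
    virtual ~Base() {}
    virtual std::string name() const { return "Base"; }
};

struct Derived : Base {
    std::string name() const override { return "Derived"; }
};

// Reports which override the Base constructor actually reached.
std::string name_seen_by_constructor() {
    Derived d;
    return d.seen;    // "Base": Derived's override is not active yet inside Base()
}
```
That behaviour is well defined in C++ (unless the function is pure virtual), so on its own it is unlikely to explain a trashed structure.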
 
Old 06-05-2024, 04:53 PM   #75
selfprogrammed
Member
 
Registered: Jan 2010
Location: Minnesota, USA
Distribution: Slackware 13.37, 14.2, 15.0
Posts: 641

Original Poster
Rep: Reputation: 156
Thank you for the compilation runs and the error confirmation.

---
I run voxelands using gdb to catch all faults, so I can examine them.
Several times I have observed that the whole structure (this) was trashed.

---
I have been considering what kind of bug could cause the symptoms.

One problem with the threading-fault theory is that the corruption it would cause is not severe enough.
A race might mix writes from two threads, but the data would otherwise still be valid ptrs.

The debug checkers and canaries in the code before the fault location did not print out anything, so I must assume that the data was correct up to that point.
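A minimal sketch of this kind of canary check, with made-up names and magic values (the actual instrumentation in the derived vector is not shown in this thread):
Code:
```cpp
#include <cstdint>

// Hypothetical magic values; any unlikely bit patterns will do.
const std::uint32_t CANARY_LIVE = 0xC0FFEE01u;
const std::uint32_t CANARY_FREED = 0xDEADBEEFu;

enum CanaryState { STATE_LIVE, STATE_FREED, STATE_CORRUPT };

struct GuardedBlock {
    volatile std::uint32_t head;   // volatile so the marker stores are not optimised away
    int payload[8];
    volatile std::uint32_t tail;

    GuardedBlock() : head(CANARY_LIVE), payload(), tail(CANARY_LIVE) {}

    // Call just before the block is released, so later users can tell.
    void mark_freed() { head = CANARY_FREED; tail = CANARY_FREED; }

    CanaryState check() const {
        if (head == CANARY_LIVE && tail == CANARY_LIVE) return STATE_LIVE;
        if (head == CANARY_FREED && tail == CANARY_FREED) return STATE_FREED;
        return STATE_CORRUPT;      // something wrote over a guard word
    }
};
```
A check like this only fires when someone calls it, so a wild write that lands between two checks goes unnoticed until the next one.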

If the "this" pointer on the stack got hit, the code would then modify data in some other location, which would match the symptoms.

If a data structure got deleted, and overwritten, then another user of that structure would also get wild ptrs, and may use them.
This also can happen in the middle of a function, without modifying the this pointer.

Unfortunately, the more checks I put in, the more likely it is that the fault will no longer manifest at all.
I do get bad game behavior, but it is not fatal, just annoying.

---
I have seen cases before where a ptr was used to modify a structure after the structure had been deleted.
That appears to work; it does not segfault. The C++ delete does not invalidate the ptr.
It exposes that data to being allocated again, possibly in another thread.
More than one segfault, when examined in gdb, has been like this.
The best that I can think of is that the stack must have been hit by a wild write from another thread. Can that happen?
If paging prevents that, then how did that mesh ptr become 0 immediately after a check for != NULL?
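One cheap detection aid for the use-after-delete case described above is to null the pointer at the point of deletion, so that a later use through that same variable faults at once instead of silently writing into reused memory. A sketch, with delete_and_null, value_or and Node as invented names:
Code:
```cpp
#include <cstddef>

struct Node {
    int value;
};

// Deletes the object and clears the caller's pointer in one step.
// Takes the pointer by reference so the caller's variable is the one reset.
template <typename T>
void delete_and_null(T*& ptr) {
    delete ptr;      // deleting a null pointer is a harmless no-op
    ptr = NULL;
}

// Guarded access: returns fallback when the pointer has been cleared.
int value_or(const Node* node, int fallback) {
    return node != NULL ? node->value : fallback;
}
```
It only helps for that one variable. Any other copy of the pointer, in another structure or another thread, still dangles.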

I saw another case where the virtual support ptr for an object had been set to 0xfffffff4, which caused everything virtual to segfault, and that object was entirely virtual functions. I could not see this for myself, as I looked before the object, and could not find that value.

---
I consider this to be a Heisenbug, because it moves or disappears when you try to look at it. It is probably the third Heisenbug that I have encountered.

---
I have considered that the code could use an entire rewrite. I have too many game projects already to adopt it.

The need to find this Heisenbug is mostly to make sure it does not become a problem in another of my projects. That would apply to other people's projects too.
The interesting part is what is causing the Heisenbug to alter its presentation, and what can be done to detect and prevent it.

After that, this would just be a voxelands maintenance effort, assuming there is anyone there that would accept the patches.
It would be a major effort.

---
I have other projects that get new warnings with every new GCC update. I don't think they are helping much, as most of my effort goes into shutting them up, because the code was known not to be an error.
Many of them are about printing to a buffer that might truncate, so GCC feels the need to warn you about everything like that.
Are any of the GCC 13 or CodeChecker warning messages about something promising?

Last edited by selfprogrammed; 06-05-2024 at 05:07 PM.
 
  

