LinuxQuestions.org - [SOLVED] nVidia proprietary driver segfaults unless run with gdb [C++]

Hiya,

I'm trying to track down a bug in one of my C++ programs. It has all the classic symptoms I associate with memory errors (the key one being that moving my debug output code around changes whether or not it segfaults), but valgrind doesn't complain about my own code at all.

When run on its own we get:

Code:

$ make && ./duel
g++ -c duel.cpp -Wall -ggdb
g++ duel.o particles.o draw.o myutils.o bullet.o ship.o playership.o ghostship.o aiship.o -o duel -Wall -ggdb -lglut -lGL -lGLU 
Player ship is #0
AI ship at (-5,5,0) is #1
[1]    4227 segmentation fault  ./duel

With gdb:

Code:

$ gdb duel
<snip - the normal boomf at the start of gdb>
(gdb) r
Starting program: /home/joshua/scripts/c++/opengl/newton_duel/duel 
warning: Could not load shared library symbols for linux-vdso.so.1.
Do you need "set solib-search-path" or "set sysroot"?
Player ship is #0
AI ship at (-5,5,0) is #1
Forwards: (0,0,0)      5      0
Forwards: (0.0005,0,0)  4.9995  0.00025
Forwards: (0.001,0,0)  4.9985  0.001
<snip - normal program execution>

With valgrind:

Code:

$ valgrind --tool=memcheck duel
==4246== Memcheck, a memory error detector
==4246== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==4246== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==4246== Command: duel
==4246== 
==4246== Conditional jump or move depends on uninitialised value(s)
==4246==    at 0x4017876: index (in /usr/lib/ld-2.16.so)
==4246==    by 0x4007902: expand_dynamic_string_token (in /usr/lib/ld-2.16.so)
==4246==    by 0x4008204: _dl_map_object (in /usr/lib/ld-2.16.so)
==4246==    by 0x400180D: map_doit (in /usr/lib/ld-2.16.so)
==4246==    by 0x400E785: _dl_catch_error (in /usr/lib/ld-2.16.so)
==4246==    by 0x40010DB: do_preload (in /usr/lib/ld-2.16.so)
==4246==    by 0x4004546: dl_main (in /usr/lib/ld-2.16.so)
==4246==    by 0x4014B5D: _dl_sysdep_start (in /usr/lib/ld-2.16.so)
==4246==    by 0x4004DFD: _dl_start (in /usr/lib/ld-2.16.so)
==4246==    by 0x4001627: ??? (in /usr/lib/ld-2.16.so)
==4246== 
==4246== Conditional jump or move depends on uninitialised value(s)
==4246==    at 0x401787B: index (in /usr/lib/ld-2.16.so)
==4246==    by 0x4007902: expand_dynamic_string_token (in /usr/lib/ld-2.16.so)
==4246==    by 0x4008204: _dl_map_object (in /usr/lib/ld-2.16.so)
==4246==    by 0x400180D: map_doit (in /usr/lib/ld-2.16.so)
==4246==    by 0x400E785: _dl_catch_error (in /usr/lib/ld-2.16.so)
==4246==    by 0x40010DB: do_preload (in /usr/lib/ld-2.16.so)
==4246==    by 0x4004546: dl_main (in /usr/lib/ld-2.16.so)
==4246==    by 0x4014B5D: _dl_sysdep_start (in /usr/lib/ld-2.16.so)
==4246==    by 0x4004DFD: _dl_start (in /usr/lib/ld-2.16.so)
==4246==    by 0x4001627: ??? (in /usr/lib/ld-2.16.so)
==4246== 
Player ship is #0
AI ship at (-5,5,0) is #1
Forwards: (0,0,0)      5      0
==4246== Invalid write of size 8
==4246==    at 0x7DE7E72: ??? (in /usr/lib/libnvidia-glcore.so.302.17)
==4246==  Address 0x61a000 is not stack'd, malloc'd or (recently) free'd
==4246== 
==4246== 
==4246== Process terminating with default action of signal 11 (SIGSEGV)
==4246==  Access not within mapped region at address 0x61A000
==4246==    at 0x7DE7E72: ??? (in /usr/lib/libnvidia-glcore.so.302.17)
==4246==  If you believe this happened as a result of a stack
==4246==  overflow in your program's main thread (unlikely but
==4246==  possible), you can try to increase the size of the
==4246==  main thread stack using the --main-stacksize= flag.
==4246==  The main thread stack size used in this run was 8388608.
==4246== 
==4246== HEAP SUMMARY:
==4246==    in use at exit: 7,874,762 bytes in 1,873 blocks
==4246==  total heap usage: 2,669 allocs, 796 frees, 13,445,190 bytes allocated
==4246== 
==4246== LEAK SUMMARY:
==4246==    definitely lost: 0 bytes in 0 blocks
==4246==    indirectly lost: 0 bytes in 0 blocks
==4246==      possibly lost: 2,383,893 bytes in 99 blocks
==4246==    still reachable: 5,490,869 bytes in 1,774 blocks
==4246==        suppressed: 0 bytes in 0 blocks
==4246== Rerun with --leak-check=full to see details of leaked memory
==4246== 
==4246== For counts of detected and suppressed errors, rerun with: -v
==4246== Use --track-origins=yes to see where uninitialised values come from
==4246== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)
[1]    4246 segmentation fault  valgrind --tool=memcheck duel

Note that it's not my code that's segfaulting, but the nvidia driver.

Now, the last time I ran valgrind over my code (a week or so ago) I got nothing from it at all. From my pacman log, it appears I upgraded to glibc 2.16.0-2 and nvidia 302.17-2 yesterday (15th of July), so I'm guessing this is stemming from these updates.

I can't construct a simple, self-contained code snippet because the error only appears if certain output statements are there (just redirecting things to std::cout), and so I can't really track the error down to a particular line or code block - though I know which routine seems to be causing the problem.

Is there any way I can resolve this problem? Is it a bug which needs reporting, and if so, to Arch, nVidia (hah), or the maintainers of glibc? As always, if there's any information I can provide, just let me know.

Thanks!

Hey Snark,

would you care to share your source code (plus Makefile or such)? Maybe you're just initializing the libraries (or related objects) in a way they don't expect, and so the error shows its puny head long after the actual cause has taken place?

Cheers,
dmdeb

Well, the source code's massive and split over several files. I don't think I'm initialising anything incorrectly because it all worked fine until recently, but it's entirely possible you're correct. I'll create a github repository for it tomorrow and link to that from here (I don't want to spam the thread with all the code...)

Cheers for the response

Right, I've created the github repository, it's https://github.com/Snark1994/newton-duel/. The 'master' branch is fine, the problem's in the 'devel' branch. The problem emerged after writing the code for the AI ships (aiship.cpp, the function "void AIShip::update(void);"). If I remove all the 'cout <<' statements in the function, then it runs without a problem, and as stated previously it runs fine under gdb...

EDIT: if it makes any odds, it also runs fine under Windows with MinGW.

Thanks,

Hi.

Cool! On my onboard Intel 4500MHD there are no segfaults. It would be great if you kept the GitHub repository up to date.

Hi Snark,

thanks for sharing. Unfortunately (in a way), both the master and the devel branch work perfectly fine on my system (nvidia x86_64-295.40). Valgrind did not report any errors either. So much for the quick shot... I'll try to have a deeper look now.

Best regards
dmdeb

@firstfire: I will try my best to do so! And if you're interested in the project (especially the graphics side - I've never done any of that before) you're more than welcome to contribute :)

@dmdeb: Thanks for trying it, what's your glibc version? If I can find or create a system with glibc 2.16.0-2 and an earlier version of nvidia, and the segfault only appears once I install the newest nvidia version, then we've narrowed it down to an nvidia bug...

Hey there,

my libc says "GNU C Library (Debian EGLIBC 2.11.3-3) stable release version 2.11.3". Way old, as usual for a Debian stable system...

I sent an email with details to your gmail address in order not to stress this thread too much.

Cheers
dmdeb

Right, got it fixed with the help of dmdeb (wish I could give you more reputation for that, but LQ tells me to 'spread it around' and I feel marking your previous post as helpful would be slightly misleading...).

The suggestion which worked was the following:

Quote:

The updateUniverse() function loops over the objectList and,
where requested, erases items - or even adds new ones (Emitters).
I'd be ultra-careful about modifying containers while iterating
over them; the world is full of weird bugs caused by such attempts
(and I am a fan of defensive programming, too). You could postpone
the erasing to after the loop using remove_if() or so, and maintain
a temporary list of new Emitters to be added to the objectList in
another step.

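My problem was that I was erasing items from, and adding items to, a vector while I was iterating over it. Doing that can invalidate the iterators in use by the loop, and using them afterwards is undefined behaviour, which was the cause of the segfault.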
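For anyone who finds this thread later, here is a minimal sketch of the two-step approach from the quote. It is not the actual code from the repository: GameObject, its fields, and the "id + 100" rule for spawned objects are made up for illustration, and only the names updateUniverse and objectList-style vector come from the suggestion itself. The idea is simply that the loop never changes the size of the vector, and the erasing and adding happen afterwards.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the game's objects; the real classes differ.
struct GameObject {
    int id;
    bool dead;           // object asked to be erased
    bool spawnsEmitter;  // object wants a new object added this frame
};

// Sketch of an update that never resizes `objects` while iterating over it.
void updateUniverse(std::vector<GameObject>& objects) {
    std::vector<GameObject> pending;  // new objects, appended after the loop

    // Step 1: iterate and only record what should be added.
    for (GameObject& obj : objects) {
        if (obj.spawnsEmitter) {
            obj.spawnsEmitter = false;
            // Illustrative rule only: the spawned object gets id + 100.
            pending.push_back(GameObject{obj.id + 100, false, false});
        }
    }

    // Step 2: erase all dead objects in one go (erase-remove idiom).
    auto isDead = [](const GameObject& o) { return o.dead; };
    objects.erase(std::remove_if(objects.begin(), objects.end(), isDead),
                  objects.end());
    assert(std::none_of(objects.begin(), objects.end(), isDead));

    // Step 3: append the objects collected in step 1.
    objects.insert(objects.end(), pending.begin(), pending.end());
}
```

Because the size of the vector only changes outside the loop, no iterator or reference used by the loop can be invalidated, whatever the objects ask for during the update.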

Thank you very much, dmdeb!