Slackware 64 - Static compilation broken !
Hi,
I have a weird bug: a segmentation fault occurs when executing the 'retq' instruction of my SIGALRM callback when linking statically... It seems to happen only on Slackware... Here is a simple test case: compiled shared -> no problem, static -> crash... Paste the following script into a file named "test-sigalrm-pack2.sh" and execute it: it will generate the C++ source and a simple build/test script. Then just launch the build script (tst-sigalrm-build). Code:
#!/bin/sh
That's so stupid... I need alarm to make my cursor blink ! :) Thanks Garry. |
Forgot to mention...
Sorry, I forgot to mention... The same code was working fine on Slackware 32; I only encountered this after switching my system to Slackware 64 (I use the -current branch).
I had some feedback from Ubuntu 64 users who were able to run it without problems, but I have no guarantee at this time that they really tried static compilation. Cheers. |
You may need '-fPIC' in the g++ options.
|
Quote:
I tried, and it didn't change anything; it still crashes at the same exact place for the same reason. Anyway, I didn't believe it was that, because, as I mentioned, the generated code is 64-bit (it's a retq seen in the debugger...) and it works without problems in the shared model, which changes "nothing" but the glibc library version used at link time. (And I checked the gcc target config, which is, as expected, the default on x86_64...) So am I the only one who got this piece of code crashing on slackware64-current? Cheers Garry. |
Quote:
I'm afraid I don't know much about building statically and so can't be of any guidance. |
I can confirm a segfault on 64-current.
I tried converting it to pure C and using gcc rather than g++ but it does exactly the same thing. I also tried changing it to use the slightly simpler action.sa_handler invocation rather than using action.sa_sigaction but again, it gives exactly the same segfault issue when built statically. Some sort of libc bug perhaps? |
I've done a little more digging on this one. I worked with my C version of the code, which is slightly different from the OP's but shows the same symptoms (I'm not much good with C++).
gazl-sig.c: Code:
#include <sys/time.h>
Built with: gcc -Wall -O0 -g -static gazl-sig.c -o sig-static
Now for the interesting bit... gdb time. Code:
(gdb) run
The next question is: where is it trying to return to? So, let's set a breakpoint and examine the return address on the stack just before it tries to return: Code:
(gdb) break *0x4002d1
The shared-lib version seems to work in a completely different manner and actually returns to some executable code on the stack (I guess that's how shared libraries work). Anyway, that's about as far as I can go with this. I don't have the knowledge to dig any deeper. |
Thanks all for your feedback.
So, some simple facts:
- it happens only on 64-bit Slackware (the code has been tested successfully on Ubuntu 64)
- it happens only with static linking
- it happens on a system call
- it happens systematically, even with an empty handler, so there is no memory/stack/buffer overrun here.
My obvious guess is that the 'retq' is not the same 'size' as the call (it was called by a 'call', not a 'callq'). That would explain the totally broken pointer. I'm missing some 'underground' knowledge here, so I 'guess' that the caller is the kernel itself. If I'm not mistaken, the kernel supports 32-bit binaries 'as well' (silently) (nothing related to 'external shared libraries'; I'm talking from a static-linkage point of view). But to make the 'right call' (invoking the signal handler as a 32-bit or 64-bit handler), the kernel must get this information from 'somewhere'. I mean... "simply"... I think the 64-bit kernel can handle both 32-bit and 64-bit processes, and the statically linked binary might be tagged as '32-bit' somewhere... But a dump of the ELF info still shows a 64-bit binary (ld does a good job)... So it might be when glibc registers the sigaction, or something done directly by the buggy compiled process, that sends the kernel wrong information. I guess most of this glue code is in glibc, and/or tightly coupled with some gcc crtX.o runtime. I've looked at the 'gcc' package SlackBuild and it seemed alright; I mean it should take care of 64-bit (and it does for shared libs), and from first observation it does what's expected. But I can't help suspecting the static glibc libs of being built with some wrong option... So I don't have any new angle to look into; I have this 'guess' but don't know how to prove or disprove it, and don't know how to find the 'guilty one' in that chain. Is anybody working on the Slackware x86_64 build around here? It might just require a SlackBuild 'hack'. Thank you all for the support. Cheers Garry. |
Just some more information to be going on with as this intrigues me.
I have turned the source into an Eclipse C++ project and put in the appropriate -static linker flags for Eclipse. This builds a statically linked executable (as, just to be certain, "file my_alarm" confirms for me). The resulting binary segfaults as usual when run from the CLI, but runs OK from within the Eclipse IDE! Hmm, strange. |
Quote:
This is an interesting new behavior, yet it's still a mystery ! :) Cheers Garry. |
I was looking at your code.
Looks like you may be relying on the compiler to fix up your code. You're writing C syntax in a C++ compiler. Code:
using namespace std;
namespace foo {
    void main( void ) { cout << "my message"; }
}
Chances are the C compiler or glibc could be broken, though. I remember back in the day there was a return error that needed patching when you upgraded GCC. EGCS or something. This machine is Ubuntu 9.10, so its compiler version is: Code:
# gcc -v
Thread model: posix
gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu9)
Check your version, and check Google to see if there are reports for that version of the compiler. |
I've run a test to check my guess... I thought the caller was doing some 32bit call to the 64bit callback...
So I 'hacked' the callback this way : Code:
void _onAlarmSignal(int signal, siginfo_t* sigInfo, pvoid pUContext) {
Anyone else have a clue here? Note: you can write this code in ASM and it will still crash... This problem is not a 'religious syntax problem' of C++ vs. C or whatever; it's about the static standard library build... It's a 'system programming' problem, not an "I don't know how to write this code" problem. This is a bug test case; it doesn't represent the 'real-life code', of course. Thanks ! Cheers. Garry.
---- If it can be useful ----
Target: x86_64-slackware-linux
Configured with: ../gcc-4.4.3/configure --prefix=/usr --libdir=/usr/lib64 --enable-shared --enable-bootstrap --enable-languages=ada,c,c++,fortran,java,objc --enable-threads=posix --enable-checking=release --with-system-zlib --with-python-dir=/lib64/python2.6/site-packages --disable-libunwind-exceptions --enable-__cxa_atexit --enable-libssp --with-gnu-ld --verbose --disable-multilib --target=x86_64-slackware-linux --build=x86_64-slackware-linux --host=x86_64-slackware-linux
Thread model: posix
gcc version 4.4.3 (GCC)
-------------------------------- |
Ok, found out a little more.
By debugging the shared version of the program, I've found that the return address on the stack points to the symbol __restore_rt in libc.so.6: Code:
(gdb) c
In the static version, __restore_rt is at a different location than the return address on the top of the stack (which actually looks like an int to me rather than an address): Code:
(gdb) c
Code:
(gdb) set {long} 0x7fffa3a31a78 = __restore_rt |
Yes, I figured the stack must somehow be getting messed up and trashing the return. I have not, however, been able to give this as much attention today as I had hoped, as my development environment got trashed and needs fixing (a long story - suffice to say Eclipse can be an absolute nightmare). It would be nice to pinpoint what is causing the frame to get messed up like this.
Nice one for the detective work and keeping us posted ! |
Hi,
Thanks for the trace; that's effectively what I got too. Meanwhile I tried to find some information. restore_rt is a special address used by glibc (look in 'signal.c' for the appropriate architecture). I've read that when a signal is delivered, that symbol is 'inserted' into the stack as the return address for the signal handler. Sorry, I can't find where I read that, but you should be able to find this info if you search around keywords like "signal" and "restore_rt". Also, as I have a lot of statically compiled programs not requiring signals, I've found that trying to trace such a program with gdb makes gdb freeze quite quickly (you might need two source files and a call from main)... So those programs work well (if there's no bug ;) ) because they don't use 'signals', but when trying to trace (if there is a bug :( ) gdb quickly freezes. First I thought it was ddd, but CLI gdb does the same. (EDIT: After some more tests I'm not 100% sure about that; it seems gdb just takes ages sometimes, but it's still far longer than what I experienced on Slack32.) I'm pretty sure this thread talks about exactly the same problem (but with no solution): http://www.gossamer-threads.com/lists/openssh/dev/47519 So I still think that 'somehow' the static builds of the glibc libraries are broken (maybe vs. the static build of gcc+gdb, and so on...). It seems that if we don't find it ourselves, we're stuck :). I sincerely think that even if 'static build' is not so common nowadays, it should work; there are quite a few situations that require it. So we have to debug our Slackware64 build so as not to be shamed by Ubuntu users ;). Cheers Garry. |