Slackware: This forum is for the discussion of Slackware Linux.
Location: Lausanne - Switzerland ( Bordeaux - France / Montreal - QC - Canada)
Distribution: Slackware Leet - 32/64bit
Posts: 152
Rep:
Slackware 64 - Static compilation broken !
Hi,
I have a weird bug: a segmentation fault occurs when executing the 'retq' instruction of my SIGALRM callback in a statically linked build...
It seems to happen only on Slackware...
Here is a simple test case: compiled shared -> no problem, static -> crash...
Paste the following script into a file named "test-sigalrm-pack2.sh" and execute it: it will generate the C++ source and a simple build/test script. Then just launch the build script (tst-sigalrm-build).
I suspect some mismatch in the static libraries, something using 32-bit somewhere, so that when the 'retq' pops the return address it is totally wrong (that's my guess, but I have no real clue after several weeks of debugging with ddd, Google, the glibc mailing list and the LQ programming forums).
That's so stupid... I need the alarm to make my cursor blink!
Location: Lausanne - Switzerland ( Bordeaux - France / Montreal - QC - Canada)
Distribution: Slackware Leet - 32/64bit
Posts: 152
Original Poster
Rep:
Forgot to mention...
Sorry, I forgot to mention... The same code worked fine on Slackware 32; I only ran into this when switching my system to Slackware 64 (I use the -current branch).
I had some feedback from Ubuntu 64 users who were able to run it without problems, but at this point I have no guarantee that they really tried the static build.
Location: Lausanne - Switzerland ( Bordeaux - France / Montreal - QC - Canada)
Distribution: Slackware Leet - 32/64bit
Posts: 152
Original Poster
Rep:
Quote:
Originally Posted by gnashley
You may need '-fPIC' in the g++ options.
Hey, thank you,
I tried it, and it didn't change anything: it still crashes at the exact same place for the same reason.
Anyway, I didn't believe it was that because, as I mentioned, the generated code is 64-bit (it's a retQ seen in the debugger...) and it works without problems in the shared model, which changes nothing but the flavour of the glibc library used to link. (And I checked the gcc target config, which is, as expected, x86_64 by default...)
So am I the only one who got this piece of code crashing on slackware64-current?
Distribution: slackware64 13.37 and -current, Dragonfly BSD
Posts: 1,810
Rep:
Quote:
So am I the only one who got this piece of code crashing on slackware64-current ?
I haven't tried this on current but out of curiosity I tried it on Slackware64 -13 and it crashes the same with a segmentation fault. The shared lib version runs fine.
I'm afraid I don't know much about building statically and so can't be of any guidance.
I tried converting it to pure C and using gcc rather than g++ but it does exactly the same thing.
I also tried changing it to use the slightly simpler action.sa_handler invocation rather than using action.sa_sigaction but again, it gives exactly the same segfault issue when built statically.
I've done a little more digging on this one. I worked with my C version of the code, which is slightly different from the OP's but shows the same symptoms (I'm not much good with C++).
gazl-sig.c:
Code:
#include <sys/time.h>
#include <signal.h>
#include <stdio.h>

static int alarmed;
struct sigaction action,oldaction;

void onAlarmSignal( int signal)
{
    printf("Tick!\n");
    ++alarmed;
}

void registerSignal()
{
    action.sa_handler=onAlarmSignal;        /* one-argument handler form */
    sigemptyset(&action.sa_mask);
    sigaction(SIGALRM,&action,&oldaction);
}

void startTimer()
{
    struct itimerval value;
    value.it_interval.tv_sec=0;
    value.it_interval.tv_usec=100;          /* fire every 100 microseconds */
    value.it_value=value.it_interval;
    setitimer(ITIMER_REAL, &value, NULL);
}

int main( int argc, const char *argv[])
{
    alarmed=0;
    registerSignal();
    startTimer();
    do
        ;                                   /* busy-wait for ten ticks */
    while (alarmed <10);
    return 0;
}
Then I compiled it with:
gcc -Wall -O0 -g -static gazl-sig.c -o sig-static
Now for the interesting bit... gdb time.
Code:
(gdb) run
Starting program: /tmp/sig-static
Tick!
Program received signal SIGSEGV, Segmentation fault.
0x00000000004002d1 in onAlarmSignal (signal=1) at gazl-sig.c:12
12 }
(gdb) disassemble
Dump of assembler code for function onAlarmSignal:
0x00000000004002ac <onAlarmSignal+0>: push %rbp
0x00000000004002ad <onAlarmSignal+1>: mov %rsp,%rbp
0x00000000004002b0 <onAlarmSignal+4>: sub $0x10,%rsp
0x00000000004002b4 <onAlarmSignal+8>: mov %edi,-0x4(%rbp)
0x00000000004002b7 <onAlarmSignal+11>: mov $0x46f824,%edi
0x00000000004002bc <onAlarmSignal+16>: callq 0x4010c0 <puts>
0x00000000004002c1 <onAlarmSignal+21>: mov 0x296009(%rip),%eax # 0x6962d0 <alarmed>
0x00000000004002c7 <onAlarmSignal+27>: add $0x1,%eax
0x00000000004002ca <onAlarmSignal+30>: mov %eax,0x296000(%rip) # 0x6962d0 <alarmed>
0x00000000004002d0 <onAlarmSignal+36>: leaveq
0x00000000004002d1 <onAlarmSignal+37>: retq
End of assembler dump.
The segfault is occurring on the retq instruction.
The next question is: where is it trying to return to? So, let's set a breakpoint and examine the return address on the stack just before it tries to return:
Code:
(gdb) break *0x4002d1
Breakpoint 1 at 0x4002d1: file gazl-sig.c, line 12.
(gdb) run
Starting program: /tmp/sig-static
Tick!
Breakpoint 1, 0x00000000004002d1 in onAlarmSignal (signal=1) at gazl-sig.c:12
12 }
(gdb) x/a $rsp
0x7fff1800d3f8: 0xf0000000fc0c748
(gdb) x 0xf0000000fc0c748
0xf0000000fc0c748: Cannot access memory at address 0xf0000000fc0c748
(gdb)
I may be misinterpreting it, but it looks like it's trying to return to the middle of nowhere.
The shared lib version seems to work in a completely different manner and actually returns to some executable code on the stack (I guess that's how shared libraries work).
Anyway, that's about as far as I can go with this. I don't have the knowledge to dig any deeper.
Location: Lausanne - Switzerland ( Bordeaux - France / Montreal - QC - Canada)
Distribution: Slackware Leet - 32/64bit
Posts: 152
Original Poster
Rep:
Thanks all for your feedback.
So, some simple facts:
- it happens only on 64-bit Slackware (it was tested successfully on Ubuntu 64)
- it happens only with static linking.
- it happens on a system call
- it happens systematically, even with an empty function, so there is no memory/stack/buffer overrun here.
My obvious guess is that the 'retq' does not have the same 'size' as the call (it was called by a 'call', not a 'callq'). That would explain the totally broken pointer.
I'm missing some low-level knowledge here, so I'm guessing that the caller is the kernel itself. If I'm not mistaken, the kernel also supports 32-bit binaries transparently (this has nothing to do with external shared libraries; I'm talking from a static-linking point of view). But to make the right call (invoking the signal handler as a 32-bit or a 64-bit handler), the kernel must get this information from somewhere.
Put simply: I think the 64-bit kernel can handle both 32-bit and 64-bit processes, and that the statically linked binary might be tagged as 32-bit somewhere. But a dump of the ELF info still shows a 64-bit binary (ld does a good job). So it might happen when glibc registers the sigaction, or in something else done directly by the miscompiled process that sends the kernel wrong information.
I guess most of this glue code is in glibc, and/or tightly coupled with some gcc crtX.o runtime.
I've looked at the 'gcc' package SlackBuild and it seemed all right: it should take care of 64-bit (and it does for shared libs), and at first glance it does what is expected. But I can't help suspecting that the static glibc libs were built with some wrong option...
So I don't have any new lead to follow. I have this guess, but I don't know how to prove or disprove it, and I don't know how to find the guilty one in that chain.
Is anybody who works on the Slackware x86_64 build around?
It might just take a small SlackBuild hack to fix.
Thank you all for the support.
Cheers
Garry.
Last edited by NoStressHQ; 04-26-2010 at 12:55 AM.
Reason: Some corrections and precisions.
Distribution: slackware64 13.37 and -current, Dragonfly BSD
Posts: 1,810
Rep:
Just some more information to be going on with as this intrigues me.
I have turned the source into an Eclipse C++ project and put in the appropriate -static linker flags for Eclipse. This builds a statically linked executable (as, just to be certain, "file my_alarm" confirms for me). The resulting binary segfaults as usual when run from the CLI but runs OK from within the Eclipse IDE! Hmm, strange.
Location: Lausanne - Switzerland ( Bordeaux - France / Montreal - QC - Canada)
Distribution: Slackware Leet - 32/64bit
Posts: 152
Original Poster
Rep:
Quote:
Originally Posted by bgeddy
Just some more information to be going on with as this intrigues me. [...]
Thanks for the help. This makes me wonder whether there is something different between a process spawned by another program and a launch from the CLI. I thought the kernel somehow (ELF info?) gets the binary's 'bit size'. I assume the binary you're using is the same whether you launch it from Eclipse or from the CLI... So, is there a way for the calling process to tell the system which 'bit depth mode' the binary is in? Maybe it's the fork/exec pair that copies Eclipse's 64-bit 'flags' to its child process, bypassing the ELF info baked into the binary? (Big guess here; the fork would explain it, but on the other hand the shell forks too to launch a binary...)
This is an interesting new behavior, yet it's still a mystery!
Looks like you may be relying on the compiler to fix up your code.
You're writing C syntax in a C++ compiler.
using namespace std;
namespace foo
{
void main( void )
{
cout << "my message";
}
}
Chances are the C compiler or glibc could be broken, though. I remember back in the day there was a return error that needed patching when you upgraded GCC. EGCS or something.
This machine is Ubuntu 9.10, so its compiler version is
# gcc -v
Thread model: posix
gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu9)
Check your version and search Google to see if there are reports for that version of the compiler.
This forces a 32-bit ret, but it doesn't fix the crash... So my guess was wrong. It's not related to a call size mismatch...
Does anyone else have a clue here?
Note: You can write this code in ASM and it'll still crash... The problem is not a 'religious syntax problem' of C++ vs C or whatever; it's about the static standard library build... It's a 'system programming' problem, not an "I don't know how to write this code" problem. This is a bug test case; it doesn't represent the real-life code, of course.
Thanks !
Cheers.
Garry.
---- In case it is useful ----
Target: x86_64-slackware-linux
Configured with: ../gcc-4.4.3/configure --prefix=/usr --libdir=/usr/lib64 --enable-shared --enable-bootstrap --enable-languages=ada,c,c++,fortran,java,objc --enable-threads=posix --enable-checking=release --with-system-zlib --with-python-dir=/lib64/python2.6/site-packages --disable-libunwind-exceptions --enable-__cxa_atexit --enable-libssp --with-gnu-ld --verbose --disable-multilib --target=x86_64-slackware-linux --build=x86_64-slackware-linux --host=x86_64-slackware-linux
Thread model: posix
gcc version 4.4.3 (GCC)
--------------------------------
Last edited by NoStressHQ; 04-28-2010 at 02:51 PM.
Reason: Added gcc/platform infos...
Ok, found out a little more.
By debugging the shared version of the program I've found that the return address on the stack points to the symbol __restore_rt in libc.so.6:
Code:
(gdb) c
Continuing.
Tick!
Breakpoint 2, 0x0000000000400601 in onAlarmSignal (signal=1) at gazl-sig.c:12
12 }
(gdb) x/a $rsp
0x7fff703940b8: 0x7fc42871d450 <__restore_rt>
(gdb) info shared
From To Syms Read Shared Object Library
0x00007fc428a5ba90 0x00007fc428a73ed4 Yes /lib64/ld-linux-x86-64.so.2
0x00007fc4287087e0 0x00007fc42880a4a4 Yes /lib64/libc.so.6
(gdb)
Now when we look at the return address in the statically linked program, we can see that __restore_rt is at a different location than the return address on the top of the stack (which actually looks like an int to me rather than an address):
Code:
(gdb) c
Continuing.
Tick!
Breakpoint 2, 0x00000000004002d1 in onAlarmSignal (signal=1) at gazl-sig.c:12
12 }
(gdb) x/a $rsp
0x7fffa3a31a78: 0xf0000000fc0c748
(gdb) info address __restore_rt
Symbol "__restore_rt" is at 0x400af0 in a file compiled without debugging.
(gdb) info symbol __restore_rt
__restore_rt in section .text
Now, if I manually change that return address on the stack to point to __restore_rt, the program seems to continue correctly (for 1 iteration, at which point the stack has the wrong value again):
Code:
(gdb) set {long} 0x7fffa3a31a78 = __restore_rt
(gdb) x/a $rsp
0x7fffa3a31a78: 0x400af0 <__restore_rt>
(gdb) c
Continuing.
Tick!
Breakpoint 2, 0x00000000004002d1 in onAlarmSignal (signal=1) at gazl-sig.c:12
12 }
(gdb)
Quite why the return address on the stack isn't pointing at "__restore_rt" I have no idea, but that looks like it's what's going wrong.
Distribution: slackware64 13.37 and -current, Dragonfly BSD
Posts: 1,810
Rep:
Yes, I figured the stack must be somehow getting messed and trashing the return. I have not, however, been able to give this as much attention today as I had hoped as my development environment has got trashed and needs fixing, (a long story - suffice to say Eclipse can be an absolute nightmare). It would be nice to pinpoint what was causing the frame to get messed up like this.
Nice one for the detective work and keeping us posted !
Location: Lausanne - Switzerland ( Bordeaux - France / Montreal - QC - Canada)
Distribution: Slackware Leet - 32/64bit
Posts: 152
Original Poster
Rep:
Hi,
Thanks for the trace; that's indeed what I got too.
Meanwhile I tried to find some more information.
restore_rt is a special address used by glibc (look in 'signal.c' for the appropriate architecture). I've read that when you make a kernel call, that symbol is inserted on the stack as the return address for the signal handler. Sorry, I can't find where I read that, but you should be able to find this info by searching for keywords such as "signal" and "restore_rt".
Also, as I have a lot of statically compiled programs that don't require signals, I've found that trying to trace such a program with gdb makes gdb freeze quite quickly (you might need two source files and a call from main)... So those programs work well (if there is no bug) because they don't use signals, but when trying to trace them (if there is a bug) gdb quickly freezes. At first I thought it was ddd, but CLI gdb does the same. (EDIT: After some more tests I'm not 100% sure about that; it seems gdb just takes ages sometimes, but it's still far longer than what I experienced on Slack32.)