LinuxQuestions.org


NoStressHQ 04-23-2010 05:36 PM

Slackware 64 - Static compilation broken !
 
Hi,

I have a weird bug: a segmentation fault occurs when executing the 'retq' instruction of my SIGALRM callback, but only in a statically linked build...
It seems to happen only on Slackware...
Here is a simple test case: compiled shared -> no problem, static -> crash...
Paste the following script into a file named "test-sigalrm-pack2.sh" and execute it: it will generate the C++ source and a simple build/test script... Then just launch the build script (tst-sigalrm-build).

Code:

#!/bin/sh
#test-sigalrm-pack2.sh
# 64bit sigalrm segmentation fault test case package...

echo " * Generating source..."
cat        >tst-sigalrm.cpp        <<TESTSRC
//tst-sigalrm.cpp
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/time.h>

typedef        void*        pvoid;

namespace{
        volatile        unsigned        int        alarmed        =0;
        struct        sigaction        action,oldAction;

        void        _onAlarmSignal(int        signal,siginfo_t* sigInfo,pvoid pUContext) {
                printf("Tick !\n");
                ++alarmed;
        }

        void        _registerSignal() {
                action.sa_flags                =SA_SIGINFO;
                action.sa_sigaction        =_onAlarmSignal;
                action.sa_restorer        =NULL;
                sigemptyset(&action.sa_mask);

                sigaction(SIGALRM,&action,&oldAction);
        }

        void        _startTimer() {
                itimerval        value;
                value.it_interval.tv_sec        =0;
                value.it_interval.tv_usec        =100;
                value.it_value        =value.it_interval;
                setitimer(ITIMER_REAL,&value,NULL);
        }
}

int        main(int argc,const char **argv) {

        _registerSignal();
        _startTimer();

        do        ;        while(alarmed<10);

        return        0;
}
TESTSRC

echo " * Generating build script..."
cat        >tst-sigalrm-build        <<TESTBUILD
#!/bin/sh

#Builds of the sigalrm test case:
g++ tst-sigalrm.cpp -o tst-sigalrm-shared
g++ -static tst-sigalrm.cpp -o tst-sigalrm-static

echo " * Shared run :"
./tst-sigalrm-shared
echo " * Static run :"
./tst-sigalrm-static
TESTBUILD
chmod a+x "tst-sigalrm-build"

I suspect some 'mismatch' in the static libraries, something using '32bit' somewhere, so that when the 'retq' pops the return address it's totally wrong (that's my guess, but I have no clue, after several weeks of debugging with ddd, Google, the glibc mailing list and the LQ programming forums).

That's so stupid... I need the alarm to make my cursor blink! :)

Thanks

Garry.

NoStressHQ 04-23-2010 05:59 PM

Forgot to mention...
 
Sorry, I forgot to mention... The same code was working fine on Slackware 32; I only ran into this after switching my system to Slackware 64 (I use the -current branch).
I had some feedback from Ubuntu 64 users who were able to run it without problems, but I have no guarantee at this time that they really tried static compilation.

Cheers.

gnashley 04-24-2010 01:15 PM

You may need '-fPIC' in the g++ options.

NoStressHQ 04-25-2010 05:39 AM

Quote:

Originally Posted by gnashley (Post 3946434)
You may need '-fPIC' in the g++ options.

Hey, thank you,
I tried it, and it didn't change anything: it still crashes at the exact same place for the same reason.

Anyway, I didn't believe that was it because, as I mentioned, the generated code is 64bit (it's a retQ seen in the debugger...) and it works without problems in the shared model, which doesn't change "anything" but the glibc library version used to link. (And I checked the gcc target config, which is, as expected, x86_64 by default...)

So am I the only one who got this piece of code crashing on slackware64-current ?

Cheers

Garry.

bgeddy 04-25-2010 10:42 AM

Quote:

So am I the only one who got this piece of code crashing on slackware64-current ?
I haven't tried this on current but out of curiosity I tried it on Slackware64 -13 and it crashes the same with a segmentation fault. The shared lib version runs fine.

I'm afraid I don't know much about building statically and so can't be of any guidance.

GazL 04-25-2010 04:01 PM

I can confirm a segfault on 64-current.

I tried converting it to pure C and using gcc rather than g++ but it does exactly the same thing.
I also tried changing it to use the slightly simpler action.sa_handler invocation rather than using action.sa_sigaction but again, it gives exactly the same segfault issue when built statically.

Some sort of libc bug perhaps?

GazL 04-25-2010 09:44 PM

I've done a little more digging on this one. I worked with my C version of the code, which is slightly different from the OP's but shows the same symptoms (I'm not much good with C++).

gazl-sig.c:
Code:

#include <sys/time.h>
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t alarmed;
struct sigaction action,oldaction;

void onAlarmSignal( int signal)
{
  printf("Tick!\n");
  ++alarmed;
}

void registerSignal()
{
  action.sa_handler=onAlarmSignal;
  sigemptyset(&action.sa_mask);

  sigaction(SIGALRM,&action,&oldaction);
}


void startTimer()
{
  struct itimerval value;
  value.it_interval.tv_sec=0;
  value.it_interval.tv_usec=100;
  value.it_value=value.it_interval;
  setitimer(ITIMER_REAL, &value, NULL);
}

int main( int argc, const char *argv[])
{
  alarmed=0;
  registerSignal();
  startTimer();

  do
    ;
  while (alarmed <10);

  return 0;
}

Then I compiled it with:
gcc -Wall -O0 -g -static gazl-sig.c -o sig-static

Now for the interesting bit... gdb time.
Code:

(gdb) run
Starting program: /tmp/sig-static
Tick!

Program received signal SIGSEGV, Segmentation fault.
0x00000000004002d1 in onAlarmSignal (signal=1) at gazl-sig.c:12
12      }
(gdb) disassemble
Dump of assembler code for function onAlarmSignal:
0x00000000004002ac <onAlarmSignal+0>:  push  %rbp
0x00000000004002ad <onAlarmSignal+1>:  mov    %rsp,%rbp
0x00000000004002b0 <onAlarmSignal+4>:  sub    $0x10,%rsp
0x00000000004002b4 <onAlarmSignal+8>:  mov    %edi,-0x4(%rbp)
0x00000000004002b7 <onAlarmSignal+11>:  mov    $0x46f824,%edi
0x00000000004002bc <onAlarmSignal+16>:  callq  0x4010c0 <puts>
0x00000000004002c1 <onAlarmSignal+21>:  mov    0x296009(%rip),%eax        # 0x6962d0 <alarmed>
0x00000000004002c7 <onAlarmSignal+27>:  add    $0x1,%eax
0x00000000004002ca <onAlarmSignal+30>:  mov    %eax,0x296000(%rip)        # 0x6962d0 <alarmed>
0x00000000004002d0 <onAlarmSignal+36>:  leaveq
0x00000000004002d1 <onAlarmSignal+37>:  retq 
End of assembler dump.

The segfault is occurring on the retq instruction.

The next question is: where is it trying to return to? So, let's set a breakpoint and examine the return address on the stack just before it tries to return:
Code:

(gdb) break *0x4002d1
Breakpoint 1 at 0x4002d1: file gazl-sig.c, line 12.
(gdb) run
Starting program: /tmp/sig-static
Tick!

Breakpoint 1, 0x00000000004002d1 in onAlarmSignal (signal=1) at gazl-sig.c:12
12      }
(gdb) x/a $rsp
0x7fff1800d3f8: 0xf0000000fc0c748
(gdb) x 0xf0000000fc0c748
0xf0000000fc0c748:      Cannot access memory at address 0xf0000000fc0c748
(gdb)

I may be misinterpreting it, but it looks like it's trying to return to the middle of nowhere.

The shared lib version seems to work in a completely different manner and actually returns to some executable code on the stack (I guess that's how shared libraries work).

Anyway, that's about as far as I can go with this. I don't have the knowledge to dig any deeper.

NoStressHQ 04-26-2010 01:41 AM

Thanks all for your feedback.

So, some simple facts:
- it happens only on 64-bit Slackware (it has been tested successfully on Ubuntu 64)
- it happens only with static linking.
- it happens on a callback invoked by the system (the signal handler)
- it happens systematically, even with an empty function, so there's no 'memory/stack/buffer overrun' here.

My obvious guess is that the 'retq' doesn't have the same 'size' as the call (as if it had been called by a 'call', not a 'callq'). That would explain the totally broken pointer.

I'm missing some 'underground' knowledge here, so I 'guess' that the caller is the kernel itself. And if I'm not mistaken, the kernel silently supports 32bit binaries 'as well' (nothing to do with 'external shared libraries'; I'm talking from a static linkage point of view). But to make the 'right call' (calling the signal handler as a 32bit or a 64bit handler) there must be 'somewhere' the kernel gets this info from.

I mean... "simply"... I think the 64bit kernel can handle both 32bit and 64bit processes... I think the statically linked binary might be tagged as '32bit' somewhere... But a dump of the ELF info still shows a 64bit binary (ld does a good job)... So it might be when glibc registers the sigaction, or something else done directly by the badly built process, that sends the kernel wrong information.

I guess most of this glue code is in glibc, and/or tightly coupled with some gcc crtX.o runtime.
I've looked at the 'gcc' package SlackBuild and it seemed alright: it should take care of 64bit (and it does for shared libs), and at first sight it does what's expected. But I can't help suspecting the static glibc libs of being built with some wrong option...

So I don't have any new direction to look into. I have this 'guess' but don't know how to prove or disprove it, and I don't know how to find the 'guilty one' in that chain.

Is anybody who works on the Slackware x86_64 build around?

It might just take a SlackBuild 'hack' to fix.

Thank you all for the support.

Cheers

Garry.

bgeddy 04-26-2010 12:15 PM

Just some more information to be going on with as this intrigues me.

I have turned the source into an Eclipse C++ project and put in the appropriate -static linker flags for Eclipse. This builds a statically linked executable (as, just to be certain, "file my_alarm" confirms for me). The resulting binary segfaults as usual when run from the CLI but runs OK from within the Eclipse IDE! Hmm, strange.

NoStressHQ 04-26-2010 06:50 PM

Quote:

Originally Posted by bgeddy (Post 3948223)
Just some more information to be going on with as this intrigues me. [...]

Thanks for the help. This makes me wonder whether there is 'something' different between a process spawned by Eclipse and a CLI launch. I mean, I thought that the kernel somehow (ELF info?) gets the binary's 'bit size'. I assume that the binary you're using is the same whether you launch it from Eclipse or from the CLI... So... is there a way for the calling process to tell the system which 'bit depth mode' the binary is in? Maybe it's the 'fork/exec' pair that copies Eclipse's 64bit 'flags' to its child process and bypasses the ELF info baked into the binary? (Big guess here, but the fork would explain that; on the other hand, the shell forks too to launch a binary...)

This is an interesting new behavior, yet it's still a mystery! :)

Cheers

Garry.

salemboot 04-27-2010 12:57 AM

I was looking at your code.

Looks like you may be relying on the compiler to fix up your code.

You're writing C syntax in a C++ compiler.

#include <iostream>

using namespace std;

namespace foo
{

void main( void )
{
cout << "my message";
}
}

int main()
{
foo::main();
return 0;
}


Chances are the C compiler or the glibc could be broken though. I remember back in the day there was a return error that needed patching when you upgraded GCC. EGCS or something.

This machine is Ubuntu 9.10, so its compiler version is


# gcc -v

Thread model: posix
gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu9)

Check your version and check google to see if there are reports for that version of the compiler.

NoStressHQ 04-28-2010 03:49 PM

I've run a test to check my guess... I thought the caller was doing some 32bit call to the 64bit callback...

So I 'hacked' the callback this way :

Code:

        void        _onAlarmSignal(int        signal,siginfo_t* sigInfo,pvoid pUContext) {
                printf("Tick !\n");
                ++alarmed;

                asm        (        "leaveq\n\t"
                                "retw\n\t"
                        );
        }

This forces a narrower ret ('retw' is actually the 16-bit form; x86-64 has no 32bit near ret), but it doesn't fix the crash... So my guess was wrong. It's not related to a call size mismatch...

Anyone else for some clue here ?

Note: You can write this code in ASM it'll still crash... That problem is not a 'religious syntax problem' C++ vs C or whatever, it's about static standard library build... It's a 'system programming' problem, not a "I don't know how to write this code". This is a bug test case, doesn't represent the 'real life code' of course.

Thanks !

Cheers.

Garry.

---- In case it's useful ----
Target: x86_64-slackware-linux
Configured with: ../gcc-4.4.3/configure --prefix=/usr --libdir=/usr/lib64 --enable-shared --enable-bootstrap --enable-languages=ada,c,c++,fortran,java,objc --enable-threads=posix --enable-checking=release --with-system-zlib --with-python-dir=/lib64/python2.6/site-packages --disable-libunwind-exceptions --enable-__cxa_atexit --enable-libssp --with-gnu-ld --verbose --disable-multilib --target=x86_64-slackware-linux --build=x86_64-slackware-linux --host=x86_64-slackware-linux
Thread model: posix
gcc version 4.4.3 (GCC)
--------------------------------

GazL 04-28-2010 06:05 PM

Ok, found out a little more.
By debugging the shared version of the program I've found that the return address on the stack
points to symbol __restore_rt in libc.so.6:
Code:

(gdb) c
Continuing.
Tick!

Breakpoint 2, 0x0000000000400601 in onAlarmSignal (signal=1) at gazl-sig.c:12
12      }
(gdb) x/a $rsp
0x7fff703940b8: 0x7fc42871d450 <__restore_rt>
(gdb) info shared
From                To                  Syms Read  Shared Object Library
0x00007fc428a5ba90  0x00007fc428a73ed4  Yes        /lib64/ld-linux-x86-64.so.2
0x00007fc4287087e0  0x00007fc42880a4a4  Yes        /lib64/libc.so.6
(gdb)

Now when we look at the return address in the statically linked program, we can see
that __restore_rt is at a different location than that of the return address on the top of the stack (which actually looks like an int to me rather than an address):
Code:

(gdb) c
Continuing.
Tick!

Breakpoint 2, 0x00000000004002d1 in onAlarmSignal (signal=1) at gazl-sig.c:12
12      }
(gdb) x/a $rsp
0x7fffa3a31a78: 0xf0000000fc0c748
(gdb) info address __restore_rt
Symbol "__restore_rt" is at 0x400af0 in a file compiled without debugging.
(gdb) info symbol __restore_rt
__restore_rt in section .text

Now, if I manually change that return address on the stack to point to __restore_rt, the program seems to continue correctly (for 1 iteration, at which point the stack has the wrong value again):
Code:

(gdb) set {long} 0x7fffa3a31a78 = __restore_rt
(gdb) x/a $rsp
0x7fffa3a31a78: 0x400af0 <__restore_rt>
(gdb) c
Continuing.
Tick!

Breakpoint 2, 0x00000000004002d1 in onAlarmSignal (signal=1) at gazl-sig.c:12
12      }
(gdb)

Quite why the return address on the stack isn't pointing at "__restore_rt" I have no idea, but that looks like it's what's going wrong.

bgeddy 04-28-2010 09:36 PM

Yes, I figured the stack must somehow be getting messed up and trashing the return address. I have not, however, been able to give this as much attention today as I had hoped, as my development environment has got trashed and needs fixing (a long story; suffice to say Eclipse can be an absolute nightmare). It would be nice to pinpoint what is causing the frame to get messed up like this.

Nice one for the detective work and keeping us posted !

NoStressHQ 05-03-2010 11:23 PM

Hi,

Thanks for the trace, that's indeed what I got too.

Meanwhile I tried to find some information.

restore_rt is a special symbol used by glibc (look in the 'sigaction.c' of the appropriate architecture). I've read that when a handler is registered, glibc passes that symbol to the kernel, which then puts it on the stack as the return address for the signal handler. Sorry, I can't find where I read that, but you should be able to find this info by searching around "signal", "restore_rt" and similar keywords.

Also, as I have a lot of statically compiled programs that don't need signals, I've found that trying to trace such a program with gdb makes gdb freeze quite quickly (you might need two source files and a call from main)... So those programs work well (if there's no bug ;) ) because they don't use signals, but when I try to trace them (if there is a bug :( ) gdb quickly freezes. At first I thought it was ddd, but CLI gdb does the same. (EDIT: After some more tests I'm not 100% sure about that; it seems gdb just takes ages sometimes, but it's still far longer than what I experienced on Slack32.)

I'm pretty sure that this thread talks about exactly the same problem (but with no solution): http://www.gossamer-threads.com/lists/openssh/dev/47519

So I still think that the static build of the glibc libraries is somehow broken (maybe vs the static build of gcc+gdb, and so on...).

It seems that if we don't find it ourselves, we're stuck :).

I sincerely think that even if 'static builds' are not so common nowadays, they should work; there are quite a few situations that require them.

So we have to debug our Slackware64 build so as not to be put to shame by Ubuntu users ;).

Cheers

Garry.

