SEGMENTATION FAULT using gcc 4.4.4 -O2 , works with gcc 4.1.0 -O2 or gcc 4.4.4 -O1
Hi,
I'm trying to update a relatively old piece of software so that it can be used on new 64-bit systems and with a newer version of gcc. Because the original software was written for 32-bit systems, I decided to use controlled data types that are the same size on both 64-bit and 32-bit machines, and I changed the code to use these newly defined types. Here is the situation:
1- As I expected, everything works on a 32-bit machine.
2- With gcc 4.1.0 on a 64-bit machine, everything works.
3- With gcc 4.4.4 on a 64-bit machine, a segmentation fault occurs (optimization -O2).
4- With gcc 4.4.4 on a 64-bit machine and -O, everything works!
Here is the output of Valgrind:
==13412==
==13412==
==13412== Process terminating with default action of signal 11 (SIGSEGV)
==13412== Access not within mapped region at address 0x700000008
==13412== at 0x409F29: SysString::clear(Integral::CMODE) (sstr_03.cc:1623)
==13412== by 0x40B73D: SysString::assign(unsigned char, wchar_t const*) (sstr_03.cc:838)
==13412== by 0x403474: SysString::diagnose(Integral::DEBUG) (sstr_02.cc:221)
==13412== by 0x401E1C: main (in /usr/local/isip/tools/ifc/class/system/SysString/SysString.exe)
==13412== If you believe this happened as a result of a stack
==13412== overflow in your program's main thread (unlikely but
==13412== possible), you can try to increase the size of the
==13412== main thread stack using the --main-stacksize= flag.
==13412== The main thread stack size used in this run was 16777216.
==13412==
==13412== For counts of detected and suppressed errors, rerun with: -v
Why am I getting a segmentation fault in case 3? tnx |
Hi -
As I'm sure you know, just because some code happens to run without crashing, doesn't necessarily mean that code is "correct". There could have been a latent bug there since Day One. On the other hand (as Valgrind is reporting), maybe you're getting a stack overflow. Certainly worth instrumenting and looking for: It looks like the code in question is trying to emulate Windows MFC functionality (which, itself, is probably fraught with danger ;)). STRONG SUGGESTION: 1. See if you can reproduce the problem with "-g" 2. If so, see if you can troubleshoot whether your input values are correct and your data structures are uncorrupted, and your stack OK under GDB. 3. You might also be interested in using libsigsegv() for your troubleshooting: http://savannah.gnu.org/projects/libsigsegv/ 'Hope that helps .. PSM |
Quote:
-Wall -Wextra -Wformat=2 during compilation - sometimes warnings produced by the compiler give the clue. |
Quote:
Actually, I have already tried different stack sizes and it does not help. This code is part of a bigger and relatively complex system :D I have compiled with -g, and this is the result from gdb:
(gdb) r
Starting program: /usr/local/isip/tools/ifc/class/system/SysString/SysString.exe
diagnosing class SysString: testing required public methods...
<SysString::str1> value_d = (16 >= 16) "hello my name is"
<SysString::str2> value_d = (4 >= 4) "rjck"
<SysString::str3> value_d = (100 >= 0) ""
<SysString::str4> value_d = (4 >= 4) "rjck"
testing class-specific public methods: extensions to required methods...
Program received signal SIGSEGV, Segmentation fault.
SysString::clear (this=0x7f00000000, cmode_a=Integral::RESET) at sstr_03.cc:1623
1623 if (capacity_d > 0) {
Current language: auto; currently c++
(gdb)
I will work with libsigsegv to see whether it can help. Anyway, thanks for the reply |
Quote:
To find out whether it is a strict-aliasing problem, replace the -O2 option with -O2 -fno-strict-aliasing. If that fixes it, the problem was probably strict-aliasing (though that wouldn't be certain). If -fno-strict-aliasing doesn't fix the problem, then the problem definitely wasn't strict-aliasing. If the problem is strict-aliasing, it is best to find and fix that error in each place where it occurs in your code. But in large old programs that usually isn't practical, so -fno-strict-aliasing becomes a long-term part of your compile command. |
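For readers who have not met the term, here is a minimal sketch of the kind of code that -fno-strict-aliasing works around, together with the usual portable fix. Nothing here is taken from SysString: float_bits is a hypothetical helper, and the sketch assumes float is a 32-bit IEEE-754 type, as it is on x86 and x86_64.

```cpp
#include <cstring>   // std::memcpy
#include <stdint.h>  // uint32_t

// Hypothetical helper, not from the SysString sources.
// The aliasing-violating way to write this would be:
//     return *(uint32_t*)&f;   // reads a float object through a uint32_t lvalue
// With strict aliasing enabled (the default at -O2), gcc may assume such an
// access never happens, so the result of that version is not reliable.
// Copying the bytes is well defined and is normally optimized to a plain move.
uint32_t float_bits(float f) {
    uint32_t bits = 0;
    // Assumption: sizeof(float) == sizeof(uint32_t).
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Calling float_bits(1.0f) returns the raw bit pattern of the float, and the same memcpy pattern works for any pair of same-sized types.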
Quote:
Seg faults should be pretty easy to understand when you catch them this way in GDB. The this pointer 0x7f00000000 looks a little improbable, but not definitely wrong. GDB commands can be used to examine the *this object and/or the contents of memory at 0x7f00000000 to see whether that pointer is wrong. I don't know whether your Valgrind results were run with the same addresses used as your GDB results. The faulting address reported by Valgrind, 0x700000008, seems quite unlikely for that line of code (a simple read of capacity_d) and the GDB-reported value of the this pointer. If you post a bit more of the source of SysString::clear, that might make the problem obvious. If you know any asm, it is very effective to look at some disassembly and register values in GDB at the point of the seg fault. The seg fault means some address was bad. You need to figure out what address was bad, what the code was supposed to be doing with that address, and why it had a wrong value instead. All that should be pretty easy to find in GDB at the point of the seg fault. |
Quote:
Here is the snap of the code: Code:
// method: clear Code:
// --------------------------------------------------------------
Actually the error tends to move: for example, if I comment out some part of the code, it appears somewhere else! Output of gdb and backtrace:
(gdb) r
Starting program: /usr/local/isip/tools/ifc/class/system/SysString/SysString.exe
diagnosing class SysString: testing required public methods...
<SysString::str1> value_d = (16 >= 16) "hello my name is"
<SysString::str2> value_d = (4 >= 4) "rjck"
<SysString::str3> value_d = (100 >= 0) ""
<SysString::str4> value_d = (4 >= 4) "rjck"
testing class-specific public methods: extensions to required methods...
Program received signal SIGSEGV, Segmentation fault.
SysString::clear (this=0x7f00000000, cmode_a=Integral::RESET) at sstr_03.cc:1623
1623 if (capacity_d > 0) {
Current language: auto; currently c++
(gdb) backtrace
#0 SysString::clear (this=0x7f00000000, cmode_a=Integral::RESET) at sstr_03.cc:1623
#1 0x000000000040b73e in SysString::assign (this=0x7f00000000, arg_a=27 '\033', fmt_a=<value optimized out>) at sstr_03.cc:838
#2 0x0000000000403475 in SysString::diagnose (level_a=<value optimized out>) at sstr_02.cc:221
#3 0x0000000000401e1d in main ()
(gdb) |
Quote:
Try setting a breakpoint earlier and see where this pointer is coming from. |
Quote:
An error that moves like that, usually is a memory clobber bug: The code with the actual bug uses some memory that doesn't belong to it. Then the error appears when the section of code that does own that memory uses it. A memory clobber bug usually needs to be backtracked in two stages. First you need to follow the bad value (the this pointer in your example) back to the memory location where it was clobbered. Then you need to restart and set a data breakpoint to catch the real bug (In GDB I don't know how, nor even the correct terminology. I'm usually chasing such bugs in Visual Studio). The info you posted makes it much more likely that the this pointer is bad (otherwise GDB is wrong about the line number, which is possible, but less likely). You also showed that the this pointer came through SysString::assign. So you should be looking in SysString::assign, or more likely the code that called it, for the point where this got clobbered. |
I have found something that might be related to the problem:
If I use gdb and put a breakpoint just before the segmentation fault occurs in sstr_02.cc (at line 220), examine the value of "value_d" (value_d is a pointer to unichar), then step into the assign function and examine "value_d" again, I see this:
(gdb) r
Starting program: /usr/local/isip/tools/ifc/class/system/SysString/SysString.exe
testing class SysString
diagnosing class SysString: testing required public methods...
<SysString::str1> value_d = (16 >= 16) "hello my name is"
<SysString::str2> value_d = (4 >= 4) "rjck"
<SysString::str3> value_d = (100 >= 0) ""
<SysString::str4> value_d = (4 >= 4) "rjck"
testing class-specific public methods: extensions to required methods...
Breakpoint 1, SysString::diagnose (level_a=<value optimized out>) at sstr_02.cc:221
(gdb) p num.value_d
$3 = (unichar *) 0x61fcb0
(gdb) s
SysString::assign (this=0x7f00000000, arg_a=27 '\033', fmt_a=0x41afd8) at sstr_03.cc:818
(gdb) p value_d
$4 = (unichar *) 0x0
(gdb)
On the 32-bit system it looks like this:
(gdb) r
Starting program: /home/amir/local/isip/tools/system-ifc/class/system/SysString/SysString.exe
testing class SysString
diagnosing class SysString: testing required public methods...
testing class-specific public methods: extensions to required methods...
Breakpoint 1, SysString::diagnose (level_a=Integral::BRIEF) at sstr_02.cc:221
(gdb) p num.value_d
$3 = (unichar *) 0x81672e8 L"27"
(gdb) s
SysString::assign (this=0xbfffeb84, arg_a=27 '\033', fmt_a=0x8063ac0 L"asdf = %u xyz") at sstr_03.cc:828
(gdb) p value_d
$4 = (unichar *) 0x81672e8 L"27"
(gdb)
As you can see, for some reason "value_d" is NULL in the first case, which is wrong. How could this happen? |
Quote:
Meanwhile, there is something strange in what you just provided. Can you explain this: In your 64 bit version line 221 in SysString::diagnose called a version of SysString::assign at line 818. But in your 32 bit version line 221 in SysString::diagnose called an apparently different version of SysString::assign at line 828. If you don't have a good explanation for that, post the area around each of those lines (around 221 in sstr_02.cc as well as around 818 through 828 in sstr_03.cc). |
Quote:
Here is the code:
bool8 SysString::assign(byte8 arg_a, const unichar* fmt_a) {   <--- Line 818

  // allocate a static buffer for printing
  //
  static char buf[MAX_LENGTH];
  static char fmt[MAX_LENGTH];
  static char* fmt_ptr;

  // check the arguments
  //
  if (fmt_a == (unichar*)NULL) {   <--- Line 828
    return Error::handle(name(), L"assign", Error::ARG, __FILE__, __LINE__);
  }

  SysString temp(fmt_a);
  temp.getBuffer((byte8*)fmt, MAX_LENGTH);
  fmt_ptr = fmt;

  // clear out the current value
  //
  clear(Integral::RESET);

  // create and possibly assign the string
  //
  if (sprintf(buf, fmt_ptr, (uint32)arg_a) > 0) {
    assign((byte8*)buf);
    return true;
  }

  // exit gracefully
  //
  return false;
}
I think it is a gdb issue that shows line 828 instead of 818 |
OK, now I see I misunderstood GDB output regarding 818 vs. 828. That is just a difference in the optimizer behavior of the two compiles.
I don't know how much to trust GDB regarding the values of this and value_d when stopped at line 818. Generally I don't trust any implausible variable values reported by GDB. GDB and/or the compiler are not very good at tracking which variables are in which registers and/or stack locations at which lines of the source code. zirias expressed the opinion (that I mostly share) that 0x7f00000000 is an unreasonable value for this. You told me that value_d is a member of SysString, so at line 818 value_d should be equivalent to this->value_d, which (assuming this is invalid) should have been reported as "Cannot access memory at address" rather than "$4 = (unichar *) 0x0". If I were debugging it, I would poke around a bit more at that point to find out which, if any, of the apparently contradictory pieces of info represent the result of the bug you're looking for, and which represent wrong info displayed by GDB. At 818, and maybe one s further into that function, I would want to know the values of this, &value_d and this->value_d. If those don't start to add up to something consistent, I'd look at the disassembly of the code at that point and at register values, and also try looking directly at memory at address 0x7f00000000 |
Quote:
221 num.assign(dbyte, L"asdf = %u xyz");
Current language: auto; currently c++
(gdb) p &num
$10 = (SysString *) 0x7fbfffe2a0
(gdb) p num.value_d
$1 = (unichar *) 0x61fcb0
(gdb) s
SysString::assign (this=0x7f00000000, arg_a=27 '\033', fmt_a=0x41afd8) at sstr_03.cc:818
818 bool8 SysString::assign(byte8 arg_a, const unichar* fmt_a) {
(gdb) p this
$2 = (SysString * const) 0x7f00000000
(gdb) p &value_d
$3 = (unichar **) 0x7f00000000
(gdb) p this->value_d
$4 = (unichar *) 0x0
(gdb) s
828 if (fmt_a == (unichar*)NULL) {
(gdb) p this
$5 = (SysString * const) 0x7f00000000
(gdb) p &value_d
$6 = (unichar **) 0x7f00000000
(gdb) p this->value_d
$7 = (unichar *) 0x0
(gdb)
Now what? Shouldn't &num and this point to the same memory? |
That's a bit of a surprise. 0x7f00000000 seems to be a valid address.
So that makes the original seg fault less plausible. Your post #9 makes it look like there was a seg fault at Code:
SysString::clear (this=0x7f00000000, cmode_a=Integral::RESET) at sstr_03.cc:1623 The latter should be easy to determine by just setting a breakpoint there and proceeding to it and seeing what this and capacity_d and &capacity_d all are. Assuming GDB might be wrong about the line number of the seg fault, we'd also like to know what value_d is at that point. Edit: Sorry, I wasn't thinking clearly. We already know value_d was bad before it reached there, so we can reasonably assume GDB is wrong about the line number of the seg fault and you need to look earlier not later to find out when/why value_d was clobbered. |
Quote:
the address of num is 0x7fbfffe2a0 and just after calling it changes What kind of things could do this? |
Quote:
Now, I assume you showed us that because you think num at line sstr_02.cc:221 is the same object as *this at line sstr_03.cc:818. So you should validate that by having gdb give the value of &num at line sstr_02.cc:221. Edit: you answered that while I was asking the question. |
Quote:
Quote:
That is exactly the kind of bug typical of an error in porting 32 bit code to 64 bit. This could easily be caused by the immediately preceding object in memory being stored as 64 bits into a 32 bit allocated space overwriting the next 32 bits with zero. Note I mean the object preceding the pointer to num, not the object preceding num itself. I could be a lot more specific if I saw all the code from the declaration of num through line 221. |
Quote:
Breakpoint 1, SysString::diagnose (level_a=Integral::BRIEF) at sstr_02.cc:221
221 num.assign(dbyte, L"asdf = %u xyz");
(gdb) p &num
$3 = (SysString *) 0xbfffebc4
(gdb) s
SysString::assign (this=0xbfffebc4, arg_a=27 '\033', fmt_a=0x8063ac0 L"asdf = %u xyz") at sstr_03.cc:828
828 if (fmt_a == (unichar*)NULL) {
(gdb) p this
$4 = (SysString * const) 0xbfffebc4
(gdb)
So &num and this are the same. On 64-bit:
Breakpoint 1, SysString::diagnose (level_a=<value optimized out>) at sstr_02.cc:221
221 num.assign(dbyte, L"asdf = %u xyz");
(gdb) p &num
$15 = (SysString *) 0x7fbfffe2a0
(gdb) s
SysString::assign (this=0x7f00000000, arg_a=27 '\033', fmt_a=0x41afd8) at sstr_03.cc:818
818 bool8 SysString::assign(byte8 arg_a, const unichar* fmt_a) {
(gdb) p this
$16 = (SysString * const) 0x7f00000000
(gdb)
This shows that the address of num has changed, right? |
Quote:
Code:
// make sure that an empty string fails |
That makes it look like the call to num.get(dbyte_v); did the harm.
Is the source code to get posted yet? num is a local variable in the current stack frame. GDB deduces its address (at line 221) from the rbp register. That register must not be corrupted or gdb would be totally confused at that point. The only way to get this symptom is if the optimizer had put the address of num into a callee-saved register (because it is used so much) rather than recompute it from rbp each time. That's strange, because recomputing it from rbp is nearly free compared to simply copying it into rdi (all that would make sense if you knew x86_64 asm). |
Quote:
Code:
// method: get |
I'm pretty sure this is the bug:
Code:
uint32 val = 0; val is only 32 bits. If you don't understand, post the definition of DEF_FMT_LONG_8BIT and I can explain more specifically. In architectures where uint32 is the same size as long, this code works. In architectures where long is bigger, this code clobbers the low half of a register saved by the entry into this function. It then returns to the caller with the wrong value in a register. Whether/how that matters depends on details of the optimization in that calling function. As I started to explain in post #24, I had deduced that the optimizer placed an extra copy of the address of num in a callee saved register that happens to be the register clobbered by the bug in get. |
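To make the diagnosis above concrete, here is a minimal sketch of the safe pattern. The SysString::get source is not shown in full in this thread, so parse_u32 is a hypothetical stand-in and the "buggy shape" in the comment is an assumption reconstructed from the description, not the actual code.

```cpp
#include <cstdio>    // std::sscanf
#include <stdint.h>  // uint32_t

// Hypothetical stand-in for one of the SysString::get overloads.
// The buggy shape described above is roughly:
//     uint32 val = 0;
//     sscanf(text, "%lu", &val);   // stores sizeof(long) bytes into a 4-byte object
// Where long is 64 bits, that store writes 8 bytes and overwrites whatever
// happens to sit next to val (here, a saved register on the stack).
bool parse_u32(const char* text, uint32_t& out) {
    unsigned long tmp = 0;  // exactly the type "%lu" expects, on every architecture
    if (std::sscanf(text, "%lu", &tmp) != 1) {
        return false;       // no number could be read
    }
    if (tmp > 0xFFFFFFFFul) {
        return false;       // value does not fit in 32 bits
    }
    out = static_cast<uint32_t>(tmp);
    return true;
}
```

The temporary always matches the conversion specifier, and the narrowing to the fixed-size type happens in ordinary C++ code where the compiler knows both sizes.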
// constants: 8-bit version of the default format strings (for efficiency)
//
const char SysString::DEF_FMT_VOIDP_8BIT[] = "%p";
const char SysString::DEF_FMT_ULONG_8BIT[] = "%lu";
Could this be the problem? Your comment makes sense to me, but I wonder whether this could cause a segmentation fault. |
I think I can correct this part, and then I will report the result here. Thanks, your comment really makes sense now after some thinking ;)
|
Quote:
I asked for DEF_FMT_LONG_8BIT, but you posted two other format strings. Anyway, this is the problem. There is a similar problem in more than one of your overloads of get. So fix it in each place, not just in the one that is causing this specific seg fault. |
Quote:
If I remove all the 'l's from the format strings, can I be hopeful that this code will not break again for this reason? |
Quote:
Code:
printf("some_int_var=%lld\n", (long long)some_int_var); |
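The cast in the printf line above can be wrapped so that callers never have to think about it. This is only a sketch, and format_signed is a hypothetical name. Note that the trick works for printf because arguments are passed by value; it does not carry over to sscanf, which receives a pointer and needs the pointed-to object to really have the type the conversion names.

```cpp
#include <cstdio>    // std::snprintf
#include <stdint.h>  // int16_t, int32_t
#include <string>

// Hypothetical helper: every signed integer type converts implicitly to
// long long at the call site, so "%lld" always matches the argument.
std::string format_signed(long long value) {
    char buf[32];  // room for a 64-bit value, a sign and the terminator
    std::snprintf(buf, sizeof buf, "%lld", value);
    return std::string(buf);
}
```

A call such as format_signed(some_int_var) then behaves the same whether some_int_var is 16, 32 or 64 bits wide.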
Quote:
bool8 SysString::get(int16& val_a) const {

  // declare local variable
  //
  int32 tmp_val = 0;

  // use the 8-bit character conversion
  //
  if (sscanf((char*)(byte8*)(*this), (char*)DEF_FMT_LONG_8BIT, &tmp_val) != 1) {
    return false;
  }

  // set the output
  //
  val_a = tmp_val;

  // exit gracefully
  //
  return true;
}
Generally I want my variables to use a fixed number of bits, so that they are exactly the same on all machines. For the formats we have:
const char SysString::DEF_FMT_LONG_8BIT[] = "%ld";
Now if I change the formats to, for example, this:
const char SysString::DEF_FMT_LONG_8BIT[] = "%d";
can I be hopeful of getting the same result on, for example, 128-bit machines too? I mean, is it likely that the meaning of "%d" will change again? |
Quote:
In your design, you want val_a to be an explicit size independent of architecture. That makes sense. But you are being too strict in deciding that tmp_val must also be an explicit size independent of architecture. tmp_val exists only to interface to sscanf, and sscanf does not work with explicit sizes independent of architecture. So you should change your objective to just making sure tmp_val is at least as big as val_a. There are lots of ways of doing that while declaring tmp_val as some architecture-specific size that is compatible with sscanf.
Mainly your problem is wrapped up in your choice of using sscanf at all. This is C++ code. sscanf is a lame holdover from C. If you were using some kind of stringstream as the text-side source and using operator>> instead of sscanf, then the operator overloading of streams would fit the format to the destination automatically, rather than requiring all this work on your part to do so.
I have no idea what a 128-bit architecture would look like. Lots of different things are different sizes in each architecture. But the virtual address size has been the primary driver of the size naming of architectures. You might think that the exponential growth from 16-bit virtual addresses to 32-bit virtual addresses to 64-bit virtual addresses would logically continue to 128 bit. But it won't: 16-bit virtual addresses were already too small when 16-bit x86 was introduced and were horribly too small by the time 32-bit x86 was introduced. 32-bit virtual addresses were plenty large enough when introduced and were still mostly large enough when 64 bit was introduced. 32 bits were closer to adequate when 64 bits were introduced than 16 bits were when 16-bit x86 itself was introduced. In that sense the available addressing doubled twice while the required addressing only really doubled once. Then the exponential growth in problem size just needs an exponential growth in memory, which is only a linear growth in address size.
So the jump from 32 to 64 was, in another way, twice the jump from 16 to 32. So 64-bit addressing should be plenty for at least four times longer than 32-bit addressing was plenty. So I think you're trying too hard to guess distant future portability issues. |
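To make the stringstream suggestion above concrete, here is a minimal sketch. parse_number is a hypothetical helper, not part of the SysString class, and it takes a plain std::string rather than the class's own buffer.

```cpp
#include <sstream>   // std::istringstream
#include <stdint.h>  // int16_t, int32_t
#include <string>

// Hypothetical helper, not from the SysString sources.
// operator>> is selected by the compiler from the destination type, so there
// is no format string that could disagree with the variable's real size.
// Caveat: for 8-bit types (char, int8_t, uint8_t) operator>> reads a single
// character, not a number, so those need a wider temporary.
template <typename T>
bool parse_number(const std::string& text, T& out) {
    std::istringstream in(text);
    T tmp = T();
    if (!(in >> tmp)) {
        return false;  // non-numeric text, or a value that does not fit in T
    }
    out = tmp;
    return true;
}
```

With this shape, a get(int16&) overload and a get(int32&) overload can share one body, and porting to a platform with a different long changes nothing.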
Quote:
Anyway, thanks a lot. All of you, especially john, were very helpful. |
Quote:
Secondly, you need fixed sizes only if/when you deal with HW-generated data, e.g. if/when you deal with, say, an Ethernet packet. I don't see a case where one needs "bool8" (taken from http://www.linuxquestions.org/questi...ml#post4046144 ). I.e. I would let the compiler choose the width of the 'bool' type. Anyway, if you want to complicate things, you can still use constructs like Code:
if(sizeof(my_int_var) == sizeof(int)) This can be scripted (i.e. C++ code can be generated by a script) and can probably be implemented through templates. Still, rethink the whole issue of imposed size variables. |
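As a rough illustration of the template route mentioned above, the format can be selected from the variable's actual type at compile time instead of by comparing sizeof values at run time. ScanFormat and scan_value are hypothetical names made up for this sketch.

```cpp
#include <cstdio>  // std::sscanf

// Hypothetical trait mapping a C++ type to the sscanf conversion that matches it.
// The primary template is left undefined, so an unsupported type fails to compile
// instead of silently using a mismatched format.
template <typename T> struct ScanFormat;
template <> struct ScanFormat<short>          { static const char* get() { return "%hd"; } };
template <> struct ScanFormat<int>            { static const char* get() { return "%d";  } };
template <> struct ScanFormat<unsigned int>   { static const char* get() { return "%u";  } };
template <> struct ScanFormat<long>           { static const char* get() { return "%ld"; } };
template <> struct ScanFormat<unsigned long>  { static const char* get() { return "%lu"; } };

// Reads one value of type T from text; out is left untouched on failure.
template <typename T>
bool scan_value(const char* text, T& out) {
    T tmp = T();
    if (std::sscanf(text, ScanFormat<T>::get(), &tmp) != 1) {
        return false;
    }
    out = tmp;
    return true;
}
```

Because a typedef such as uint32 is just an alias for one of the built-in types on a given platform, scan_value picks whichever specialization matches what that alias really is there.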