Segmentation Fault

sahil.jammu · 07-06-2009, 01:17 PM

Hi All,

I was working on test cases which were passing successfully on 32 bit RHEL. Recently the system was upgraded to 64 bit. Some of the test cases now fail with a segmentation fault, and a core dump is generated.

Seems like pointers are behaving strangely. I am new to this and haven't looked into any core dump. Can anyone provide pointers on what could have gone wrong? As per my understanding, a segmentation fault occurs when a program reads memory outside its allotted memory buffer space, which indicates a memory leak or invalid pointers leading to a corrupt memory address to read and write. Is there any specific difference in pointer behavior between a 32 bit OS and a 64 bit OS?

I used ddd to open the core dump and did a backtrace to understand the flow of the code.

In crash line (Filename_c:492) debugger shows this value:
(gdb) print phch_iter->next
$2 = (struct _WCDSP_PHCH_STATUS_STR *) 0x0

But not this one:
(gdb) print phch_last->next
Cannot access memory at address 0x44

Kindly provide your views.

Regards
Sahil

johnsfine · 07-06-2009, 01:36 PM

One important detail that isn't clear from your post:

Did you run a 32 bit binary of your program under the 64 bit kernel? Or did you recompile your source code into a 64 bit binary and run that?

Quote:

Originally Posted by sahil.jammu

Seems like pointers are behaving strangely, i am new to this and haven't looked into any core dump.

The first thing one usually looks at in a core dump is the call stack (the list of which function called which down to the point of failure). If you know how your code is supposed to function, that is usually enough to spot the bug.

Quote:

program reads memory outside allotted memory buffer space

Technically true, but that is a misleading way to describe it.

Quote:

which indicates memory leak or invalid pointers leading to corrupt memory address to read and write.

Invalid pointer (not a memory leak).
But that is just a symptom. The trick is to find the cause of the invalid pointer.

Quote:

Any specific difference in pointers behavior in 32 bit OS and 64 Bit OS

Pointers do exactly what the source code tells them to do. The pointer is inside the application program. It is not influenced by the OS.

A 64 bit OS lets you run 64 bit programs. It doesn't require that you run 64 bit programs.

There are lots of ways that programmers may have accidentally or intentionally assumed that a pointer is 32 bits. Recompile for 64 bits and the assumptions are wrong and the code will crash.
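
A typical example of such an assumption is keeping a pointer value in a 32 bit integer. The sketch below is only an illustration: the function names are made up and nothing in it is taken from the code being discussed.

```c
#include <stdint.h>

/* Hypothetical helper showing the mistake: the pointer value is squeezed
   into a 32 bit integer. On i386 nothing is lost. On x86_64 the upper
   32 bits of the address are thrown away. */
uint32_t stash_in_32_bits(const void *p)
{
    return (uint32_t)(uintptr_t)p;
}

/* Portable alternative: uintptr_t is an integer type wide enough to hold
   a pointer converted from void *. */
uintptr_t stash_in_uintptr(const void *p)
{
    return (uintptr_t)p;
}

/* Returns 1 if the pointer value is unchanged after a round trip through
   a 32 bit integer, 0 if address bits were lost. */
int survives_32_bit_round_trip(const void *p)
{
    return (uintptr_t)stash_in_32_bits(p) == (uintptr_t)p;
}
```

In a 32 bit build survives_32_bit_round_trip always returns 1. In a 64 bit build it returns 0 for any address above 4 GB, which is where stack and shared library addresses usually are.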

There are fewer ways (but not zero) that running a 32 bit binary under a 64 bit kernel would violate similar bad assumptions made by the programmer.

Quote:

Originally Posted by sahil.jammu

(gdb) print phch_iter->next
$2 = (struct _WCDSP_PHCH_STATUS_STR *) 0x0

But not this one:
(gdb) print phch_last->next
Cannot access memory at address 0x44

I don't remember gdb print syntax in enough detail to be sure, but that all seems to mean:

phch_iter points to valid memory
phch_iter->next contains a 0 (meaning it does not point to a valid next object).
phch_last does not point to valid memory.
The address of phch_last->next is 0x44.

That last detail is strange, especially if you recompiled for 64 bit mode. In such situations, one would expect phch_last to be exactly zero (though checking that is better than guessing it), which would mean the field "next" is at offset 0x44 in the structure, which violates the usual packing rules for x86_64.
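
To make the packing argument concrete: on x86_64 a pointer member of an ordinary (not packed) structure is normally placed at an offset that is a multiple of 8, and 0x44 (68) is not one. The sketch below uses a made-up structure, since the real _WCDSP_PHCH_STATUS_STR has not been posted, to show how the offset the compiler actually chose can be checked.

```c
#include <stddef.h>

/* Hypothetical layout, NOT the real _WCDSP_PHCH_STATUS_STR. */
struct node {
    char payload[0x41];   /* 65 bytes, deliberately not a multiple of 8 */
    struct node *next;    /* the compiler pads before this member */
};

/* Offset of "next" as the compiler laid the structure out. */
size_t next_offset(void)
{
    return offsetof(struct node, next);
}

/* Returns 1 if that offset is a multiple of the pointer alignment
   (8 on x86_64, 4 on i386). _Alignof requires C11. */
int next_is_aligned(void)
{
    return next_offset() % _Alignof(struct node *) == 0;
}
```

Applying offsetof to the real structure in both the 32 bit and the 64 bit build would show whether the offset agrees with what the debugger output implies.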

sahil.jammu · 07-06-2009, 01:46 PM

Hi johnsfine

I have both environments available, 32 bit and 64 bit. The same test case, compiled and executed on 32 bit RHEL, passed,
and similar test cases compiled and executed in the 64 bit environment failed (inconclusive).

I tracked down the place the debugger complains about using the ddd backtrace, but I am not clear what the expected behavior is, since the passing scenario is the 32 bit test case (I am not sure how to check the values of the variables there, since a passing test case produces no core dump to analyze).

In crash line (Filename_c:492) debugger shows this value:
(gdb) print phch_iter->next
$2 = (struct _WCDSP_PHCH_STATUS_STR *) 0x0

But not this one:
(gdb) print phch_last->next
Cannot access memory at address 0x44

Thanks for providing clarifications in other areas.

Regards
Sahil

johnsfine · 07-06-2009, 02:00 PM

One option you may not have realized (it isn't actually difficult): you can take the 32 bit binary compiled on the 32 bit RHEL and run it on the 64 bit RHEL.

Probably there will be a few dependencies missing at first.

Those are 32 bit .so files that can be installed on the 64 bit RHEL but by default aren't.

You could use the ldd command on your 32 bit binary to predict which extra .so files you need, or simply try to run it and look at the error messages.

With the right Yum commands, you can easily install the 32 bit versions of each of the required dependencies on your 64 bit RHEL. Those coexist perfectly with the 64 bit versions you already have. (Any 64 bit binary will automatically find and load the 64 bit .so files, while 32 bit binaries will find and load the 32 bit .so files).

Recompiling your source code to a 32 bit binary on the 64 bit RHEL is also possible, but harder. If you just need to run the 32 bit binary on 64 bit RHEL, it is easiest to compile on your 32 bit RHEL and copy the binary over.

Quote:

Originally Posted by sahil.jammu

similar test cases compiled and executed in 64 bit environment is Failed

As I said before, there are many ways a programmer might accidentally or intentionally assume that pointers are 32 bit.

You don't sound like you are ready for the debugging task needed to find those bugs in your source code.

If you really need to get the 64 bit compiled code to work, I can give some more advice. But if it is a large amount of source code, there are probably several such errors and the first (easy looking) error has you stuck.

That is why I suggest running the 32 bit binary on the 64 bit RHEL.

Quote:

In crash line (Filename_c:492) debugger shows this value:
(gdb) print phch_iter->next
$2 = (struct _WCDSP_PHCH_STATUS_STR *) 0x0

But not this one:
(gdb) print phch_last->next
Cannot access memory at address 0x44

I don't know what you intended to communicate by repeating that.

I thought I had explained why the debugger responds as it did to those two print commands.
It might help to tell us the value of phch_last.

It probably would help to quote the definition of the data structure that should be pointed to by phch_last. I expect it is the same data type _WCDSP_PHCH_STATUS_STR that is pointed to by its "next" pointer.

It might help to quote a few lines of code around the failure point.

sahil.jammu · 07-08-2009, 12:16 PM

Hi Johnsfine,

Sorry for late reply, wasn't feeling well.

I tried doing back tracing using DDD from the point of failure:- phch_last->next = phch_iter->next;

Your analysis:-
-------
phch_iter points to valid memory
phch_iter->next contains a 0 (meaning it does not point to a valid next object).
phch_last does not point to valid memory.
The address of phch_last->next is 0x44.
-------

Values given by debugger:-
-----------------------

phch_last->next
(struct_WCDSP_PHCH_STATUS_ST ...ccess memory at address 0x44)

phch_iter
(WCDSP_PHCH_STATUS_STR*)
0x1140f198

phch_iter->next
(struct _WCDSP_PHCH_STATUS_STR *) 0x0

-----------------------

As you suggested, phch_iter->next contains a zero, which suggests something must have gone wrong at a previous step. It could be the use of a wrong data type: the same thing works fine in 32 bit but not in 64 bit, so it could have something to do with data types. Not sure, just guessing.

From this line of failure i used up in ddd -> (select and prints the stack frame which called this one) , i got this result :-

----
wlfs_frame_rsp( (uint16*)lfs_msg_ptr, length )
----

If i go to this file and check the location of the code:-

Code:-
-----
case FDD_FRAME_RSP:
wlfs_frame_rsp( (uint16*)lfs_msg_ptr, length );
break;
-----

Any inputs from your side.

From the failing frame there is no "down" result, since that is where the segmentation fault occurs. There are multiple "up" results, which lead to the different functions that were called.

Regards
Sahil

sahil.jammu · 07-08-2009, 12:19 PM

I forgot to give the value of phch_last: it is (WCDSP_PHCH_STATUS_STR *) 0x20

sahil.jammu · 07-08-2009, 12:52 PM

Came Across this info:-
The uint* is primarily meant to store integer values. Most operations that manipulate arrays without changing their elements are defined. (Examples are reshape, size, the logical and relational operators, subscripted assignment, and subscripted reference.)
You can define your own methods for uint* (as you can for any object) by placing the appropriately named method in an @uint* directory within a directory on your path.

Also:-
Usually the double data type (in C/C++) and other data representations that require 64 bits work better and faster in a 64 bit environment; however, smaller data (Int32 or Int16 and even Char) incur some penalty due to the longer space and instructions.

Your inputs on mapping this information to our problem?

johnsfine · 07-08-2009, 01:02 PM

You seem to have ignored most of what I said.

I don't see anything in your recent posts that helps in diagnosing the problem further.

Quote:

Originally Posted by sahil.jammu

As you suggested the phch_iter -> next contains a zero, which reflects something must have went wrong at previous step

phch_iter->next containing zero might or might not be a symptom of some earlier problem. It might be perfectly correct. You haven't shown enough code for me to estimate that.

phch_last containing 0x20 (I think that is what you said) would certainly be a symptom of some earlier problem. It cannot be correct.
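
Taking the two reported numbers at face value, the arithmetic works like this: the address the debugger tried to read is the value of the structure pointer plus the offset of the member. A small sketch (the macro and function names are mine, only the two hex values come from the posts):

```c
#include <stdint.h>

/* Address the debugger could not access for phch_last->next. */
#define FAULT_ADDRESS   ((uintptr_t)0x44)
/* Value reported for phch_last itself. */
#define PHCH_LAST_VALUE ((uintptr_t)0x20)

/* Address of a member = value of the struct pointer + offset of the member,
   so the offset of "next" implied by the two numbers is their difference. */
uintptr_t implied_next_offset(void)
{
    return FAULT_ADDRESS - PHCH_LAST_VALUE;   /* 0x24, i.e. 36 */
}
```

That would put "next" at offset 0x24, which is not a multiple of 8 either, so it would still be unusual for a pointer member of an unpacked x86_64 structure. Either way, 0x20 is inside the first page of the address space, which an ordinary Linux process normally has nothing mapped in, so any access through that pointer faults.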

sahil.jammu · 07-08-2009, 01:43 PM

Hi Johnsfine,

Please find the code below:-
-------------
if (phch_iter->trch_first == NULL) {
    if (phch_last == NULL) {
        wcdsp_own_cell.wcdsp_phch_rx_status = phch_iter->next;
        free(phch_iter);
        phch_iter = wcdsp_own_cell.wcdsp_phch_rx_status;
    } else {
        phch_last->next = phch_iter->next;
        phch_last = phch_iter;
        free(phch_iter);
        phch_iter = phch_last->next;
    }
} else {
    phch_last = phch_iter;
    phch_iter = phch_iter->next;
}
}

phch_last containing 0x20 (yes its true as i checked the value using ddd)

As you mentioned this would certainly be a symptom of some earlier problem. It cannot be correct - - Can you plz provide some details about it, how you concluded this..??

Many thnx in Advance..!!

Regards
Sahil

johnsfine · 07-08-2009, 02:06 PM

Quote:

Originally Posted by sahil.jammu

Code:

                      phch_last = phch_iter;
                      free(phch_iter);
                      phch_iter = phch_last->next;

That is very bad code and it might be the bug causing your problem.
That code makes an assumption about the behavior of malloc that could easily change across architectures (such as i386 to x86_64) or even across versions of malloc.

The code frees the object pointed to by phch_iter, then it reads a member variable from that object. It is not correct to assume any part of the object is still valid after the object is freed.

The following code corrects the main problem in the above code, but it still is bad code:

Code:

                      phch_last = phch_iter;
                      phch_iter = phch_last->next;
                      free(phch_last);

In situations where the original code works the improved code would also work. In some situations where the original code breaks, this improved code would work.

But it still is keeping a pointer to a freed object, which hints there are related bugs in sections of the code you didn't show.
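
For what it's worth, here is a sketch of how a removal loop of this kind is usually written so that no pointer to a freed node is kept. It assumes a singly linked list and uses a made-up, cut-down node type. The names struct phch, remove_empty and push_front are illustrations only, not the real code, and the real structure surely has more fields.

```c
#include <stdlib.h>

/* Hypothetical, cut-down node: only the fields the loop uses. */
struct phch {
    void        *trch_first;   /* NULL means "this node should be removed" */
    struct phch *next;
};

/* Removes every node whose trch_first is NULL and returns the new head.
   "last" always points to the most recent node that was KEPT, never to a
   node that has been freed. */
struct phch *remove_empty(struct phch *head)
{
    struct phch *last = NULL;
    struct phch *iter = head;

    while (iter != NULL) {
        struct phch *next = iter->next;   /* read before any free() */

        if (iter->trch_first == NULL) {
            if (last == NULL)
                head = next;              /* removing the first node */
            else
                last->next = next;        /* unlink from the previous kept node */
            free(iter);                   /* last is deliberately left unchanged */
        } else {
            last = iter;
        }
        iter = next;
    }
    return head;
}

/* Helper for trying it out: puts a new node at the front of the list.
   If the allocation fails the list is returned unchanged. */
struct phch *push_front(struct phch *head, void *trch_first)
{
    struct phch *n = malloc(sizeof *n);
    if (n == NULL)
        return head;
    n->trch_first = trch_first;
    n->next = head;
    return n;
}
```

The difference from the posted code is that when a node is removed, last stays on the previous surviving node instead of being moved onto the node that is about to be freed.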

Quote:

phch_last containing 0x20 (yes its true as i checked the value using ddd)

As you mentioned this would certainly be a symptom of some earlier problem. It cannot be correct - - Can you plz provide some details about it, how you concluded this..??

That's pretty basic stuff. It cannot be correct for a pointer in x86_64 to contain the value 0x20. That isn't a NULL pointer, but it also can't point to a valid address.

If you don't know those basics, I don't know how you could expect to find and fix the bugs in your source code.

What about my earlier suggestion to use the 32 bit binaries on the 64 bit system?

sahil.jammu · 07-08-2009, 02:13 PM

void wlfs_frame_rsp( uint16 *lfs_msg_ptr, int length )
{

/* Data Structures */
uint8 i;
uint8 *data_ptr;
uint16 subframe;
uint16 cfn;
uint16 *subframe_ptr;
FDD_FRAME_RSP_STR *fddFrameRsp = NULL;
WCDSP_TRCH_DATA_STR trch_data;
WCDSP_CELL_CONFIGURATION_STR *wcdsp_cell_name_ptr;
WCDSP_SF_CONTROL_STR sf_control[ SUBFRAMES_PER_FRAME ];

WCDSP_PHCH_STATUS_STR *phch_iter;
WCDSP_PHCH_STATUS_STR *phch_last;
WCDSP_TRCH_STATUS_STR *trch_iter;
WCDSP_TRCH_STATUS_STR *trch_tmp1;
uint16 ctfc;
uint8 tfi;
FDD_CHAN_TB_STR *tb_ptr;
uint16 ch_code;
uint8 slot_format;
bool8 del_trch;

/*Code */
trch_data.crc_msw = 0;
trch_data.crc_lsw = 0;

if (del_trch == TRUE) {
if (trch_iter->prev == NULL) {
phch_iter->trch_first = trch_iter->next;
free(trch_iter);
trch_iter = phch_iter->trch_first;
if (trch_iter != NULL) {
trch_iter->prev = NULL;
}
} else {
trch_iter->prev->next = trch_iter->next;
trch_iter->next->prev = trch_iter->prev;
trch_tmp1 = trch_iter;
trch_iter = trch_iter->next;
free(trch_tmp1);
}

if (phch_iter->trch_first == NULL) {
phch_iter->trch_last = NULL;
}
} else {
trch_iter = trch_iter->next;
}
}
if (phch_iter->trch_first == NULL) {
if (phch_last == NULL) {
wcdsp_own_cell.wcdsp_phch_rx_status = phch_iter->next;
free(phch_iter);
phch_iter = wcdsp_own_cell.wcdsp_phch_rx_status;
} else {
phch_last->next = phch_iter->next;
phch_last = phch_iter;
free(phch_iter);
phch_iter = phch_last->next;
}
} else {
phch_last = phch_iter;

phch_iter = phch_iter->next;
}
}
dpch_in_sync = FALSE;
}

} else {
channel_lost_pending = FALSE;
channel_lost_delay = 0;
}
}
return;
}

sahil.jammu · 07-08-2009, 02:24 PM

Thanks for your suggestion. I was in the middle of writing another post, so I hadn't seen your reply. I posted some information about the data structures in the previous post.
I will make this modification and test it.

The following code corrects the main problem in the above code, but it still is bad code:
Code:

phch_last = phch_iter;
phch_iter = phch_last->next;
free(phch_last);

Regarding your earlier suggestion to use the 32 bit binaries on the 64 bit system:
I can't try that, as my 32 bit machine is down with a bus error and an automount NFS issue. I expect the machine to be running again by tomorrow.
Meanwhile I do have access to the 64 bit machine, and to the passing logs of the test case executed on the 32 bit machine.

Thnx for your time.

Regards
Sahil

sahil.jammu · 07-09-2009, 05:07 AM

Hi Johnsfine,

I modified the code as you suggested:-

But I haven't really had any success.

Changes made:-
1.
The following code corrects the main problem in the above code, but it still is bad code:
Code:

phch_last = phch_iter;
phch_iter = phch_last->next;
free(phch_last);

2. void wlfs_frame_rsp( uint16 *lfs_msg_ptr, int length )
{

/* Data Structures */
uint8 i;

void wlfs_frame_rsp( uint *lfs_msg_ptr, int length )

3.
----
wlfs_frame_rsp( (uint16*)lfs_msg_ptr, length )
----
----
wlfs_frame_rsp( (uint*)lfs_msg_ptr, length )
----

Any further pointers to proceed further??

Regards
Sahil

sahil.jammu · 07-09-2009, 08:55 AM

Hi,

Another thing i checked while comparing passing log of 32 bit and inconclusive log of 64 bit:-

1.
Few bits changed in 64 bit output,

32bit output:-
------------------
MER: ISI receiving message[Str]
(printing 25/25 bytes)
00 1D 8C 00 09 00 12 FF 32 00 D7 03 01 01 00 00 ????????2???????
00 00 72 00 08 01 00 00 00 ??r??????
ISI?IsiMessage
{
media 29,
receiverDevice 140,
senderDevice 0,
resourceId 9,
receiverObject 255,
senderObject 50,

64 bit output:-

MER: ISI receiving message[Str]
(printing 25/25 bytes)
00 1D 8C 00 09 00 12 FF 33 00 D7 03 01 01 00 00 ????????3???????
00 00 72 00 08 01 00 00 00 ??r??????
------------------

-----------------

2.
For 32 bit
--------
.........LFS -> CDSP FDD_FRAME_RSP : ScrCode: 10, SFN: 14e, Chan: PCCPCH, Id: 0
,0, CTFC: 56
(printing 51/51 bytes)
00 04 29 43 00 01 00 10 01 4E 01 00 00 00 00 38 ??)C?????N?????8
80 00 00 00 29 CE 07 17 8A 00 81 00 00 00 00 00 ????)???????????
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ????????????????
00 00 00 ???
------------
Value is not printed at all in case of 64 bit

3.
@32 bit
---------------
(printing 52/52 bytes)
00 04 29 43 00 01 00 10 01 C6 01 00 00 00 00 AD ??)C????????????
80 00 00 00 38 CE 03 8C 00 00 00 06 00 C0 20 40 ????8????????? @
20 08 1A 01 00 00 29 FF FF F8 00 00 00 00 00 00 ?????)?????????
00 00 00 00 ????
..LFS -> CDSP FDD_FRAME_RSP : ScrCode: 10, SFN: 1c8, Chan: PCCPCH, Id: 0,0, CTFC
---------------
Value is not printed at all for 64 bit.

johnsfine · 07-09-2009, 09:19 AM

Quote:

Originally Posted by sahil.jammu

I modified the code as suggested by u:-

But havent really find any success.

That was a bug in your code, but there is no reason to guess it was the only bug, and you never showed enough information to guess whether it was the bug responsible for the current symptom.

If you want to continue posting code in forums, you should learn how to use code tags to make the code readable.

As usual, I can't even guess what you are trying to ask or tell with some of the code you quoted:

Code:

wlfs_frame_rsp( (uint16*)lfs_msg_ptr, length )
----
----
wlfs_frame_rsp( (uint*)lfs_msg_ptr, length )
----

The (uint16*) and (uint*) in that code are casts of a type that is likely to be non portable when switching between a 32 bit and 64 bit architecture. So any construct like that might be the bug that is causing your current symptom, but out of context, it is impossible to say whether any particular one of those casts is wrong.
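
To illustrate why changing (uint16*) to (uint*) is not a neutral edit: the pointed-to type decides how many bytes each element covers, so ptr[1] refers to different bytes after the change. The sketch below assumes that uint16 in that code is a 16 bit unsigned type like uint16_t from <stdint.h> and that uint is unsigned int (neither typedef was shown). The function names are made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads the index-th 16 bit element from a raw message buffer in host
   byte order. memcpy avoids alignment and aliasing problems that a
   plain pointer cast can cause. */
uint16_t read_u16(const unsigned char *buf, size_t index)
{
    uint16_t v;
    memcpy(&v, buf + index * sizeof v, sizeof v);
    return v;
}

/* Byte distance between consecutive elements for each pointer type. */
size_t stride_u16(void)  { return sizeof(uint16_t); }      /* always 2 */
size_t stride_uint(void) { return sizeof(unsigned int); }  /* 4 on i386 and x86_64 Linux */
size_t stride_long(void) { return sizeof(long); }          /* 4 on i386, 8 on x86_64 Linux */
```

If a message is defined as a sequence of 16 bit words, walking it through an unsigned int pointer skips every second word. Types such as long (and pointers stored inside message structures) are the ones whose size really differs between the two builds.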