I was working on test cases which were passing successfully on 32 bit RHEL. Recently the system was upgraded to 64 bit. Some of the test cases now fail with a segmentation fault, and a core dump is generated.
It seems like pointers are behaving strangely. I am new to this and haven't looked into any core dump before. Can anyone provide pointers on what might have gone wrong? As per my understanding, a segmentation fault occurs when the program reads memory outside its allotted buffer space, which indicates a memory leak or invalid pointers leading to a corrupt memory address being read or written. Is there any specific difference in pointer behavior between a 32 bit OS and a 64 bit OS?
I used ddd to look into the core dump. I opened it and did a backtrace to understand the flow of the code.
At the crash line (Filename_c:492) the debugger shows this value:
(gdb) print phch_iter->next
$2 = (struct _WCDSP_PHCH_STATUS_STR *) 0x0
But not this one:
(gdb) print phch_last->next
Cannot access memory at address 0x44
One important detail that isn't clear from your post:
Did you run a 32 bit binary of your program under the 64 bit kernel? Or did you recompile your source code into a 64 bit binary and run that?
Quote:
Originally Posted by sahil.jammu
Seems like pointers are behaving strangely. I am new to this and haven't looked into any core dump.
The first thing one usually looks at in a core dump is the call stack (the list of which function called which down to the point of failure). If you know how your code is supposed to function, that is usually enough to spot the bug.
Quote:
program reads memory outside allotted memory buffer space
Technically true, but that is a misleading way to describe it.
Quote:
which indicates memory leak or invalid pointers leading to corrupt memory address to read and write.
Invalid pointer (not a memory leak).
But that is just a symptom. The trick is to find the cause of the invalid pointer.
Quote:
Any specific difference in pointers behavior in 32 bit OS and 64 Bit OS
Pointers do exactly what the source code tells them to do. The pointer is inside the application program. It is not influenced by the OS.
A 64 bit OS lets you run 64 bit programs. It doesn't require that you run 64 bit programs.
There are lots of ways that programmers may have accidentally or intentionally assumed that a pointer is 32 bits. Recompile for 64 bits and the assumptions are wrong and the code will crash.
There are fewer ways (but not zero) that running a 32 bit binary under a 64 bit kernel would violate similar bad assumptions made by the programmer.
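As a minimal sketch of the kind of 32 bit assumption meant here (not taken from the poster's code, which was never shown in full), consider legacy code that parks an address in a 32 bit integer field. The function names below are hypothetical, and the address is passed as a plain 64 bit number so the demonstration behaves the same on any platform.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration: an address squeezed through a 32 bit field.
 * On a 32 bit build nothing is lost; on a 64 bit build the upper half of
 * the address is silently discarded. */
static uint64_t through_32bit_slot(uint64_t address)
{
    uint32_t slot = (uint32_t)address;  /* upper 32 bits are dropped here */
    return (uint64_t)slot;              /* what the code later "reads back" */
}

/* Portable alternative for real pointers: uintptr_t is defined to be wide
 * enough to hold an object pointer and convert it back unchanged. */
static int round_trips_via_uintptr(const void *p)
{
    assert(sizeof(uintptr_t) >= sizeof(void *));
    uintptr_t wide = (uintptr_t)p;
    return (const void *)wide == p;
}
```

If a typical 64 bit user-space address such as 0x00007ffd12345678 goes through through_32bit_slot, only 0x12345678 comes back, and dereferencing that value would most likely fault. Storing the pointer in a uintptr_t (or simply keeping it as a pointer) avoids the truncation.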
Quote:
But not this one:
(gdb) print phch_last->next
Cannot access memory at address 0x44
I don't remember gdb print syntax in enough detail to be sure, but that all seems to mean:
phch_iter points to valid memory
phch_iter->next contains a 0 (meaning it does not point to a valid next object).
phch_last does not point to valid memory.
The address of phch_last->next is 0x44.
That last detail is strange, especially if you recompiled for 64 bit mode. In such situations, one would expect phch_last to be exactly zero (though checking that is better than guessing it), which would mean the field "next" is at offset 0x44 in the structure. That violates the usual packing rules for x86_64, where an 8 byte pointer normally sits at an offset that is a multiple of 8.
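To make the arithmetic concrete, here is a small sketch of how the address in a "Cannot access memory" message decomposes into the pointer's value plus the member's offset. The struct demo_status below is a hypothetical stand-in, since the real _WCDSP_PHCH_STATUS_STR definition was not posted, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the real structure, whose layout is unknown. */
struct demo_status {
    uint32_t fields[17];        /* 17 * 4 = 68 = 0x44 bytes of payload */
    struct demo_status *next;   /* placed at the next suitably aligned offset */
};

/* Address the debugger would try to read for base->next. Computed with
 * integer arithmetic so no invalid pointer is ever dereferenced. */
static uintptr_t address_of_next(uintptr_t base)
{
    assert(base <= UINTPTR_MAX - offsetof(struct demo_status, next));
    return base + offsetof(struct demo_status, next);
}

/* Returns 1 when the offset of next is a multiple of the pointer's
 * alignment, which default (unpacked) layout guarantees. */
static int next_is_naturally_aligned(void)
{
    return offsetof(struct demo_status, next)
           % _Alignof(struct demo_status *) == 0;
}
```

With this stand-in, a 32 bit build would be expected to put next at offset 0x44, while a default x86_64 build would be expected to pad it to 0x48. So a faulting address of 0x44 means either the pointer itself is a small nonzero value, or the structure is packed in some non-default way. Printing the pointer itself in gdb settles which.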
I have both environments available, 32 bit and 64 bit. The same test case compiled and executed on 32 bit RHEL: Passed
and similar test cases compiled and executed in the 64 bit environment: Failed (inconclusive).
I tracked down the place where the debugger complains about the code using the ddd backtrace, but I am not clear what the expected behavior is, since the passing scenario is the 32 bit test case (I am not sure how to check the values of the variables there, since a passing test case produces no core dump to analyse).
In crash line (Filename_c:492) debugger shows this value:
(gdb) print phch_iter->next
$2 = (struct _WCDSP_PHCH_STATUS_STR *) 0x0
But not this one:
(gdb) print phch_last->next
Cannot access memory at address 0x44
Thanks for providing clarifications in other areas.
One option you may not realize (but it isn't actually difficult): You can take the 32 bit binary compiled on the 32 bit RHEL and run it on the 64 bit RHEL.
Probably there will be a few dependencies missing at first.
Those are 32 bit .so files that can be installed on the 64 bit RHEL but by default aren't.
You could use the ldd command on your 32 bit binary to predict which extra .so files you need, or simply try to run it and look at the error messages.
With the right Yum commands, you can easily install the 32 bit versions of each of the required dependencies on your 64 bit RHEL. Those coexist perfectly with the 64 bit versions you already have. (Any 64 bit binary will automatically find and load the 64 bit .so files, while 32 bit binaries will find and load the 32 bit .so files).
Recompiling your source code to a 32 bit binary on the 64 bit RHEL is also possible, but harder. If you just need to run the 32 bit binary on 64 bit RHEL, it is easiest to compile on your 32 bit RHEL and copy the binary over.
Quote:
Originally Posted by sahil.jammu
similar test cases compiled and executed in 64 bit environment is Failed
As I said before, there are many ways a programmer might accidentally or intentionally assume that pointers are 32 bit.
You don't sound like you are ready for the debugging task needed to find those bugs in your source code.
If you really need to get the 64 bit compiled code to work, I can give some more advice. But if it is a large amount of source code, there are probably several such errors and the first (easy looking) error has you stuck.
That is why I suggest running the 32 bit binary on the 64 bit RHEL.
Quote:
In crash line (Filename_c:492) debugger shows this value:
(gdb) print phch_iter->next
$2 = (struct _WCDSP_PHCH_STATUS_STR *) 0x0
But not this one:
(gdb) print phch_last->next
Cannot access memory at address 0x44
I don't know what you intended to communicate by repeating that.
I thought I had explained why the debugger responds as it did to those two print commands.
It might help to tell us the value of phch_last.
It probably would help to quote the definition of the data structure that should be pointed to by phch_last. I expect it is the same data type _WCDSP_PHCH_STATUS_STR that is pointed to by its "next" pointer.
It might help to quote a few lines of code around the failure point.
I tried backtracing with DDD from the point of failure: phch_last->next = phch_iter->next;
Your analysis:-
-------
phch_iter points to valid memory
phch_iter->next contains a 0 (meaning it does not point to a valid next object).
phch_last does not point to valid memory.
The address of phch_last->next is 0x44.
-------
Values given by debugger:-
-----------------------
phch_last->next
(struct_WCDSP_PHCH_STATUS_ST ...ccess memory at address 0x44)
As you suggested, phch_iter->next contains a zero, which reflects that something must have gone wrong at a previous step. It could be the use of a wrong data type: the same thing works fine in 32 bit but not in 64 bit, so it could be something to do with the data type. Not sure, just guessing.
From this line of failure I used "up" in ddd (which selects and prints the stack frame that called this one) and got this result:
If i go to this file and check the location of the code:-
Code:-
-----
case FDD_FRAME_RSP:
wlfs_frame_rsp( (uint16*)lfs_msg_ptr, length );
break;
-----
Any input from your side?
From the frame where the failure occurs there is no "down" result, since that is where the segmentation fault is reported. There are multiple "up" results, which take us through the different functions that were called.
I came across this info:
The uint* is primarily meant to store integer values. Most operations that manipulate arrays without changing their elements are defined. (Examples are reshape, size, the logical and relational operators, subscripted assignment, and subscripted reference.)
You can define your own methods for uint* (as you can for any object) by placing the appropriately named method in an @uint* directory within a directory on your path.
Also:
Usually the double data type (in C/C++) and other data representations that require 64 bits work better and faster in a 64 bit environment; however, smaller data (Int32 or Int16 and even Char) will incur some penalty due to the longer space and instructions.
What is your input on mapping this information to our problem?
I don't see anything in your recent posts that helps in diagnosing the problem further.
Quote:
Originally Posted by sahil.jammu
As you suggested, phch_iter->next contains a zero, which reflects that something must have gone wrong at a previous step
phch_iter->next containing zero might or might not be a symptom of some earlier problem. It might be perfectly correct. You haven't shown enough code for me to estimate that.
phch_last containing 0x20 (I think that is what you said) would certainly be a symptom of some earlier problem. It cannot be correct.
phch_last contains 0x20 (yes, that's true; I checked the value using ddd).
As you mentioned, this would certainly be a symptom of some earlier problem and cannot be correct. Can you please provide some details on how you concluded this?
That is very bad code and it might be the bug causing your problem.
That code makes an assumption about the behavior of malloc that could easily change across architectures (such as i386 to x86_64) or even across versions of malloc.
The code frees the object pointed to by phch_iter, then it reads a member variable from that object. It is not correct to assume any part of the object is still valid after the object is freed.
The following code corrects the main problem in the above code, but it still is bad code:
In situations where the original code works the improved code would also work. In some situations where the original code breaks, this improved code would work.
But it still is keeping a pointer to a freed object, which hints there are related bugs in sections of the code you didn't show.
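For readers who want to see the general pattern, here is a minimal sketch of removing a node from a singly linked list without touching freed memory. This is not the poster's code or the corrected version referred to above. The struct node type and both function names are hypothetical. The points it illustrates are that the next pointer is read before the node is freed, and that no pointer to the freed node is kept afterwards.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list node; the thread's real structure was not shown. */
struct node {
    int id;
    struct node *next;
};

/* Unlinks and frees the first node whose id matches.
 * Returns the (possibly changed) head of the list. */
static struct node *remove_id(struct node *head, int id)
{
    struct node *last = NULL;   /* node before iter, or NULL at the head */
    struct node *iter = head;

    while (iter != NULL) {
        struct node *next = iter->next;  /* read BEFORE any free */
        if (iter->id == id) {
            if (last != NULL)
                last->next = next;       /* bypass the node being removed */
            else
                head = next;             /* removing the first node */
            free(iter);                  /* iter is not used after this */
            return head;
        }
        last = iter;                     /* last only ever names a live node */
        iter = next;
    }
    return head;                         /* id not found: list unchanged */
}

/* Small helper to build a list for experiments. */
static struct node *push_front(struct node *head, int id)
{
    struct node *n = malloc(sizeof *n);
    assert(n != NULL);  /* sketch only: real code should handle failure */
    n->id = id;
    n->next = head;
    return n;
}
```

Because last is only advanced past nodes that stay in the list, it can never end up pointing at freed memory, which is the situation the reply above warns about.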
Quote:
phch_last contains 0x20 (yes, that's true; I checked the value using ddd).
As you mentioned, this would certainly be a symptom of some earlier problem and cannot be correct. Can you please provide some details on how you concluded this?
That's pretty basic stuff. It cannot be correct for a pointer in x86_64 to contain the value 0x20. That isn't a NULL pointer, but it also can't point to a valid address.
If you don't know those basics, I don't know how you could expect to find and fix the bugs in your source code.
What about my earlier suggestion to use the 32 bit binaries on the 64 bit system?
Thanks for your suggestion. I was in the middle of another post, so I hadn't checked your reply. I posted some info about the data structures in my previous post.
I will modify this and test it.
The following code corrects the main problem in the above code, but it still is bad code:
Code:
Regarding your earlier suggestion to use the 32 bit binaries on the 64 bit system:
I can't try that, as my 32 bit machine is down with a bus error and an automount NFS issue. I expect the machine to be back in a running state by tomorrow.
Meanwhile I do have access to the 64 bit machine, and to the passing logs of the test case executed on the 32 bit machine.
That was a bug in your code, but there is no reason to guess it was the only bug, and you never showed enough information to guess whether it was the bug responsible for the current symptom.
If you want to continue posting code in forums, you should learn how to use code tags to make the code readable.
As usual, I can't even guess what you are trying to ask or tell with some of the code you quoted.
The (uint16*) and (uint*) in that code are casts of a type that is likely to be non portable when switching between a 32 bit and 64 bit architecture. So any construct like that might be the bug that is causing your current symptom, but out of context, it is impossible to say whether any particular one of those casts is wrong.