LinuxQuestions.org - Memory fault

- - Memory fault (https://www.linuxquestions.org/questions/programming-9/memory-fault-253497/)

My program exits with "memory fault". Can you tell me for what reasons this error occurs?

N.B.: I am running my code in ucLinux.

Your program is likely referencing data outside its valid address space.
Common reasons are:
- bogus array index
- NULL or uninitialized pointer
- too much recursion

Do you have a core file generated?

No, I don't have any core file. What is it, and how can I generate one?

Thanks in advance.

I think I remember seeing that error on a SPARC, where the software had a bug and performed an unaligned memory operation: it tried to read a 32-bit value from an address whose lower 2 bits were not 0.
This should not happen on Intel processors, though. They can perform those accesses, at the cost of a few extra clock cycles for the fetch.
What processor type is this error showing up on, if I may ask?

A core file is an image of a process's memory, generated automatically when a program ends after certain fatal exceptions, like the memory fault you observe.
Its purpose is to help with post-mortem debugging.
Maybe ucLinux isn't generating core files, or the core file can't be created because the process's current directory isn't writable at crash time.
The fact that ucLinux is intended for hardware lacking an MMU may be important when investigating this problem.
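
For reference, on a mainstream Linux system a process can raise its own core-file size limit before it crashes. This is only a minimal sketch, assuming the POSIX getrlimit/setrlimit calls are available on the target. The helper name enable_core_dumps is made up for illustration, and whether ucLinux on a given board writes core files at all is a separate question.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the hard limit.
 * Returns 0 on success, -1 if the limits could not be read or changed. */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;   /* the soft limit may not exceed the hard limit */
    return setrlimit(RLIMIT_CORE, &rl) == 0 ? 0 : -1;
}
```

Calling it once at the start of main() would be the natural place. From an interactive shell, the same limit is usually inspected and changed with ulimit -c.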

What is your program?
Did you write and compile it yourself?
Does the same problem happen with this program on other Linux systems?

ucLinux is not generating any core file. And you are right, the system doesn't have an MMU. I am writing video communication software and have a cross compiler for the system. The same problem doesn't occur on Red Hat Linux 9.0 or Fedora Core 2.

What is the target processor of your cross compiler? If it's an unaligned access, sometimes reported as a "bus fault" or "memory fault", then it will not occur on Intel but will on most other processors, such as Motorola and SPARC. MIPS hardware traps on it too, although Linux on MIPS normally fixes up the access in the kernel, so you may not see the fault there. So what kind of processor is it?

One possible reason for the different behaviour between ucLinux and mainstream Linux is that the former does not grow the process stack dynamically, which may lead to the memory fault you observe.
There is a build option to increase this stack size (see the flthdr command or the FLTFLAGS variable).

The processor is an frv400, made by Fujitsu of Japan. They have their own home-grown instruction set.

I'm not familiar with their processors, so compile and run this example program on your processor. If it causes a bus/memory fault, then that is what is going on. The program will fail if your processor is limited to aligned memory reads and writes, which is what I suspect your issue is. It works on Intel because those processors have additional logic to split an access that is not word-aligned into two fetches.

Code:

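#include <stdio.h>

// Eight known bytes; each pass reads a 32-bit word starting at buf+j.
char buf[8] = {0, 1, 2, 3, 4, 5, 6, 7};

int main(int argc, char *argv[]) {
    int j;
    unsigned long *p;

    // Assumes a 32-bit unsigned long, as on the processors discussed here.
    for (j = 0; j < 5; j++) {
        // Deliberately unaligned for j = 1..3 when buf itself is 4-byte aligned.
        p = (unsigned long *)(buf + j);
        printf("address=0x%08lX  value=0x%08lX\n", (unsigned long)p, *p);
    }
    return 0;
}

// Example Intel processor output
// address=0x08049494  value=0x03020100
// address=0x08049495  value=0x04030201
// address=0x08049496  value=0x05040302
// address=0x08049497  value=0x06050403
// address=0x08049498  value=0x07060504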

Dear randyding,

I ran your program under the name memalign. Here is the output:

$ memalign
address=0xC20CAAB4 value=0x00010203
Memory fault

It seems that your prediction is correct. Can you please explain a little more? How can I write my program to avoid this problem?

Farhan

In pseudo assembly, because I don't know your specific processor,
here is an example of loading a 32-bit quantity from memory
at addresses 0x12345-0x12348 into the accumulator.
mov ix,0x12345 ;load index reg with constant address pointer
mov a,[ix] ;load 32 bit accum from address pointer ix

On your processor an exception interrupt is generated
when you execute the mov a,[ix] because the two least
significant bits of ix are not 0. Note the 5 on the end of
the address: it ends in 01 binary. It must end in 00 binary
or the processor throws an exception.

This means you can only read 32-bit values from an address
that is divisible by 4, and 16-bit values from an address
that is divisible by 2 (remainder 0 in both cases).
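
The divisibility rule above can be written as a one-line check. This is a minimal sketch; the function name is_aligned is hypothetical, and it assumes the platform provides uintptr_t so a pointer can be treated as an integer.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the address p is a multiple of size
 * (4 for 32-bit reads, 2 for 16-bit reads), otherwise 0. */
int is_aligned(const void *p, size_t size)
{
    return ((uintptr_t)p % size) == 0;
}
```

With the address from the pseudo assembly, 0x12345 % 4 is 1, so the check reports that a 32-bit read there is not aligned.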

This is exactly why C compilers insert padding when you
declare a char variable and a long variable right next to
each other, for example as consecutive struct members. The
3 bytes between the char and the long are wasted space.
You can test this yourself by printing out the addresses
of the two variables and seeing that the char starts at
address X and the long starts at address X+4, even though
it would have fit at address X+1.
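
One way to see that padding without printing addresses is to ask the compiler for the member offset. This is a sketch only; the struct and function names are invented for illustration, and the exact offset depends on the target's alignment rules for long.

```c
#include <stddef.h>

/* Hypothetical struct: one char followed directly by a long. */
struct char_then_long {
    char c;
    long l;
};

/* Byte offset of l inside the struct. The char occupies offset 0,
 * so (offset - 1) is the number of padding bytes inserted after it. */
size_t long_member_offset(void)
{
    return offsetof(struct char_then_long, l);
}
```

On a target with a 4-byte long that must be 4-byte aligned, this would typically return 4, which corresponds to the 3 wasted bytes described above.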

Take a look at the example program I gave you. You will
see I am doing a funky pointer cast from char * to
unsigned long *. Normally this causes no problems
if you are careful, but if the char * points to
an address not divisible by 4 (which is perfectly
valid for a char *, BTW) then the unsigned long *
is not a valid pointer and will cause a memory/bus
fault when you use it. In my example program
the dereference *p at the end of the printf()
is the exact point where the exception is thrown,
because the pointer p is not divisible by 4 the
second time through the j for loop.

I hope you find your problem. Historically this
is a very difficult problem to find if you have
a lot of code (and not written by you) to go through.
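
As for how to write code that avoids the fault, one common approach is to copy the bytes into a properly aligned variable instead of casting the pointer. The following is a sketch under that assumption, not a fix for any specific code in this thread; the name read_u32_unaligned is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value in native byte order from any address, aligned or not. */
uint32_t read_u32_unaligned(const void *src)
{
    uint32_t value;

    /* memcpy works byte by byte as far as the language is concerned,
     * so src has no alignment requirement. */
    memcpy(&value, src, sizeof value);
    return value;
}
```

In the example program, calling read_u32_unaligned(buf + j) in place of the cast and dereference should avoid the alignment exception, although the value still depends on the processor's byte order.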

randyding,

When I changed the (unsigned long *) to (char *) the program ran properly. Why is that? It must be the case that for unaligned addresses two memory reads are performed. But who is doing it? The compiler?

I wouldn't say it ran properly after you made that change, because you altered the program logic at the same time. You changed the program so that it reads five 8-bit values instead of five 32-bit values from buf[]. Aside from the obvious big/little-endian difference, your output will look quite different from the Intel example output. So you removed one bug but added a second one to the code's logic by changing the pointer type. The program doesn't work the same as it did before, even though the memory fault no longer happens.

Like I said before, it is the processor causing the exception. It is a hardware exception and not the fault of the compiler or the OS. Try running the same program on Intel and it works just fine. That's because of the differences in the processor hardware.

Somewhere in your code the same sort of flaw in the program logic is lurking. It may not look exactly like the example program I showed you, but the flavor of the problem is the same: some form of pointer cast that changes the size of the referenced type, followed by pointer addition or the like.

I'm afraid it's time to roll up the sleeves and test the code function by function, half-splitting to isolate the problem to a specific function and then going into the logic from there.
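
While narrowing things down, it may help to put a check in front of each suspicious cast so the offending file and line get logged before the processor faults. This is only a sketch of such a debugging aid; check_alignment and CHECK_ALIGNED are invented names, not part of any library.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical debugging aid: report a pointer that is not a multiple of align.
 * Returns 1 when the pointer is fine, 0 (after logging to stderr) when it is not. */
int check_alignment(const void *p, unsigned align, const char *file, int line)
{
    if ((uintptr_t)p % align == 0)
        return 1;
    fprintf(stderr, "%s:%d: pointer %p is not %u-byte aligned\n",
            file, line, (void *)p, align);
    return 0;
}

/* Place before a cast site, e.g. CHECK_ALIGNED(ptr, 4) ahead of
 * dereferencing (unsigned long *)ptr on a target with a 32-bit long. */
#define CHECK_ALIGNED(p, align) check_alignment((p), (align), __FILE__, __LINE__)
```

The message goes to stderr, so it should still show up on the console even when no core file is produced.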

Russoue,
I'm not sure if it's supported on your architecture, but if you could use remote gdb on your program (gdbserver), that would greatly help you find the parts of your code where the errors occur.
As Randyding pointed out, your target CPU is big-endian, while your code works on the Intel architecture, which is little-endian. If your code isn't already taking this into account, you may want to fix that at the source level first, before doing any runtime debugging.
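
To make the byte order explicit at the source level, a value can be assembled from individual bytes, which also sidesteps the alignment issue. This is a minimal sketch; load_be32 and load_le32 are hypothetical helper names, and which of the two applies depends on the format of the data being read.

```c
#include <stdint.h>

/* Assemble a 32-bit value from 4 bytes stored most significant byte first. */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Assemble a 32-bit value from 4 bytes stored least significant byte first. */
uint32_t load_le32(const unsigned char *p)
{
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}
```

For the bytes 0,1,2,3 at the start of buf in the example program, load_be32 gives 0x00010203 (the value seen on the frv400) and load_le32 gives 0x03020100 (the value seen on Intel), on either processor.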