C string as an array of chars and as a pointer to char

Alien_Hominid · 05-13-2009, 01:49 AM

Please look at the comments

Code:

/*
 * TEST CASE TO CHECK DIFFERENCES BETWEEN STRING AS
 * AN ARRAY OF CHARS AND AS A POINTER TO CHAR
 */

#include  <stdio.h>
//#include  <string.h>

int main(void)
{
	char a[] = "foo";
	char *b  = "bar";
	char * const c = "changeme";

	printf("a - %s %c %p\n", a, *a, a);
	printf("b - %s %c %p\n", b, *b, b);
	//printf("c - %s %c %p\n", c, *c, c);

	//c = "ok";   //compiler error
	//*c = 's';   //compiles, but segfaults when executing (why?)
	//c[1] = 't'; //compiles, but segfaults when executing (why?)
	
	b = "qwe"; //b lost previous address

	/*
	 * THESE DO NOT WORK
	 */
	//b[] = "zxc"; //compiler error
	//a = "dfg";   //compiler error
	//a[] = "rty"; //compiler error

	*a = "cvb"; //have no idea, what a hell it does (after xxd it seems it's writing MSB or LSB of &"cvb")

	printf("a - %s %c %p\n", a, *a, a);
	printf("b - %s %c %p\n", b, *b, b);
	
	/*
	 * THESE DO NOT WORK
	 */
	//b[0] = 't';  //compiles, but segfaults when executing (why?)
	//*b = 'b';    //compiles, but segfaults when executing (why?)

	b = a;

	printf("b - %s %c %p\n", b, *b, b);

	b[2] = 't';
	*(b+1) = 'b';
	*(++b) = 'n'; 
	--b;

	printf("b - %s %c %p\n", b, *b, b);

	a[0] = 'z';
	*(a+1) = 'e';
	//*(++a) = 'r'; //compiler error (expected)

	printf("b - %s %c %p\n", a, *a, a);

	return 0;
}

Why do I get such a strange output?

Code:

a - foo f 0xbff204b8
b - bar b 0x80485e0
a - 
oo 
 0xbff204b8
b - qwe q 0x8048609
b - 
oo 
 0xbff204b8
b - 
nt 
 0xbff204b8
b - zet z 0xbff204b8

Tested using gcc (GCC) 4.3.3.
Please elaborate. I would also like links that explain these discrepancies in depth.

taylor_venable · 05-13-2009, 05:56 AM

When you create a string using a literal and assign it to a char *, the actual data typically goes into a read-only section of the binary (such as .rodata on Linux), so modifying it is erroneous. However, if you declare it as a char array, it's more like saying:

Code:

char a[] = {'f', 'o', 'o', '\0'};

Where it's perfectly valid to change the array members. Observe this example:

Code:

#include <stdio.h>

int main(int argc, char **argv) {
        char *s = "dragonforce";
        printf("Address(s)   = 0x%08X\n", &s);
        printf("Value(s)     = %s\n", s);
        s[7] = 'a';
        printf("New Value(s) = %s\n", s);
        return 0;
}

And here's a debugging session:

Code:

(gdb) break 7
Breakpoint 1 at 0x1c000722: file test.c, line 7.
(gdb) run
Starting program: /home/taylor/test 
Address(s)   = 0xCFBF015C
Value(s)     = dragonforce

Breakpoint 1, main (argc=1, argv=0xcfbf01dc) at test.c:7
7               s[7] = 'a';
(gdb) print &s[7]
$1 = 0x3c000008 "orce"
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x1c000728 in main (argc=1, argv=0xcfbf01dc) at test.c:7
7               s[7] = 'a';
(gdb) The program is running.  Exit anyway? (y or n) y

Notice that the string data (around 0x3c000000) is located far away from s itself (0xCFBF015C): s lives on the stack, while the literal lives in read-only memory. This covers both the block where you assign to various parts of c and the block where you assign to various parts of b. Also, check this document out: http://www.lysator.liu.se/c/c-faq/c-2.html
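
A minimal sketch of that difference, not taken from the original posts. The helper name array_copy_is_independent is made up for illustration. It writes into an array initialized from "foo" and checks that a separate pointer to a "foo" literal still reads the original text.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: writes into an array that was initialized from a
 * literal, then checks that a pointer to literal text is unaffected.
 * Returns 1 if the array held a private, writable copy.
 */
int array_copy_is_independent(void)
{
    const char *literal = "foo";  /* points at the literal itself; read only */
    char copy[] = "foo";          /* writable storage: 'f', 'o', 'o', '\0' */

    assert(sizeof copy == 4);     /* the terminator is part of the array */

    copy[0] = 'z';                /* fine: modifies the array, not the literal */

    return strcmp(copy, "zoo") == 0 && strcmp(literal, "foo") == 0;
}
```

Calling array_copy_is_independent() should return 1, since the write goes to the array's own storage and never touches the literal.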

In the middle block:

Code:

b[] = "zxc";  /* invalid syntax */
a = "dfg";    /* an array is not assignable; only its elements are */
a[] = "rty";  /* invalid syntax */

Alien_Hominid · 05-13-2009, 07:27 AM

Ok, great explanation, thanks (also for the link).
I'm not so worried about the cases where the compiler produces errors (though these are interesting) as about those where the error goes unnoticed:

Code:

        *a = "cvb"; //have no idea, what a hell it does (after xxd it seems it's writing MSB or LSB of &"cvb")
	 /*
	 * THESE DO NOT WORK
	 */
	//b[0] = 't';  //compiles, but segfaults when executing (why?)
	//*b = 'b';    //compiles, but segfaults when executing (why?)

johnsfine · 05-13-2009, 07:52 AM

Quote:

Originally Posted by Alien_Hominid

*a = "cvb"; //have no idea, what a hell it does (after xxd it seems it's writing MSB or LSB of &"cvb")

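I don't understand "after xxd". But otherwise, you are correct. That statement overwrites the first character of a (the 'f') with the least significant byte of the address of "cvb". gcc accepts the implicit pointer-to-integer conversion with a warning ("assignment makes integer from pointer without a cast") rather than an error.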
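
A rough sketch of what that statement amounts to when the conversion is written out explicitly. The function names are made up for illustration, and the sketch assumes uintptr_t from <stdint.h> is available and that the compiler keeps the low-order bits when narrowing, as gcc does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Low-order byte of an address that has been converted to an integer. */
unsigned char low_byte_of(uintptr_t addr)
{
    return (unsigned char)(addr & 0xFFu);
}

/* Explicit version of `*a = "cvb";`: stores one byte of the pointer value. */
void store_low_byte_of_address(char *dst, const char *src)
{
    assert(dst != NULL);
    *dst = (char)low_byte_of((uintptr_t)(const void *)src);
}

/* Returns 1 if the first character of the array now equals that byte. */
int demo_overwrite(void)
{
    char a[] = "foo";
    const char *lit = "cvb";

    store_low_byte_of_address(a, lit);
    return (unsigned char)a[0] == low_byte_of((uintptr_t)(const void *)lit);
}
```

For a hypothetical address such as 0x0804860a, low_byte_of yields 0x0a. A control character in that position would account for the broken lines in the posted output, although the real address of "cvb" in that build is not known.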

Quote:

//b[0] = 't'; //compiles, but segfaults when executing (why?)
//*b = 'b'; //compiles, but segfaults when executing (why?)

It is an original design flaw in the C language that you can use a char* to point to quoted text, rather than needing a char const*.

The contents of quoted text must not be modified. There may (or may not) be run time enforcement (the segfault) for the rule that quoted text must not be modified. But either way it is a bug to modify quoted text.

Code:

	char a[] = "foo";

That allocates a char[4] buffer on the stack and copies {'f', 'o', 'o', 0} into that buffer.

Code:

	char *b  = "bar";

Makes a pointer (which can be changed) to text, which cannot be changed, but the compiler is effectively told to ignore the fact that the text cannot be changed.

Code:

	char * const c = "changeme";

Makes a pointer which cannot be changed to text, which the compiler is told to pretend can be changed.

Code:

c = "ok";   //compiler error

Try to change a pointer which you declared cannot be changed.

Code:

*c = 's';   //compiles, but segfaults when executing (why?)

or

Code:

c[1] = 't'; //compiles, but segfaults when executing (why?)

Try to change text that you declared as being changeable but it isn't.
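
The two kinds of const can be sketched side by side. This is an illustration with made-up names (choose_label, set_first, demo_set_first), not code from the thread. The commented-out lines are the ones a compiler would reject.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer to const char: may be reseated, but the text is read only. */
const char *choose_label(int use_first)
{
    const char *label = "bar";

    if (!use_first)
        label = "qwe";           /* reseating the pointer is fine */
    /* label[0] = 't'; */        /* rejected at compile time */
    return label;
}

/* Const pointer to writable char: fixed target, but the text can change. */
void set_first(char *const fixed, char ch)
{
    assert(fixed != NULL);
    /* fixed = NULL; */          /* rejected at compile time */
    fixed[0] = ch;
}

/* Uses a real array as the target, so the write is legitimate. */
char demo_set_first(void)
{
    char buf[] = "changeme";

    set_first(buf, 's');
    return buf[0];
}
```

Declaring pointers to literals as const char * moves the mistake from run time (a possible segfault) to compile time.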

Code:

	b = "qwe"; //b lost previous address

Change a pointer. No problem.

Code:

b[] = "zxc"; //compiler error

C has support for that kind of copy only on the line defining a char array, not as a later executable action.
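
After the definition, the copy has to be done explicitly. A minimal sketch, assuming a hypothetical helper named replace_text that refuses to overflow the destination:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copies src, terminator included, into dst if it fits.
 * Returns 0 on success and -1 if dst is too small.
 */
int replace_text(char *dst, size_t dst_size, const char *src)
{
    size_t needed;

    assert(dst != NULL && src != NULL);
    needed = strlen(src) + 1;
    if (needed > dst_size)
        return -1;
    memcpy(dst, src, needed);
    return 0;
}

/* Returns 1 if the short copy succeeded and the long one was refused. */
int demo_replace(void)
{
    char a[] = "foo";            /* room for 3 characters plus terminator */

    if (replace_text(a, sizeof a, "rty") != 0)
        return 0;
    if (replace_text(a, sizeof a, "too long") == 0)
        return 0;
    return strcmp(a, "rty") == 0;
}
```

Plain strcpy(a, "rty") would also work here, but it does no size check, so the caller has to guarantee that the text fits.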

Code:

a = "dfg";   //compiler error

a is an array, and its name behaves as an address, not as a pointer variable. An address relates to a pointer the same way a number relates to an int variable. Consider

Code:

int x=5;
int u=x;  // Can use x (an int variable) the way we might use a number
int v=7;  // Can use 7 (a number) as a number
x = 4;    // Can change an int variable to have a new value.
7 = 4;    // Cannot change a number to have a new value.

The above is obvious and doesn't confuse any beginners. But the corresponding similarity/difference between an address (such as a in your code) and a pointer (such as b) confuses most beginners.
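
The same distinction can be checked in code. This is only a sketch, and the function name array_vs_pointer is invented for the example:

```c
#include <assert.h>

/* Returns 1 if the array/pointer differences described above hold. */
int array_vs_pointer(void)
{
    char a[] = "foo";
    char *b = a;
    int ok = 1;

    assert(sizeof a == 4);                 /* sizeof sees the whole array */

    ok = ok && ((void *)a == (void *)&a);  /* the array IS its storage */
    ok = ok && ((void *)&b != (void *)a);  /* b is a separate object holding an address */

    b = a + 1;                             /* a pointer can be reseated */
    ok = ok && (*b == 'o');
    /* a = a + 1; */                       /* an array name cannot: compile error */

    return ok;
}
```

The array name works like the constant 7 in the int example, and b works like the variable x.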

Alien_Hominid · 05-13-2009, 10:58 AM

I xxd'ed output to check values of those bytes.

Quote:

Originally Posted by johnsfine

The contents of quoted text must not be modified. There may (or may not) be run time enforcement (the segfault) for the rule that quoted text must not be modified. But either way it is a bug to modify quoted text.

Shouldn't the compiler check and produce an error for all 3 previous cases which compile but either later segfault or are, imho, useless (allowing the LSB of an address to be placed into memory pointed to by an array name)?

A pointer holds an address the same way as an array's name points to its location. The only difference, it seems, is that they point to different memory locations, therefore one can be changed and the other can't. Consequently, the question arises whether this behaviour is an inherent C problem (not defined in the C standard) or some sort of problem in the compiler allowing things which shouldn't be allowed.

EDIT: Had removed false assumptions before anyone responded.

PTrenholme · 05-13-2009, 11:59 AM

First, a caveat: When I first started programming (in the middle of the last century) the only programming language available was assembly. (Well, I did some "programming" by moving wires on "programming boards," but not very often.) So I may be prejudiced by my experience during my formative years.

With that caveat, I think a lot of the "pointer / value" confusion some people seem to have might be reduced if they took the time to learn at least the basics of assembly language.

Anyhow, it should be easy to remember that a "pointer," p, refers to a specific location in your computer's RAM, and the "value," *p, refers to whatever is stored in RAM at that location. (And, of course, a "reference," &p, to a value is the address of the RAM where the value is stored.)

Anyhow, that's my two cents for the above discussion.

Alien_Hominid · 05-13-2009, 12:04 PM

I have some basics in i386 assembly (therefore I would like to be able to modify all memory made available for the program). The thing that confused me is that one is allowed to modify memory values from the compiler's standpoint (why?) but not in reality (segfault). Anyway, thanks costs nothing.

johnsfine · 05-13-2009, 12:27 PM

Quote:

Originally Posted by Alien_Hominid

Shouldn't the compiler check and produce an error for all 3 previous cases which compile but either later segfault or are, imho, useless

Once a flaw in language design has been in place for many years, it is very hard for the compiler to usefully improve the situation. Consider the following code (from your own example):

Code:

	char a[] = "foo";
	char *b  = "bar";
. . .
	b = a;
. . .
	b[2] = 't';

b starts out pointing to text that must not be changed, with the declaration telling the compiler that b points to text that can be changed.

Later b is changed to point to text that can be changed. Note that it isn't possible to change the declaration of b there, only where it points.

Finally b is used to modify part of the text it points to.

All together, those steps are correct and sequences like that happen in many correct programs. A single pointer variable might be:
1) Used by sections of the code that don't modify the contents
2) Set in some places to text that can't be modified
3) In other places set to text that can be modified and then actually modified.

Sections 2 and 3 must obviously be disjoint enough that they don't trip over each other, but each might be so well connected to 1 that there is no clean place to make a different declaration for the modifiable text vs. non modifiable. Most of us would consider that combination at least unfortunate if not absolutely bad style. But it still happens in enough old C code to be a problem for a compiler rejecting the assignment of a quoted string to a char*.

PTrenholme · 05-13-2009, 03:10 PM

Quote:

Originally Posted by Alien_Hominid

I have some basics in i386 assembly (therefore I would like to be able to modify all memory made available for the program). The thing that confused me is that one is allowed to modify memory values from the compiler's standpoint (why?) but not in reality (segfault). Anyway, thanks costs nothing.

Ah, well, that gets into the issue of "protected" and "unprotected" memory, and (as johnsfine mentioned) allocation of memory in the stack.

You can, in fact, modify the contents of all unprotected memory allocated by your program. But static strings are allocated in protected (and shared) memory, and, therefore, can't be modified by your program. (The point is that many programs declare the same strings and constants in different places, and the actual physical size of the program can be reduced by reusing those definitions. But this optimisation fails if the constant or string can be changed.) When I programmed in "B" (the precursor to "C"), I needed to be very cautious making assignments since B had no data types, and all RAM was modifiable by any program. (For amusement we liked to write self-modifying programs, where execution of the code resulted in a different program being run. That sort of thing is fine for a single-user system, but not so "cool" when someone else is trying to use the hardware to get some "real work" done.)

So the current use of "segments," some of which are static and some modifiable, is a vast improvement over the "have at it" days of yore.

Bottom line: Some program data is placed in read-only segments, and an attempt to modify the contents of such a segment causes the "seg fault."

So, to reiterate, if you want to be able to change values in RAM, those values must be declared in such a way that they are located in an unprotected memory segment. One way to do that (in C) is to explicitly reserve dynamic (i.e., modifiable) RAM for the value with the malloc - or similar - function. For numeric values, a simple <type> name; suffices, but arrays - especially dynamically sized arrays - need more work.
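
As a sketch of the malloc route, here is the earlier "dragonforce" write done on a heap copy instead of on the literal. The helper names copy_to_heap and demo_heap_copy are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a writable heap copy of src, or NULL if allocation fails.
 * The caller is responsible for calling free() on the result. */
char *copy_to_heap(const char *src)
{
    size_t n;
    char *p;

    assert(src != NULL);
    n = strlen(src) + 1;         /* include the terminator */
    p = malloc(n);
    if (p != NULL)
        memcpy(p, src, n);
    return p;
}

/* Returns 1 if the heap copy could be modified as expected. */
int demo_heap_copy(void)
{
    char *s = copy_to_heap("dragonforce");
    int ok;

    if (s == NULL)
        return 0;
    s[7] = 'a';                  /* legal: the heap block is writable */
    ok = strcmp(s, "dragonfarce") == 0;
    free(s);
    return ok;
}
```

The literal itself is never written to. Only the heap block changes, so there is nothing for the memory protection to object to.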

taylor_venable · 05-13-2009, 04:01 PM

You can't check it in the compiler because the compiler simply doesn't have all the information required. Note the example of johnsfine above. How could the compiler know if the string you're assigning into is declared extern? It can't, thus it doesn't check. Separate compilation FTW.

johnsfine · 05-13-2009, 04:49 PM

Quote:

Originally Posted by PTrenholme

static strings are allocated in protected (and shared) memory,

Quoted strings usually are allocated in protected shareable memory.

Memory protection is managed at page granularity (4K bytes on x86). I don't think the linker is required to waste memory up to the next 4K byte boundary when the protection requirements change for the next link time allocation. So each 4K byte block must have the least protection of anything allocated in that block. So I think a quoted string might be allocated in the same 4K byte block with a compile-time initialized writable global variable, in which case writing to that text would not seg fault.

Obviously you shouldn't overwrite quoted text and you shouldn't be surprised when doing so seg faults. But unless you have taken more specific control of these link time issues, you shouldn't rely on that seg fault.

theNbomr · 05-14-2009, 11:59 AM

Quote:

Originally Posted by PTrenholme

With that caveat, I think a lot of the "pointer / value" confusion some people seem to have might be reduced if they took the time to learn at least the basics of assembly language.

Couldn't agree more...

With respect to protected vs. unprotected memory and storage of literal strings there, one should consider also that C can be used to produce code which runs from various forms of Read-Only-Memory. There, the literal strings are electronically immutable, so trying to write to memory that is mapped as ROM/PROM/EPROM/EEPROM may fail (or not, perhaps) in various ways. Thinking about the situation in these terms can help clarify the reasons for the behavior of the compiler and the runtime code.
--- rod.

Alien_Hominid · 05-14-2009, 05:08 PM

Then there should be some switch in gcc to tell where to place literal strings.
http://www.lysator.liu.se/c/c-faq/c-17.html#17-20

osor · 05-15-2009, 05:01 PM

Quote:

Originally Posted by Alien_Hominid

Then there should be some switch in gcc to tell where to place literal strings.
http://www.lysator.liu.se/c/c-faq/c-17.html#17-20

GCC had (until the 4.x branch) a flag -fwritable-strings, which allowed backwards compatibility with K&R C (which didn’t specifically forbid writing to string literals). There is currently the warning flag -Wwrite-strings, which emits a warning for such uses.

Additionally, there are some architectures which are targets for gcc which don’t have a read-only data segment, and on which funny things can happen.

Alien_Hominid · 05-17-2009, 12:35 PM

Thanks too.