What I discovered about pointers when I was learning C

Posted 07-25-2019 at 06:25 AM by hazel
Updated 08-29-2020 at 07:51 AM by hazel

I first started fiddling about with C when I became a Linux user. I had done some previous hobby programming; like many people back in the 1970s, I started with BASIC, moved on to Fortran and then Pascal, which I particularly liked because of its logical structure.

In all these languages, a variable was simply a named entity that had a type (for example integer) and a value. You did not have to bother about where it was stored. The computer handled that for you. These languages were designed for people who were not computer experts. BASIC was designed at Dartmouth College so that students outside the sciences could write simple programs, Fortran was for scientists, and Pascal was meant to teach students good coding habits, in particular the avoidance of "spaghetti code".

By contrast, C is a programming language designed by and for programmers. It is very powerful (you can do almost anything in C) but it requires some awareness and understanding of the way computers actually work, where a program's data is stored and how it is retrieved. While Pascal sometimes uses pointers, C and its derivatives use them all the time, and I found them very heavy going at first.

To make matters worse, all the manuals are written by people who already know this stuff and assume that you do too, and only need to be told how to implement it. It seems to me that the best teacher for beginners is one like me who has only recently learned the subject herself.

Programs store data in at least three different places. Global and static variables live in a data area that is loaded along with the program code; the rest go on the stack or the heap. Physically there is no distinction; memory is memory. But when the Linux kernel is mapping out a new process's memory, separate areas are reserved for the stack and the heap. For example, I just looked at the memory map for a bash process I am running and found this:

Code:

01e68000-01ee5000         rw-p 00000000 00:00 0       [heap]
7ffe9ebe0000-7ffe9ec01000 rw-p 00000000 00:00 0       [stack]
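
You can see the same three areas from inside a program of your own. This is only a small sketch; the names global_counter and show_storage are made up for illustration, and the addresses printed will differ from one run to the next.

```c
#include <stdio.h>
#include <stdlib.h>

int global_counter = 1;   /* global: stored in the program's data area */

/* Prints one address from each storage area.
   Returns 1 if the three addresses are all different, 0 otherwise. */
int show_storage(void)
{
    int local = 2;                                  /* automatic: on the stack */
    int *heap_value = malloc(sizeof *heap_value);   /* requested from the heap */
    int distinct;

    if (heap_value == NULL)
        return 0;
    *heap_value = 3;

    printf("global: %p\n", (void *)&global_counter);
    printf("stack:  %p\n", (void *)&local);
    printf("heap:   %p\n", (void *)heap_value);

    distinct = (void *)&global_counter != (void *)&local
            && (void *)&local != (void *)heap_value
            && (void *)heap_value != (void *)&global_counter;

    free(heap_value);   /* heap space must be handed back explicitly */
    return distinct;
}
```

On a typical Linux system the stack address will be a much larger number than the other two, which matches the memory map above.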

Named variables (other than global or static ones) go on the stack. This is a very orderly place. It is called a stack because each function stacks its variables there on top of those of its caller. When the function launches, space is automatically allocated to each variable according to the storage requirements of its type, and automatically released again when the function exits. For this reason, stack variables are often called "automatic" variables.
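
A function that calls itself shows this stacking quite clearly. In the sketch below (sum_to is just an illustrative name), every call gets its own private copies of n and partial, which disappear when that call returns.

```c
#include <stdio.h>

/* Adds up the numbers 1..n.
   Each call has its own 'n' and 'partial' on the stack,
   stacked on top of those belonging to its caller. */
int sum_to(int n)
{
    int partial;

    if (n <= 0)
        return 0;

    partial = sum_to(n - 1);   /* the callee's variables live above ours */
    printf("n = %d lives at %p\n", n, (void *)&n);
    return partial + n;
}
```

Calling sum_to(4) returns 10 and prints four different addresses for n, one per call.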

Functions never have any problem in locating their stack variables. Each one sits at a fixed offset from the start of the function's own block of stack space, so the compiler can work out its address by simple arithmetic. Of course it is the programmer's job to ensure that when the variable is retrieved for use, it actually has a value stored in it!

But many kinds of data don't have a fixed size. Character strings and multi-field structures can be any size. Languages like Fortran store character strings in fixed-length arrays, which need to be long enough to accommodate the longest string that is likely to go into them. This is very artificial: arrays were originally devised to store assemblies of numbers for random access, whereas text can expand and contract and is meant to be accessed sequentially.

In C, variable-length objects of this kind go on the heap. As its name suggests, the heap has very little structure to it. The addresses of heap objects are stored in named pointers and that is how the program finds them. You can have pointers to stack objects too but this merely provides an alternative way of accessing them; for heap objects, it is the only way.
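
Here is a small sketch of both cases. The names double_it and demo_access are invented for the example. The pointer alias is just a second route to a stack variable that already has a name, whereas on_heap is the only route to the integer on the heap.

```c
#include <stdlib.h>

/* Doubles whatever integer the pointer refers to. */
void double_it(int *p)
{
    *p *= 2;
}

/* Returns 30 on success, -1 if the heap request fails. */
int demo_access(void)
{
    int on_stack = 5;
    int *alias = &on_stack;                   /* alternative way to reach on_stack */
    int *on_heap = malloc(sizeof *on_heap);   /* the only way to reach the heap int */
    int total;

    if (on_heap == NULL)
        return -1;
    *on_heap = 10;

    double_it(alias);     /* on_stack is now 10 */
    double_it(on_heap);   /* the heap int is now 20 */

    total = on_stack + *on_heap;
    free(on_heap);
    return total;
}
```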

What is a pointer? It's simply a named variable that contains a memory address. All the bytes in memory are numbered from bottom to top, so a memory address is just a byte number. On Linux a pointer occupies the same amount of space as a long integer (8 bytes on a 64-bit system, 4 on a 32-bit one), although this is not guaranteed on every platform. By convention, the value of a pointer (when requested) is always written out in hexadecimal notation. This value is assigned when you set the pointer to point to something. You do not actually need to know the value except for debugging sometimes.
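
If you are curious, you can print a pointer's value with the %p format of printf and check its size with sizeof. This is only a sketch; show_pointer is a made-up name and the address shown will vary.

```c
#include <stdio.h>

/* Prints the address held in a pointer and returns
   the number of bytes the pointer itself occupies. */
size_t show_pointer(void)
{
    int number = 42;
    int *p = &number;   /* p now holds the address of number */

    printf("p holds the address %p\n", (void *)p);
    printf("a pointer occupies %zu bytes, a long %zu bytes\n",
           sizeof p, sizeof(long));
    return sizeof p;
}
```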

Accessing the data is called dereferencing the pointer. To do this, the computer needs two pieces of information: it needs to know where to start and where to finish. The first requirement is satisfied by the address of the first byte of the data, which is stored in the pointer. The second is usually determined by the size of the type of data stored. Pointers have types just like any other variable. The type of a pointer is derived from the type of the data being pointed to: a pointer to an int, for example, has the type int *.
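
One way to convince yourself that the pointer's type carries the size is to measure how far a pointer moves when you add 1 to it. In this sketch (the function names are hypothetical) a pointer to double moves by sizeof(double) bytes, while a pointer to char moves by a single byte.

```c
#include <stddef.h>

/* Distance in bytes between p and p + 1 for a pointer to double. */
size_t step_of_double_pointer(void)
{
    double values[2] = { 1.5, 2.5 };
    double *p = values;

    /* Converting to char * lets us count individual bytes. */
    return (size_t)((char *)(p + 1) - (char *)p);
}

/* The same measurement for a pointer to char. */
size_t step_of_char_pointer(void)
{
    char letters[2] = { 'a', 'b' };
    char *p = letters;

    return (size_t)((p + 1) - p);
}
```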

If a structure of a certain type requires 40 bytes to accommodate it, then you must ask for 40 bytes to be reserved for it on the heap and put the address which is returned into your pointer. Whenever the pointer is dereferenced, the program will know where the structure ends (the address stored in the pointer plus 39).
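
As a sketch of that, here is an invented structure, struct record, which happens to take 40 bytes on a typical Linux system (32 for the name plus 8 for the balance). In practice you let sizeof work out the number rather than writing 40 yourself. The function record_create is hypothetical too.

```c
#include <stdlib.h>
#include <string.h>

struct record {
    char name[32];
    double balance;
};

/* Reserves heap space for one record, fills it in and returns its address.
   Returns NULL if no memory could be reserved. The caller must free() it. */
struct record *record_create(const char *name, double balance)
{
    struct record *r = malloc(sizeof *r);   /* ask for exactly one record's worth */

    if (r == NULL)
        return NULL;

    /* Copy at most 31 characters and make sure the name is terminated. */
    strncpy(r->name, name, sizeof r->name - 1);
    r->name[sizeof r->name - 1] = '\0';
    r->balance = balance;
    return r;
}
```

The caller stores the returned address in a pointer of type struct record * and reaches the fields with the -> operator, for example r->balance.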

For a structure that forms part of a library you are using, there will be a specific library function for creating such a structure and returning a pointer to it. Again, you put the returned address into the pointer variable you have previously declared. If it is an open-source library and you choose to look at the actual code for that function, you will see that at some point a buffer of a certain size is requested and then the appropriate data for the structure is written into it.
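
The standard C library itself works this way. The FILE structure is created for you by functions such as fopen() or tmpfile(), which return a pointer to it, and fclose() disposes of it again. A minimal sketch, in which roundtrip is a made-up name:

```c
#include <stdio.h>

/* Writes a line to a temporary file and reads it back into 'out'.
   Returns 0 on success, -1 if anything goes wrong. */
int roundtrip(char *out, int size)
{
    FILE *fp = tmpfile();   /* the library builds the FILE and returns its address */

    if (fp == NULL)
        return -1;

    fputs("hello\n", fp);
    rewind(fp);             /* go back to the start before reading */

    if (fgets(out, size, fp) == NULL) {
        fclose(fp);
        return -1;
    }

    fclose(fp);             /* the library releases the structure */
    return 0;
}
```

You never need to know how big a FILE is or what is inside it; you just keep the pointer and hand it back to the library's own functions.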

A pointer without a valid address in it should always be set to zero and is then called a NULL pointer. A pointer without a declared type is called a void pointer. Neither a null nor a void pointer can be dereferenced, because with a null pointer, the program wouldn't know where to start reading, and with a void pointer it wouldn't know where to stop. gcc will pick up attempts at the latter, since it's obvious from the code; the former can sometimes only be determined at run time, when dereferencing a NULL pointer causes a segmentation fault.
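
The usual defences look something like this sketch (value_or and read_int are hypothetical names): test a pointer against NULL before using it, and give a void pointer a proper type before dereferencing it.

```c
#include <stddef.h>

/* Returns the int that p points at, or 'fallback' when p is NULL. */
int value_or(const int *p, int fallback)
{
    return p != NULL ? *p : fallback;
}

/* A void pointer says where the data starts but not how big it is,
   so it is converted to a typed pointer before being dereferenced.
   The caller must be sure that it really does point at an int. */
int read_int(const void *anything)
{
    const int *as_int = anything;

    return *as_int;
}
```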

A somewhat different strategy has to be used for sizing strings. A pointer to a string will be of type char * or wchar_t *, depending on whether you are using single-byte characters (ISO-Latin coding, for example) or wide characters (4 bytes each with glibc on Linux, 2 bytes on some other platforms), but the string could contain any number of such characters. C therefore defines a string as a sequence of char ending in a NUL character ('\0') or of wchar_t ending in a wide NUL (L'\0'). Standard string functions will read only up to the first such character that they find. Even if you have reserved a larger buffer in anticipation of the string growing, any residual garbage that might be present beyond the string's terminating NUL will not be read.
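
A quick sketch of that last point, using a deliberately dirty buffer (visible_length is an invented name): the buffer is 32 bytes long and full of rubbish, but strlen() stops at the first NUL and reports 3.

```c
#include <string.h>

/* Returns the length that the string functions see, not the buffer size. */
size_t visible_length(void)
{
    char buffer[32];

    memset(buffer, 'x', sizeof buffer);   /* simulate residual garbage */

    buffer[0] = 'c';
    buffer[1] = 'a';
    buffer[2] = 't';
    buffer[3] = '\0';                     /* the terminating NUL */

    return strlen(buffer);                /* counts only up to the NUL */
}
```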

One point which cannot be too highly stressed is that declaring a pointer of a certain type does not by itself reserve any space for the data. That is a separate operation. For a string or a structure that you have defined in your program, you will need to use the memory allocation function malloc() or one of its relatives to create the buffer for you. These functions return an address and you put that into your pointer variable. If the buffer is for a string, you can then use one of the libc string copying functions to put the actual text into it, but you cannot copy anything to a pointer that does not point at anything.
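
Put together, the two separate steps look like this sketch, where copy_text is a hypothetical helper. Note the extra byte requested for the terminating NUL.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a freshly allocated copy of 'text', or NULL if no memory
   could be reserved. The caller must free() the result. */
char *copy_text(const char *text)
{
    char *copy;                       /* declared, but points nowhere yet */

    copy = malloc(strlen(text) + 1);  /* step 1: reserve the buffer (+1 for NUL) */
    if (copy == NULL)
        return NULL;

    strcpy(copy, text);               /* step 2: only now is it safe to copy */
    return copy;
}
```

Leaving out the malloc() line and calling strcpy() straight away would write to whatever address happened to be in copy, which is exactly the mistake described above.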

Posted in Programming


Comments

Excellent post, however one small error.

Quote:

It occupies four bytes just like any other long integer

Pointers take 8 bytes on a 64-bit platform. sizeof(long) also has a tendency to change depending on platform, so that probably isn't the best way to describe the length.

Another minor point: you can TRY to dereference a NULL pointer. Unlike with a void *, the compiler won't throw an error, but the kernel will segfault your program at runtime when you try to access memory address 0. I suspect if you tried the same on DOS it would return a value.

P.S. I also loved Turbo Pascal back in the day.

Posted 07-25-2019 at 07:36 AM by GazL

Updated 07-25-2019 at 07:38 AM by GazL

Thanks. I have incorporated both corrections.

Posted 07-25-2019 at 08:57 AM by hazel

I must totally agree with GazL; absolutely brilliant blog post Hazel! That makes more sense than probably anything I've read about pointers and alike so far! I've even bookmarked this blog post for future reference! And it's the first blog I've ever rated here too, and I rated it five stars! Excellent work Hazel! You are very good at explaining things, you really are!

Excellent work, well worth five stars! Nice one Hazel!!!

Posted 07-29-2019 at 10:42 AM by jsbjsb001

Updated 07-29-2019 at 11:06 AM by jsbjsb001 (typo)