[SOLVED] request C source code review: rmw

Andy Alt · 05-05-2013, 11:31 AM

I was reading the thread about Becoming a better programmer. and thought I'd see if anyone's interesting in reviewing the C source code from my latest release of rmw.
http://rmw.sf.net/

ntubski · 05-05-2013, 12:46 PM

rmw.c:66

Code:

// shorten PATH_MAX to two characters
#define MP PATH_MAX + 1

If you use a #define this way, you should put parenthesis around the definition, otherwise you will get unexpected results (eg MP*2 == PATH_MAX + 2). Using a const variable, or an enum definition are nicer ways to fix this:

Code:

#define MP (PATH_MAX + 1)
const int MP = PATH_MAX + 1;
enum { MP = PATH_MAX + 1 };

rmw.c:100

Code:

typedef unsigned short int LooP;
// even a short int is overkill sometimes.
// using unsigned char for loops less than 256
typedef unsigned char smallLooP;

Just use an int sized counter for loops, this isn't saving anything. The loop counter will go in a register which will be int sized (or bigger), or it will go on the stack which typically has to be aligned so small ints will be padded to int size (or bigger).

rmw.c:126

Code:

int
main(int argc, char *argv[])
{

#include "main.c"

}

Don't use the preprocessor like this; only #include h files, only put declarations in h files.

func02.c:3

Code:

/* from time.h */
char *
strptime(__const char *__restrict __s, __const char *__restrict __fmt,
struct tm *__tp);

Why not just #include time.h?

funcs02.c:45

Code:

/* Before copying or catting strings, make sure str1 won't exceed
* PATH_MAX + 1
* */
bool
buf_overflow_check(char *s1, const char *s2, bool mode)

buf_overflow_check appears to perform the copy or concat operation, its name should reflect that. Right now it sounds like it only does the check.

Speaking of names, you aren't very consistent with under_score_names vs camelCaseNames (I confess I'm often guilty of this myself). Also, the file name should be better than "funcs02" (is there a distinction between the functions in "funcs" and "funcs02"?).

Andy Alt · 05-05-2013, 04:37 PM

I'm making those changes you suggested; they all make sense to me. But I have a question about the includes.

Should I rename funcs.c, funcs2.c, and str_funcs.c so they have an .h extension? Or was your point mostly not to #include main.c the way I did?

I don't know what you meant when you said to only put declarations in .h files. Do you mean function definitions? If so that answers my first question.

And I'll give them more descriptive file names and group the functions better, as you hinted at.

ntubski · 05-05-2013, 05:25 PM

Oh, I didn't notice that you #included the other .c files as well, you shouldn't do that either. The #include of main.c is kind of worse than other ones because it splits the function prototype and braces from its body, but #including c files is not a good of way of organizing a multi file program.

Code:

// This is a function declaration
int a_function(int arg, int another_arg);

// This is a function definition
int a_function(int arg, int another_arg)
{
    // some code
}

You put definitions in c files, and compile them to object files (foo.c --> foo.o). Object files contain compiled forms of the definitions, you then link all the object files together to form a program (one of the object files must have a definition of main() for this to work).

Since you are using automake, you can do this by adding all the c files to the rmw_SOURCES variable. Automake will then generate the make rules to generate all the object files and link them together.

You should declare a function before using (defining a function also declares it). Instead of repeating the declaration of a function in every c file that uses it, you should put declarations of every function defined in foo.c into foo.h, and then #include foo.h in those c files which use functions from foo.c.

For example, str_funcs.h:

Code:

#ifndef str_funcs_h
#define str_funch_h

int
trim(char s[]);

void
cut(char c, char *str);

bool
change_HOME(char *t);

int
trim_slash(char s[]);

void
truncate_str(char *str, short len);

#endif//str_funcs_h

The #ifdef stuff avoids declaring things multiple times even if the h file is #included multiple times (which can happen when h files #include other h files), it doesn't matter for function declaration, but it does become an issue for enums and #defines.

johnsfine · 05-06-2013, 06:51 AM

Quote:

Originally Posted by ntubski

Just use an int sized counter for loops, this isn't saving anything.

Good advice, but an understatement. In fact, using a short or char for a loop index is usually costly compared to using an int.

To help squeeze a little more performance from portable code, I like to have a typedef for my index type:

typedef unsigned int index_typ;

If you want portability without grabbing the last drop of performance, define that as size_t, rather than unsigned int. See below).

I use the index type mainly for loop indexes where I know the value is non negative and "reasonable size".

My definition of "reasonable size" is capped by the maximum number of elements in one layer of a container. I work mainly in 64 bit, so a program may use much more than 4GB of data memory. It may even use over 4GB in a container. But (so far) nothing I have done comes close to needing 4G elements at one level of a container. 500M complex numbers is about the biggest data structure I ever touch, which is 8GB but only 500M elements. No one data structure is a large fraction of total ram in anything I work on, so a problem with one container of 500M complex numbers has enough other data that I need to borrow the highest physical ram computer to which I have occasional access in order to run that test.

In x86-64, the most efficient index type is unsigned int (which is 32 bit). Signed int may be more or less efficient as an index than a 64-bit integer type, but each is less efficient than unsigned 32 bit.

In 32-bit x86, signed or unsigned 32 bit int are equally the most efficient index type. But I like the code clarity of indicating that the values of an index are not intended to ever be negative.

Someday, I will need to modify some of my code to support over 4G elements. Using a typedef for the index type lets me do that in one place, rather than changes all over the code. I can just redefine the index type to be size_t instead of unsigned int.

The less common case where an index variable needs to be able to go negative, obviously should be distinguished in the source code by not using index_typ. Probably I should have another typedef for those, but it hasn't come up often enough to make me get organized about it.

johnsfine · 05-06-2013, 07:18 AM

I don't know whether you are interested in making your code faster. Most experts claim you should not worry about performance until you have a problem and then only look where a profiler indicates hot spots.

I prefer to pay attention to performance all the time and even more in basic functions that might get used all over.

I have seen too many projects where terrible performance is distributed in so many places that profiling later is not effective.

Here is a simple example where thinking about simplicity and efficiency yields faster executing code that, in my opinion, is more obvious and maintainable than the original.

You had:

Code:

  n = strlen (str);
  for (i = 0; i < n - inc; i++)
  {
    str[i] = str[i + inc];
  }
  
  str[n - inc] = '\0';

I suggest

Code:

  n = strlen (str+inc);
  for (i = 0; i <= n; i++)
  {
    str[i] = str[i + inc];
  }

In context where you wrote that, it should be obvious why strlen(str+inc) is safe and meaningful.

It would be even better to use memcpy instead of the loop, so my example is intended more to stimulate a way of thinking about such code, rather than give a final version for that function.

It might better and more readable to just use strcpy for the whole thing:

Code:

strcpy(str, str+inc);

Andy Alt · 05-06-2013, 11:57 AM

Quote:

Originally Posted by johnsfine

I don't know whether you are interested in making your code faster.

Yes, I try to keep efficiency in mind when I code.

Quote:

It might better and more readable to just use strcpy for the whole thing:

Code:

strcpy(str, str+inc);

Getting the most out of pointers and pointer arithmetic is something I've not achieved yet, but as I understand this, basically str+inc is a pointer to str[0 + inc] ?

I've made the change, thanks.

Regarding loops, are you saying that generally speaking, it's more efficient to use unsigned ints when negatives aren't required?

Quote:

Originally Posted by ntubski

You put definitions in c files, and compile them to object files (foo.c --> foo.o). Object files contain compiled forms of the definitions, you then link all the object files together to form a program (one of the object files must have a definition of main() for this to work).

Since you are using automake, you can do this by adding all the c files to the rmw_SOURCES variable. Automake will then generate the make rules to generate all the object files and link them together.

You should declare a function before using (defining a function also declares it). Instead of repeating the declaration of a function in every c file that uses it, you should put declarations of every function defined in foo.c into foo.h, and then #include foo.h in those c files which use functions from foo.c.

I'm in the process of making these changes but thinking about linking object files is something I have no experience with yet, so I have a few questions about some roadblocks.

This is near the top of rmw.c, and they contain the function prototypes.

Code:

#include "trivial_funcs.h"
#include "str_funcs.h"
#include "funcs.h"
#include "funcs02.h"

Their definitions are in the corresponding.c files, and I"ve used the #ifndef directive in each .h file as you indicated.

And I've found this needed to be added to the top of trivial_funcs.c (new file I created to handle functions like version(), print_help(), warranty()...)

Code:

#include <stdio.h>
#include <stdlib.h>

and I've added this to the top of funcs02.c

Code:

#include <stdbool.h>
#include <termios.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>
/* #include <syscall.h> */
#include <dirent.h>

And so I'll be able to remove some of these directives from rmw.c

Am I understanding this right so far?

I've changed the line in Makefile.am to read

Code:

rmw_SOURCES=rmw.c trivial_funcs.c funcs02.c funcs.c str_funcs.c

So right now my big problem is that because I'm not #including my functions but linking the object files, globals I defined in rmw.c don't work across the different .c files. Such as MP and CPY, and bypass.

What is the recommended method for handling this?

Thanks to both of you for the advice. Good stuff!

johnsfine · 05-06-2013, 01:46 PM

Quote:

Originally Posted by Andy Alkaline

as I understand this, basically str+inc is a pointer to str[0 + inc] ?

Yes. str+inc === &( str[inc] )

In the way C pretends to have string support, the address of a string and the address of the first character of the string are not just the same address, but also the same type (they are entirely the same thing).

Quote:

Regarding loops, are you saying that generally speaking, it's more efficient to use unsigned ints when negatives aren't required?

Only in x86-64. In 32-bit x86 there is no performance difference between a signed vs. unsigned 32-bit index. In x86-64 there is no performance difference between a signed vs. unsigned 64-bit index, but in typical usage there is a performance difference between a signed vs. unsigned 32-bit index and between a 32-bit vs. 64-bit index.

The exact usage (how the index is used inside the loop as well as in the for instruction itself) determines the relative efficiencies. If you aren't a skilled asm programmer, it would be hard for you to predict in individual cases whether there would be a difference and if so, which is better. But averaged over many loops, unsigned 32-bit is better. Signed 32-bit is almost never better than unsigned and often worse. 64-bit is rarely better than unsigned 32-bit and often worse.

My main point in that discussion was to suggest using a typedef for your index type, so that you can change your mind later about the underlying type, and then only change one place in the code.

Quote:

So right now my big problem is that because I'm not #including my functions but linking the object files, globals I defined in rmw.c don't work across the different .c files. Such as MP and CPY, and bypass.

What is the recommended method for handling this?

As a matter of programming style, as well as learning how to work in big projects, you should consider globals to be a generally bad thing.

Often a global is the easiest way to communicate something. In a quick project, you can just do that because it is easy. In a big project, it is worth extra effort to find some other way to communicate the same information.

C++ gives you more and better ways to avoid globals than C, does. Maybe for small projects in C, it is simplest to just use globals, then learn C++ before doing big projects. Others probably disagree and have stronger opinions on avoiding use of globals in C.

If you do use globals, then just like functions they should be pre declared in a .h file.

Unlike functions, globals need the keyword extern to indicate that you are just declaring them and not defining them.

If I remember correctly (given that I don't code this sort of thing myself anymore):
To declare a char array

Code:

extern char my_array[];

To define it (whether or not it was already declared)

Code:

char my_array[ SIZE ];

The size is optional when pre declaring an array and required when defining it.

When I used to code big projects in C, I used a lot of "context objects".

A context object is a struct containing a bunch of things that would have been globals in a smaller simpler project. Some .h file defines the layout of that struct. The outer function of the section of the project using those variables allocates (stack or malloc, whichever is easier) and initializes that struct (often those global-like things are global enough that main() needs to allocate the context object). Then every function needing those variables takes the address of the context object as a parameter.

Andy Alt · 05-08-2013, 08:46 PM

I'll have to do a lot of re-writing to get rid of the global variables, I wish I had more time and patience right now to work on it. I've changed the names of some of the functions and lowercased some variables, and did some of the other changes suggested.

I spent several hours trying to link the files together, after I put the globals in a separate .h file and tried many combinations of many different things, and several google searches, but was ultimately unsuccessful and received many error messages. But getting rid of the global variables will solve those and I'll be able to try linking again when I get back to it. But of course in the meantime I had to include the functions.c files to get the program to compile. But it's good to know how to do it now, and that it's better programming technique.

Thanks folks.

johnsfine · 05-09-2013, 06:16 AM

Quote:

Originally Posted by Andy Alkaline

I put the globals in a separate .h file

You are not supposed to define globals in an .h file, just pre declare them in the .h file and define them in one .c file (just as you do with functions).