Programming
This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
I don't think you can accomplish anything optimizing the CPU speed of the actual operation. The operation, even including the full loop overhead if the compiler doesn't optimize that for you, is trivial compared to the memory accesses.
If BUFSIZ is not incredibly large then you shouldn't care whether you optimized.
If BUFSIZ is incredibly large then allocating it on the stack is questionable and the run time will be dominated by the cache misses (mainly cache misses that occur inside your call to fread).
The small potential for improvement here, assuming BUFSIZ is incredibly large, is to break the work into an outer loop over a well-chosen chunk size: chunks big enough to amortize the overhead of fread, but small enough that the cache misses that must occur inside fread don't occur a second time when you read from the buffer.
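A rough sketch of that outer-loop idea (the chunk size here is an assumption; tune it to your own cache sizes):

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (64 * 1024)   /* assumed chunk size; tune to your L2 cache */

/* Tally byte frequencies from stream f into count[256];
 * returns the total number of bytes seen. */
unsigned long tally_chunked(FILE *f, unsigned long count[256])
{
    unsigned char *buf = malloc(CHUNK);   /* heap, not stack */
    unsigned long total = 0;
    size_t n;

    if (!buf)
        return 0;
    /* outer loop: one fread per chunk amortizes the call overhead */
    while ((n = fread(buf, 1, CHUNK, f)) > 0) {
        for (size_t i = 0; i < n; i++)
            count[buf[i]]++;
        total += n;
    }
    free(buf);
    return total;
}
```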
I don't know about mmap of stdin. For an ordinary file, mmap might reduce the cache misses and some other processing time compared to fread. Plus it causes the kernel to effectively manage that chunk size for you, which is simpler and maybe better than doing so yourself.
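For an ordinary file, the mmap variant might look roughly like this (a sketch, not a drop-in replacement — the error handling is minimal, and as noted it doesn't apply when stdin comes from a pipe):

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Tally byte frequencies of an ordinary file via mmap; returns the
 * total bytes counted, or 0 on error or an empty file. Pipes can't
 * be mmap'ed, so this doesn't help when input arrives on a pipeline. */
unsigned long tally_mmap(const char *path, unsigned long count[256])
{
    unsigned long total = 0;
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return 0;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        unsigned char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            /* the kernel pages the file in as we walk it */
            for (off_t i = 0; i < st.st_size; i++)
                count[p[i]]++;
            total = (unsigned long)st.st_size;
            munmap(p, (size_t)st.st_size);
        }
    }
    close(fd);
    return total;
}
```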
What test cases are you using? For instance does your total counted always match the length of the input file?
I wouldn't use fread() for this - just read() or mmap().
The size of the input can be anything. The total should match the length of the file, plus or minus one. The statistics I am doing aren't altered by small amounts of error.
I tried using read(), but it didn't speed anything up; it's just as fast. I have never used mmap(), and from what I've read there is usually no significant improvement, because the kernel handles the caching anyway.
Quote:
Originally Posted by johnsfine
I don't think you can accomplish anything optimizing the CPU speed of the actual operation. The operation, even including the full loop overhead if the compiler doesn't optimize that for you, is trivial compared to the memory accesses.
If BUFSIZ is not incredibly large then you shouldn't care whether you optimized.
If BUFSIZ is incredibly large then allocating it on the stack is questionable and the run time will be dominated by the cache misses (mainly cache misses that occur inside your call to fread).
I have tried using '-O3', but it doesn't improve anything. You might be right that this is the best that can be done.
BUFSIZ is around 4096.
I have also tried using pthreads, but the code ran slower than this one for some reason. Maybe I just messed up. I don't know how to use SSE or MMX, nor if they would help in this case.
I didn't see an outer loop. If the whole task is that size, why optimize?
If the whole task is much larger, I think a chunk size of 4096 is rather low and means the time is dominated by fread overhead.
Quote:
I have also tried using pthreads, but the code ran slower than this one for some reason.
If the cores (or worse yet, hyperthreads) are sharing L2 cache, then multi threading this kind of problem should be expected to make it overall slower.
Quote:
I don't know how to use SSE or MMX, nor if they would help in this case.
I'm still pretty sure cache misses and read overhead are so dominant that the low level cpu instructions of the intended operation are basically irrelevant.
Quote:
Originally Posted by johnsfine
I didn't see an outer loop. If the whole task is that size, why optimize?
If the whole task is much larger, I think a chunk size of 4096 is rather low and means the time is dominated by fread overhead.
If the cores (or worse yet, hyperthreads) are sharing L2 cache, then multi threading this kind of problem should be expected to make it overall slower.
I'm still pretty sure cache misses and read overhead are so dominant that the low level cpu instructions of the intended operation are basically irrelevant.
I just tried 2 and 4 times the standard BUFSIZ, and it doesn't help. Anything lower than BUFSIZ is slower, so I think it is the right size.
There are 4 cores, with each pair sharing 3 MB of cache, for a total of 6 MB across all 4 cores.
I'll leave the thread up a bit, and if nobody has any magical optimizations, I think this is as optimized as it is going to get.
I have one more question on this code. It works well, but if the input size is not divisible by BUFSIZ, some of the input is discarded. It isn't too much of an issue, but I'd like to know if I can get the rest of the input without too much of a performance hit. Performance is more important than getting a few more bytes of input. The input is read from stdin, so I can't calculate sizes beforehand and handle it that way.
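The exact loop isn't shown here, but if it only processes full BUFSIZ chunks, the usual fix is essentially free: fread returns how many items it actually read, so the short final chunk can go through the same inner loop with no extra cost. A minimal sketch:

```c
#include <stdio.h>

/* Process whatever fread actually returned, so the final partial
 * chunk isn't discarded; returns the total number of bytes counted. */
unsigned long tally_stream(FILE *f, unsigned long count[256])
{
    unsigned char buf[BUFSIZ];
    unsigned long total = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {  /* n may be < BUFSIZ */
        for (size_t i = 0; i < n; i++)
            count[buf[i]]++;
        total += n;
    }
    return total;
}
```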
I tried that code, but it is slower. You might be right that I should use UCHAR_MAX, but it is more convenient to use 256 than 255 + 1 in some calculations. I suppose I could define a constant based on UCHAR_MAX.
Last edited by H_TeXMeX_H; 06-06-2013 at 07:42 AM.
Actually, I didn't say it would be faster. The test application is too simple to optimise, and Linux most likely does a good job of it anyway. But if you have a bigger problem, posix_fadvise might be useful.
Quote:
Originally Posted by H_TeXMeX_H
You might be right that I should use UCHAR_MAX, but it is more convenient to use 256 instead of 255 + 1 in some calculations.
You should use UCHAR_MAX because it avoids hard-coding literal values that the standard does not define for you. Avoiding literal values is almost always a good thing.
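One way to keep the convenient single constant while still deriving it from the standard header (the name is arbitrary):

```c
#include <limits.h>

/* 256 wherever CHAR_BIT is 8, but derived rather than hard-coded */
enum { BYTE_VALUES = UCHAR_MAX + 1 };

unsigned long count[BYTE_VALUES];
```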