[SOLVED] Optimize simple C code

H_TeXMeX_H · 06-09-2013, 03:54 AM

Alright, here is the final program:

Code:

// output character statistics
#include <stdio.h>
#include <math.h>
#include <limits.h>

int main (void)
{
	// ! init
	unsigned char cluster[BUFSIZ];
	unsigned int i;
	unsigned int ret;

	// init
	const unsigned int uchartot = UCHAR_MAX + 1; // = 256

	unsigned long count[uchartot];

	for (i = 0; i < uchartot; i++)
	{
		count[i] = 0;
	}

	// input loop
	while ((ret = fread (cluster, 1, BUFSIZ, stdin)))
	{
		for (i = 0; i < ret; i++)
		{
			count[cluster[i]]++;
		}
	}

	// total calc
	for (i = 0; i < uchartot; i++)
	{
		printf ("%u\t%lu\n", i, count[i]);
	}

	return 0;
}

The thread is solved.

mina86 · 06-10-2013, 03:41 AM

By the way, you could do w/o uchartot:

Code:

#include <stdio.h>
#include <math.h>
#include <limits.h>

int main (void)
{
	// ! init
	unsigned char cluster[BUFSIZ];
	unsigned int i;
	unsigned int ret;

	// init
	unsigned long count[UCHAR_MAX + 1];
	for (i = 0; i < sizeof count / sizeof *count; i++)
	{
		count[i] = 0;
	}

	// input loop
	while ((ret = fread (cluster, 1, BUFSIZ, stdin)))
	{
		for (i = 0; i < ret; i++)
		{
			count[cluster[i]]++;
		}
	}

	// total calc
	for (i = 0; i < sizeof count / sizeof *count; i++)
	{
		printf ("%u\t%lu\n", i, count[i]);
	}

	return 0;
}

I usually define a macro for that:

Code:

#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof *(arr))

H_TeXMeX_H · 06-10-2013, 07:54 AM

That code is slightly slower, probably because you're doing extra math as part of the loop. That's why I prefer to do the math before the loop. Technically, the compiler should be able to detect and calculate it and avoid the slight overhead.

mina86 · 06-10-2013, 10:21 AM

Quote:

Originally Posted by H_TeXMeX_H

That code is slightly slower, probably because you're doing extra math as part of the loop.

If your compiler does not optimise the calculation, then change your compiler… I don't believe that it fails to optimise this.
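
A small sketch of why the division costs nothing at run time, assuming count is a fixed-size array as in the version without uchartot. For such an array, sizeof yields an integer constant expression, so it can even be used where the language demands a compile-time constant. The names COUNT_LEN and count_len are made up for illustration. This does not apply to count[uchartot] from the first program: uchartot is a const variable, not a constant expression, so that array is a variable-length array and its sizeof is evaluated at run time.

```c
#include <limits.h>
#include <stddef.h>

#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof *(arr))

/* Fixed-size array, as in the version without uchartot. */
static unsigned long count[UCHAR_MAX + 1];

/* An enumerator must be an integer constant expression, so this line only
   compiles because the division is done at compile time. */
enum { COUNT_LEN = ARRAY_SIZE(count) };

/* Hypothetical helper, only here so the value can be inspected at run time. */
size_t count_len (void)
{
	return ARRAY_SIZE (count);
}
```

Comparing the generated assembly for both loop conditions (for example with the compiler's -S option) is the most direct way to check that the loop bound ends up as an immediate value.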

H_TeXMeX_H · 07-21-2013, 08:06 AM

I found a way to optimize it just a bit more, especially for 64-bit processors.

Code:

// output character statistics
#include <stdio.h>
#include <math.h>
#include <stdint.h>

int main (void)
{
	// ! init
	uint64_t cluster[BUFSIZ];
	unsigned int i;
	unsigned int ret;

	// init
	const unsigned int uint8tot = UINT8_MAX + 1; // = 256

	unsigned long count[uint8tot];

	for (i = 0; i < uint8tot; i++)
	{
		count[i] = 0;
	}

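	// input loop: read raw bytes so a trailing partial word is not dropped
	while ((ret = fread (cluster, 1, sizeof cluster, stdin)))
	{
		const unsigned int words = ret / sizeof *cluster;
		const unsigned char *bytes = (const unsigned char *) cluster;

		for (i = 0; i < words; i++)
		{
			count[ cluster[i]        & 0xff]++;
			count[(cluster[i] >>  8) & 0xff]++;
			count[(cluster[i] >> 16) & 0xff]++;
			count[(cluster[i] >> 24) & 0xff]++;
			count[(cluster[i] >> 32) & 0xff]++;
			count[(cluster[i] >> 40) & 0xff]++;
			count[(cluster[i] >> 48) & 0xff]++;
			count[(cluster[i] >> 56) & 0xff]++;
		}

		// count the leftover bytes (fewer than 8) one at a time
		for (i = words * sizeof *cluster; i < ret; i++)
		{
			count[bytes[i]]++;
		}
	}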

	// total calc
	for (i = 0; i < uint8tot; i++)
	{
		printf ("%3u\t%lu\n", i, count[i]);
	}

	return 0;
}

This runs in 0.113s, versus 0.132s before.

Probably this compiles to better assembly code?
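
One thing to watch when counting a word at a time: if the input length is not a multiple of 8, a few trailing bytes do not fill a whole word and have to be counted separately (fread with an item size of 8 reports only complete items). A minimal sketch of the idea on an in-memory buffer follows. The function name count_words is made up for illustration, and memcpy is used for the load so the buffer does not need to be 8-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: add the bytes of buf[0..len) to count[256],
   eight at a time where possible. */
void count_words (const unsigned char *buf, size_t len, unsigned long count[256])
{
	size_t i = 0;

	/* Full 8-byte words. memcpy avoids alignment problems. */
	for (; i + 8 <= len; i += 8)
	{
		uint64_t w;
		memcpy (&w, buf + i, sizeof w);
		for (int k = 0; k < 8; k++)
		{
			count[(w >> (8 * k)) & 0xff]++;
		}
	}

	/* Leftover bytes (fewer than 8) that do not fill a word. */
	for (; i < len; i++)
	{
		count[buf[i]]++;
	}
}
```

Byte order does not matter here, because every byte of the word is counted regardless of its position.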

psionl0 · 07-21-2013, 10:38 AM

Would it not be faster to read the entire file into memory then do the computations?

H_TeXMeX_H · 07-21-2013, 11:47 AM

Quote:

Originally Posted by psionl0

Would it not be faster to read the entire file into memory then do the computations?

Not really, and this would also limit the input file size. I would also prefer that the data be piped in.

Code:

// output character statistics
#include <stdio.h>
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
	// parse args
	if (2 != argc)
	{
		printf ("Usage:\n%s filename\n", argv[0]);
		return 1;
	}

	// ! init
	unsigned int i;
	long fsize;

	// init
	const unsigned int uint8tot = UINT8_MAX + 1; // = 256

	unsigned long count[uint8tot];
	for (i = 0; i < uint8tot; i++)
	{
		count[i] = 0;
	}

	// open file
	FILE * fp = fopen (argv[1], "rb");
	if (NULL == fp)
	{
		printf ("ERROR: Cannot open %s\n", argv[1]);
		return 2;
	}

	// obtain file size
	fseek (fp, 0, SEEK_END);
	fsize = ftell (fp);
	rewind (fp);

	// malloc buffer
	uint8_t * buffer = (uint8_t *) malloc (fsize);
	if (NULL == buffer)
	{
		printf ("ERROR: Not enough memory to contain %ld byte %s\n", fsize, argv[1]);
		return 3;
	}

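	// copy file into buffer (a short read is an error, an empty file is not)
	if (fsize > 0 && fread (buffer, 1, fsize, fp) != (size_t) fsize)
	{
		printf ("ERROR: Could not read %s\n", argv[1]);
		return 4;
	}

	// close file
	fclose (fp);

	// input loop (long index, so files larger than UINT_MAX bytes work)
	for (long pos = 0; pos < fsize; pos++)
	{
		count[buffer[pos]]++;
	}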

	// total calc
	for (i = 0; i < uint8tot; i++)
	{
		printf ("%3u\t%lu\n", i, count[i]);
	}

	free (buffer);

	return 0;
}

This is about twice as slow: 0.201s versus 0.113s.

ta0kira · 07-23-2013, 08:25 AM

I think your test cases are too small to make real comparisons. I assume you're not concerned with saving yourself 0.1 seconds of personal time, so why not run some tests that take several minutes or an hour?

Quote:

Originally Posted by mina86

This will fail if the file is very big, say 1T.

This really depends on pointer size and resource limits on address space. You definitely can't always count on being allowed 1TB of address space. I prefer using mmap, but the real problem here is that there's no good reason to prevent the program from reading from a descriptor that corresponds to a pipe or socket (or to a file on a filesystem without mmap support.)
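
A rough sketch of the mmap approach on a POSIX system, under the assumption that the input is a regular file on a filesystem that supports mapping. The function name count_mapped and its return convention are made up for illustration. It reports failure for anything that is not a regular file, which is exactly the pipe/socket limitation described above, so a real program would need a read() fallback.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 on success, -1 if the file cannot be
   opened or mapped (the caller could then fall back to read()). */
int count_mapped (const char *path, unsigned long count[256])
{
	int fd = open (path, O_RDONLY);
	if (fd < 0)
	{
		return -1;
	}

	struct stat st;
	if (fstat (fd, &st) != 0 || !S_ISREG (st.st_mode))
	{
		close (fd);
		return -1;
	}

	if (st.st_size == 0)
	{
		close (fd);          /* nothing to count; mmap rejects length 0 */
		return 0;
	}

	size_t len = (size_t) st.st_size;
	const unsigned char *data = mmap (NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	close (fd);              /* the mapping stays valid after close */
	if (data == MAP_FAILED)
	{
		return -1;
	}

	for (size_t i = 0; i < len; i++)
	{
		count[data[i]]++;
	}

	munmap ((void *) data, len);
	return 0;
}
```

Whether the mapping succeeds for a very large file still depends on pointer size and address-space limits, so the -1 path is not just theoretical.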

Kevin Barry

H_TeXMeX_H · 07-23-2013, 08:43 AM

The time scales quite linearly.

As an example, piping a PRNG into this program (slight overhead from generating the random numbers):
100 MiB = 0.177s
1000 MiB = 1.708s
10000 MiB = 17.451s
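
To make "linearly" concrete, dividing each size by its time gives the throughput, which comes out at roughly 565, 585 and 573 MiB/s for the three runs. A tiny sketch of that arithmetic (the function names are made up for illustration, and the numbers are simply the ones listed above):

```c
#include <stdio.h>

/* Figures from the three runs listed above. */
static const double size_mib[] = { 100.0, 1000.0, 10000.0 };
static const double seconds[]  = { 0.177, 1.708, 17.451 };

/* Throughput of measurement k in MiB/s. */
double throughput (int k)
{
	return size_mib[k] / seconds[k];
}

/* Print one line per measurement. */
void print_throughput (void)
{
	for (int k = 0; k < 3; k++)
	{
		printf ("%6.0f MiB\t%7.3f s\t%6.1f MiB/s\n",
		        size_mib[k], seconds[k], throughput (k));
	}
}
```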

H_TeXMeX_H · 07-23-2013, 10:05 AM

I have also tried an SSE2 version, but it is actually a bit slower (0.116s versus 0.113s):

Code:

// output character statistics
#include <stdio.h>
#include <math.h>
#include <stdint.h>

#include <emmintrin.h>

int main (void)
{
	// ! init
	__m128i cluster[BUFSIZ];
	unsigned int i;
	unsigned int ret;
	unsigned int chunk;

	// init
	const unsigned int uint8tot = UINT8_MAX + 1; // = 256

	unsigned long count[uint8tot];

	for (i = 0; i < uint8tot; i++)
	{
		count[i] = 0;
	}

	// input loop: read raw bytes so a trailing partial vector is not dropped
	while ((ret = fread (cluster, 1, sizeof cluster, stdin)))
	{
		const unsigned int vectors = ret / sizeof *cluster;
		const unsigned char *bytes = (const unsigned char *) cluster;

		for (i = 0; i < vectors; i++)
		{
			chunk = _mm_extract_epi16 (cluster[i], 0);
			count[chunk & 0xff]++;
			count[(chunk >> 8) & 0xff]++;

			chunk = _mm_extract_epi16 (cluster[i], 1);
			count[chunk & 0xff]++;
			count[(chunk >> 8) & 0xff]++;

			chunk = _mm_extract_epi16 (cluster[i], 2);
			count[chunk & 0xff]++;
			count[(chunk >> 8) & 0xff]++;

			chunk = _mm_extract_epi16 (cluster[i], 3);
			count[chunk & 0xff]++;
			count[(chunk >> 8) & 0xff]++;

			chunk = _mm_extract_epi16 (cluster[i], 4);
			count[chunk & 0xff]++;
			count[(chunk >> 8) & 0xff]++;

			chunk = _mm_extract_epi16 (cluster[i], 5);
			count[chunk & 0xff]++;
			count[(chunk >> 8) & 0xff]++;

			chunk = _mm_extract_epi16 (cluster[i], 6);
			count[chunk & 0xff]++;
			count[(chunk >> 8) & 0xff]++;

			chunk = _mm_extract_epi16 (cluster[i], 7);
			count[chunk & 0xff]++;
			count[(chunk >> 8) & 0xff]++;
		}

		// count the leftover bytes (fewer than 16) one at a time
		for (i = vectors * sizeof *cluster; i < ret; i++)
		{
			count[bytes[i]]++;
		}
	}

	// total calc
	for (i = 0; i < uint8tot; i++)
	{
		printf ("%3u\t%lu\n", i, count[i]);
	}

	return 0;
}

That pretty much concludes my attempts to optimize this code. I think it is the best it can be, while remaining portable (no assembler code). This is completely solved now.

ta0kira · 07-23-2013, 01:06 PM

I'm sure the postincrement operations are optimized out, but just in case, try using preincrement. You could also directly increment a pointer to the buffer instead of using an index, e.g.

Code:

	for (uint8_t *current = buffer, *end = buffer + fsize; current < end;)
	{
		++count[*current++];
	}

Lastly, trying to load the entire file at once isn't a good idea since you could inadvertently DoS yourself, and it won't work if you want to read from a pipe, tty, or socket (i.e. what stdin is most of the time.)

Kevin Barry

H_TeXMeX_H · 07-24-2013, 03:44 AM

Tried it, but it doesn't help. I can't do exactly that, because I don't always have an fsize, but even if I use the buffer size, it is slower.

konsolebox · 08-03-2013, 03:18 AM

I think the multiple system calls made by printf are what's really making it slow. Perhaps formatting with sprintf into a buffer first, then printing to stdout with a single fwrite call, would make it faster.

Edit: Oh sorry, perhaps it won't really be helpful with just a small number of lines (256). If you could consider itoa(), which is not standard, it could also help as it skips the parsing of the format string. If you could create your own itoa() function, that would be even better. You just have to be careful. Something that returns the length of the generated string would also be nice, since it tells you where to write next in the buffer. This is obviously not guaranteed to be portable, and output may differ on some architectures or platforms.
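
A minimal sketch of that idea in standard C: a hand-written conversion that returns the length it wrote, used to fill one buffer that is then handed to a single fwrite. The names format_ulong, format_counts and print_counts are made up for illustration, and the output is unpadded ("%u\t%lu\n" style), not the "%3u" layout of the later versions.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write v in decimal to out (no terminating NUL)
   and return the number of characters written. */
size_t format_ulong (unsigned long v, char *out)
{
	char tmp[3 * sizeof v];   /* generous bound on the digit count */
	size_t n = 0;

	do
	{
		tmp[n++] = (char) ('0' + v % 10);
		v /= 10;
	} while (v != 0);

	/* digits were produced least significant first; reverse them */
	for (size_t k = 0; k < n; k++)
	{
		out[k] = tmp[n - 1 - k];
	}
	return n;
}

/* Hypothetical helper: format all 256 "value<TAB>count" lines into out
   and return the total length. */
size_t format_counts (const unsigned long count[256], char *out)
{
	size_t len = 0;

	for (unsigned int i = 0; i < 256; i++)
	{
		len += format_ulong (i, out + len);
		out[len++] = '\t';
		len += format_ulong (count[i], out + len);
		out[len++] = '\n';
	}
	return len;
}

/* Format everything first, then write it with one fwrite call. */
void print_counts (const unsigned long count[256])
{
	/* per line: up to 3 digits, a tab, the count, a newline */
	char out[256 * (3 + 1 + 3 * sizeof (unsigned long) + 1)];
	fwrite (out, 1, format_counts (count, out), stdout);
}
```

With only 256 short lines, any gain from this is likely to be tiny compared with the input loop.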

H_TeXMeX_H · 08-03-2013, 01:39 PM

You are right that printf slows it down a bit, but I also do other things with the core program that require only a single printf.

ta0kira · 08-03-2013, 02:01 PM

The output duration is going to be negligible with data sizes large enough to be important. Disk seek times and hardware interrupts will wash out all of that optimization.

Why are you using stdio.h for input? You're just adding unnecessary cycles to the input loop. Also, why not fstat the file to get the optimal block size (st_blksize) for input?
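
A rough sketch of what that could look like on a POSIX system: read() straight from a descriptor, with the buffer sized from st_blksize. The function name count_fd, the return convention and the 65536-byte fallback are assumptions made for illustration. Interrupted reads (EINTR) are simply treated as errors here.

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: count bytes from fd until EOF.
   Returns 0 on success, -1 on error. */
int count_fd (int fd, unsigned long count[256])
{
	struct stat st;
	size_t bufsize = 65536;               /* fallback, an arbitrary choice */

	/* use the block size the system reports for this descriptor */
	if (fstat (fd, &st) == 0 && st.st_blksize > 0)
	{
		bufsize = (size_t) st.st_blksize;
	}

	unsigned char *buf = malloc (bufsize);
	if (buf == NULL)
	{
		return -1;
	}

	ssize_t got;
	while ((got = read (fd, buf, bufsize)) > 0)
	{
		for (ssize_t i = 0; i < got; i++)
		{
			count[buf[i]]++;
		}
	}

	free (buf);
	return got < 0 ? -1 : 0;
}
```

Calling count_fd (STDIN_FILENO, count) keeps the program usable with pipes and sockets, since nothing here needs to know the input size in advance.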

You should try your program with a huge file that's actually on the filesystem. If your process is running at less than 100% CPU then there's a good chance you're needlessly optimizing.

Kevin Barry