Programming
This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
You should try your program with a huge file that's actually on the filesystem. If your process is running at less than 100% CPU, then there's a good chance you're needlessly optimizing.
Yeah, you're right, it uses less than 100%, so it may be needless optimization. However, I often pipe data from a fast PRNG to this program.
- Release all threads to count bytes for their assigned sections of the input buffer.
- While the threads are counting, read the next section into a separate buffer.
- Repeat until the input is exhausted.
- Sum up the counts from each of the threads and print the output.
This is something I did recently for a different sort of file analysis and it worked well. That involved looking up words from the file in a binary tree and tens-of-thousands of floating-point computations per byte, but the essential pattern is the same.
But, like I implied in my previous post, your bottleneck is probably going to be reading input fast enough, not the algorithm processing the data. I think you'd get maximum performance with one thread for reading and one thread for counting, with the read thread one iteration ahead and a pthread barrier to make sure they stay synchronized.
It's actually a little tricky to avoid race conditions when the reading and processing happen in different threads. Here's the basic model I used most recently. The main goal of this model is to make the total processing time identical to the time it takes to read the input (plus a constant for postprocessing and output). In your case, you will probably achieve that with 1 worker thread.
Global data:
Code:
//data that will be operated on, e.g. a buffer of input data
static struct { /*todo: stuff*/ } data;
static pthread_barrier_t barrier; //each thread blocks on this until all threads have done so
Main thread:
Code:
//todo: initialize 'barrier' (barrier count needs to be #worker-threads + 1)
//todo: start worker threads with 'pthread_create'
while (/*todo: read a single block of data*/)
{
//todo: preprocess the input if necessary
//first iteration: wait for the worker threads to initialize
//subsequent iterations: wait for the worker threads to finish the last iteration
pthread_barrier_wait(&barrier); //[A], [C]
//(worker threads immediately block again on the barrier at [B])
//todo: update 'data' as necessary for the next iteration, while the worker threads are stopped
//(in this case, you might have 2 buffers that you alternate assigning to 'data')
//wait for the worker threads to get to the "starting line"
pthread_barrier_wait(&barrier); //[B]
//(at this point the worker threads are working, and the main thread can read the next block)
}
//wait for the worker threads to finish the last iteration
pthread_barrier_wait(&barrier); //[C], or [A] if the input is empty
//todo: postprocessing, output, etc.
//todo: 'pthread_cancel' all threads, but don't 'pthread_join' since they don't actually return
Worker threads:
Code:
//todo: initialize thread data (e.g. buffers)
//wait for all other worker threads to initialize
pthread_barrier_wait(&barrier); //[A]
while (1)
{
//wait for the other worker threads to get to the "starting line"
pthread_barrier_wait(&barrier); //[B]
//todo: process the data here!
//wait for the other worker threads to finish this iteration
pthread_barrier_wait(&barrier); //[C]
}
I actually generalized this a bit more and used function pointers so I could use the worker threads for different tasks at different points in the code. For me, figuring out this sort of pattern was the hardest part about learning threads. I'm not sure if this is a common pattern, but it's the one I usually use and it seems to work well.
Kevin Barry
Last edited by ta0kira; 08-05-2013 at 09:49 AM.
Reason: added "todo" to make places in need of code more explicit
I have learned from it, though; it seems pthread is much easier to use and understand than OpenMP.
There isn't a single "100MB" scenario. Was it read from the filesystem? Was it piped? Was it a file that you've been repeatedly testing your program on (increasing the chances of it being entirely cached in RAM)? The linear scaling you demonstrated in the other post doesn't apply to data that must be read from the filesystem. If you were piping the data, I don't doubt that the threading slowed it down, since "reading" really meant copying bytes from one RAM location to another. The point is that when you read from a disk the read operations take long enough so that all of the thread coordination and data-processing happens during those periods.
That doesn't sound right. Threading alone doesn't make things 8x slower. Can you post your code? You must be doing a malloc every iteration or something.
There are several problems with your implementation:
- Creating a new set of threads every iteration will destroy your performance because you're effectively creating a new process that shares the same process image each time. You need the pthread_barrier_t because the threads are supposed to persist, and therefore must start and stop.
- The code you posted will miss the end of the block if the block size isn't a multiple of the number of threads, and it will process garbage if less than a full block is read.
- You haven't initialized your count arrays.
Here's my version:
Code:
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <assert.h>
#include <string.h>
#include <pthread.h>

#define BUFFER_SIZE (64*4096)

#define BARRIER_WAIT \
{ int result = pthread_barrier_wait(&barrier); \
  assert(!result || result == PTHREAD_BARRIER_SERIAL_THREAD); }

static struct {
    unsigned char *buffer;
    unsigned long size;
} data;

typedef struct {
    unsigned long *counts;
    unsigned int thread, threads;
} thread_spec;

static pthread_barrier_t barrier;

static void *thread_loop(thread_spec*);

int main(int argc, char *argv[])
{
    if (argc < 3)
    {
        fprintf(stderr, "%s [threads] [filename(s)...]\n", argv[0]);
        return 1;
    }

    int threads = 0;
    char error = 0;
    ssize_t read_size = 0;
    int I, J, parity = 0;

    if (sscanf(argv[1], "%i%c", &threads, &error) != 1 || threads < 0)
    {
        fprintf(stderr, "%s: invalid thread count '%s'\n", argv[0], argv[1]);
        return 1;
    }

    unsigned char *buffer1 = NULL, *buffer2 = NULL;
    assert( buffer1 = (unsigned char*) malloc(BUFFER_SIZE) );
    assert( buffer2 = (unsigned char*) malloc(BUFFER_SIZE) );

    pthread_t *all_threads = NULL;
    unsigned long **all_counts = NULL;

    if (threads)
    {
        assert( pthread_barrier_init(&barrier, NULL, threads + 1) == 0 );
        assert( all_threads = (pthread_t*) malloc(threads * sizeof(pthread_t)) );
        assert( all_counts = (unsigned long**) malloc(threads * sizeof(unsigned long*)) );
        for (I = 0; I < threads; I++)
        {
            thread_spec *new_spec = (thread_spec*) malloc(sizeof(thread_spec));
            assert(new_spec);
            assert( all_counts[I] = new_spec->counts = (unsigned long*) calloc(256, sizeof(unsigned long)) );
            new_spec->thread = I;
            new_spec->threads = threads;
            assert( pthread_create(all_threads + I, NULL, (void*(*)(void*)) &thread_loop, (void*) new_spec) == 0 );
        }
    }
    else
    {
        assert( all_counts = (unsigned long**) malloc(sizeof(unsigned long*)) );
        assert( all_counts[0] = (unsigned long*) calloc(256, sizeof(unsigned long)) );
    }

    for (I = 2; I < argc; I++)
    {
        int current_file = open(argv[I], O_RDONLY);
        if (current_file < 0)
        {
            fprintf(stderr, "%s: unable to open file '%s': %s\n", argv[0], argv[I], strerror(errno));
            continue;
        }
        while ( (read_size = read(current_file, parity? buffer1 : buffer2, BUFFER_SIZE)) > 0 )
        {
            if (threads)
            {
                BARRIER_WAIT
                data.buffer = parity? buffer1 : buffer2;
                data.size = read_size;
                parity = !parity;
                BARRIER_WAIT
            }
            else
            {
                for (J = 0; J < read_size; J++) ++all_counts[0][ (parity? buffer1 : buffer2)[J] ];
            }
        }
        close(current_file);
    }

    if (threads) BARRIER_WAIT

    for (I = 0; I < 256; I++)
    {
        for (J = 1; J < threads; J++)
            all_counts[0][I] += all_counts[J][I];
        fprintf(stdout, "%i\t%lu\n", I, all_counts[0][I]);
    }

    if (threads)
    {
        for (I = 0; I < threads; I++)
        {
            pthread_cancel(all_threads[I]);
            free(all_counts[I]);
        }
        free(all_counts);
        free(all_threads);
        pthread_barrier_destroy(&barrier);
    }
    else
    {
        free(all_counts[0]);
        free(all_counts);
    }

    return 0;
}

static void *thread_loop(thread_spec *spec)
{
    assert(spec);
    unsigned long *counts = spec->counts;
    unsigned int thread = spec->thread, threads = spec->threads;
    assert(counts);
    assert(threads);
    int I;

    BARRIER_WAIT

    while (1)
    {
        BARRIER_WAIT
        for (I = thread; I < data.size; I += threads) ++counts[ data.buffer[I] ];
        BARRIER_WAIT
    }

    //(we'll never get here)
    free(spec);
    return NULL;
}
Name it count.c and build it with gcc -Wall -g -O3 count.c -lpthread -o count. To run it:
Code:
$ ./count 0 filename    #(no worker threads; just the main thread)
$ ./count n filename    #(n worker threads)
I have a dual-core hyperthreaded i5, and for a 1GB file on the hard drive I get the best performance with 2 threads: about 0.65s with 2 threads versus 0.91s with 0 threads. With 1 thread or 4+ threads I get slightly worse performance than with 0 threads. I get very different results without -O3, though.
edit: When piping 1GB using dd from either /dev/zero or /dev/urandom, it takes 1.3s with 2 threads and 1.8s with 0 threads.
Kevin Barry
Last edited by ta0kira; 08-06-2013 at 12:40 PM.
Reason: fixed a few iteration bugs, added some NULL checks
Now that is faster. I couldn't figure out how to use pthread_barrier_wait, so I resorted to join instead.
pthread_join waits for the thread to exit, whereas pthread_barrier_wait causes the calling thread to block until a specific number of other threads are blocked on the same barrier. In the case of my code (and the template further above), the barrier needs n+1 threads to block on it, at which point all of the threads can continue. It's like saying, "there are n+1 people in our hiking group; let's all wait here until the rest catch up."
Quote:
Originally Posted by H_TeXMeX_H
However, it is the same as fast when piping from a PRNG, both 142 MB/s.
Keep in mind that the read speed is what dictates whether or not a particular number of threads is effective. It's possible that the PRNG can't generate data as quickly as the counter program can process it.
Alright, thanks again for the info, I'll have to practice using pthread_barrier_wait as it seems very useful and maybe I can optimize other code using it.