LinuxQuestions.org
View the Most Wanted LQ Wiki articles.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices



Reply
 
Search this Thread
Old 08-03-2013, 03:11 PM   #31
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Original Poster
Rep: Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269

Quote:
Originally Posted by ta0kira View Post
You should try your program with a huge file that's actually on the filesystem. If your process is running at less-than 100% CPU then there's a good chance you're needlessly optimizing.
Yeah, you're right, it uses less than 100%, so it may be needless optimization. However, I often pipe data from a fast PRNG to this program.
 
Old 08-03-2013, 03:23 PM   #32
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
If you really wanted to overdo it, you could:
  1. Set up n worker threads.
  2. Have the main thread read a page or two of data.
  3. Release all threads to count bytes for assigned sections of the input buffer.
  4. While the threads are counting, read the next section into a separate buffer.
  5. Etc.
  6. Sum up the counts from each of the threads and print the output.
This is something I did recently for a different sort of file analysis and it worked well. That involved looking up words from the file in a binary tree and tens-of-thousands of floating-point computations per byte, but the essential pattern is the same.

But, like I implied in my previous post, your bottleneck is probably going to be reading input fast enough, not the algorithm processing the data. I think you'd get maximum performance with one thread for reading and one thread for counting, with the read thread one iteration ahead and a pthread barrier to make sure they stay synchronized.

Kevin Barry
 
1 members found this post helpful.
Old 08-05-2013, 03:52 AM   #33
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Original Poster
Rep: Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269
Alright, I will try to thread the program. I have tried in the past, but I am new to threading and I couldn't get something that works.
 
Old 08-05-2013, 10:40 AM   #34
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
It's actually a little tricky to avoid race conditions when the reading and processing happen in different threads. Here's the basic model I used most recently. The main goal of this model is to make the total processing time identical to the time it takes to read the input (plus a constant for postprocessing and output.) In your case, you will probably achieve that with 1 worker thread.

Global data:
Code:
//data that will be operated on, e.g. a buffer of input data
static struct { /*todo: stuff*/ } data;

static pthread_barrier_t barrier; //each thread blocks on this until all threads have done so
Main thread:
Code:
//todo: initialize 'barrier' (barrier count needs to be #worker-threads + 1)

//todo: start worker threads with 'pthread_create'


while (/*todo: read a single block of data*/)
{
  //todo: preprocess the input if necessary


  //first iteration:       wait for the worker threads to initialize
  //subsequent iterations: wait for the worker threads to finish the last iteration
  pthread_barrier_wait(&barrier); //[A], [C]

  //(worker threads immediately block again on the barrier at [B])


  //todo: update 'data' as necessary for the next iteration, while the worker threads are stopped
  //(in this case, you might have 2 buffers that you alternate assigning to 'data')


  //wait for the worker threads get to the "starting line"
  pthread_barrier_wait(&barrier); //[B]

  //(at this point the worker threads are working, and the main thread can read the next block)
}


//wait for the worker threads to finish the last iteration
pthread_barrier_wait(&barrier); //[C], or [A] if the input is empty


//todo: postprocessing, output, etc.

//todo: 'pthread_cancel' all threads, but don't 'pthread_join' since they don't actually return
Worker threads:
Code:
//todo: initialize thread data (e.g. buffers)


//wait for all other worker threads to initialize
pthread_barrier_wait(&barrier); //[A]


while (1)
{
  //wait for the other worker threads get to the "starting line"
  pthread_barrier_wait(&barrier); //[B]

  //todo: process the data here!

  //wait for the other worker threads to finish this iteration
  pthread_barrier_wait(&barrier); //[C]
}
I actually generalized this a bit more and used function pointers so I could use the worker threads for different tasks at different points in the code. For me, figuring out this sort of pattern was the hardest part about learning threads. I'm not sure if this is a common pattern, but it's the one I usually use and it seems to work well.

Kevin Barry

Last edited by ta0kira; 08-05-2013 at 10:49 AM. Reason: added "todo" to make places in need of code more explicit
 
1 members found this post helpful.
Old 08-05-2013, 02:53 PM   #35
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Original Poster
Rep: Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269
For 100 MB:
1 threads = 0.430s
2 threads = 0.448s
4 threads = 0.526s

I have learned from it tho, it seems pthread is much much easier to use and understand than openmp.
 
Old 08-05-2013, 04:02 PM   #36
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by H_TeXMeX_H View Post
For 100 MB:
1 threads = 0.430s
2 threads = 0.448s
4 threads = 0.526s

I have learned from it tho, it seems pthread is much much easier to use and understand than openmp.
There isn't a single "100MB" scenario. Was it read from the filesystem? Was it piped? Was it a file that you've been repeatedly testing your program on (increasing the chances of it being entirely cached in RAM)? The linear scaling you demonstrated in the other post doesn't apply to data that must be read from the filesystem. If you were piping the data, I don't doubt that the threading slowed it down, since "reading" really meant copying bytes from one RAM location to another. The point is that when you read from a disk the read operations take long enough so that all of the thread coordination and data-processing happens during those periods.

Kevin Barry

Last edited by ta0kira; 08-05-2013 at 04:15 PM.
 
Old 08-06-2013, 04:15 AM   #37
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Original Poster
Rep: Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269
I tried piping data in as well.

1000 MB:
1 threads = 137 MB/s
4 threads = 118 MB/s
 
Old 08-06-2013, 10:13 AM   #38
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by H_TeXMeX_H View Post
I tried piping data in as well.

1000 MB:
1 threads = 137 MB/s
4 threads = 118 MB/s
That doesn't sound right. Threading alone doesn't make things 8x slower. Can you post your code? You must be doing a malloc every iteration or something.

Kevin Barry
 
Old 08-06-2013, 11:06 AM   #39
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Original Poster
Rep: Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269
Code:
// output character statistics
#include <stdio.h>
#include <math.h>
#include <stdint.h>
#include <pthread.h>

#define UINT8_TOT (UINT8_MAX + 1) // = 256

struct data
{
	unsigned long count[UINT8_TOT];
	uint64_t * cluster;
};

void *counter (void *ptr)
{
	// ! init
	unsigned int i;

	// init
	struct data * d = (struct data *) ptr;

	for (i = 0; i < BUFSIZ / 4; i++)
	{
		d->count[ d->cluster[i]        & 0xff]++;
		d->count[(d->cluster[i] >>  8) & 0xff]++;
		d->count[(d->cluster[i] >> 16) & 0xff]++;
		d->count[(d->cluster[i] >> 24) & 0xff]++;
		d->count[(d->cluster[i] >> 32) & 0xff]++;
		d->count[(d->cluster[i] >> 40) & 0xff]++;
		d->count[(d->cluster[i] >> 48) & 0xff]++;
		d->count[(d->cluster[i] >> 56) & 0xff]++;
	}

}

int main (void)
{
	// ! init
	uint64_t cluster[BUFSIZ];
	unsigned int i;
	unsigned int ret;
	pthread_t thread1, thread2, thread3, thread4;

	// init
	struct data d1 = {};
	struct data d2 = {};
	struct data d3 = {};
	struct data d4 = {};

	// input loop
	while ((ret = fread (cluster, 8, BUFSIZ, stdin)))
	{
		d1.cluster = cluster;
		pthread_create (&thread1, NULL, counter, (void*) &d1);
		d2.cluster = cluster + (BUFSIZ / 4);
		pthread_create (&thread2, NULL, counter, (void*) &d2);
		d3.cluster = cluster + (BUFSIZ / 4) + (BUFSIZ / 4);
		pthread_create (&thread3, NULL, counter, (void*) &d3);
		d4.cluster = cluster + (BUFSIZ / 4) + (BUFSIZ / 4) + (BUFSIZ / 4);
		pthread_create (&thread4, NULL, counter, (void*) &d4);

		pthread_join (thread1, NULL);
		pthread_join (thread2, NULL);
		pthread_join (thread3, NULL);
		pthread_join (thread4, NULL);
	}

	for (i = 0; i < UINT8_TOT; i++)
	{
		d1.count[i] += d2.count[i] + d3.count[i] + d4.count[i];
	}

	// total calc
	for (i = 0; i < UINT8_TOT; i++)
	{
		printf ("%3u\t%lu\n", i, d1.count[i]);
	}

	return 0;
}
 
Old 08-06-2013, 11:42 AM   #40
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
There are several problems with your implementation:
  1. Creating a new set of threads every iteration will destroy your performance because you're effectively creating a new process that shares the same process image each time. You need the pthread_barrier_t because the threads are supposed to persist, and therefore must start and stop.
  2. The code you posted will miss the end of the block if the block size isn't a multiple of the number of threads, and it will process garbage if less-than a full block is read.
  3. You haven't initialized your count arrays.
Here's my version:
Code:
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <assert.h>
#include <string.h>
#include <pthread.h>


#define BUFFER_SIZE (64*4096)

#define BARRIER_WAIT \
{ int result = pthread_barrier_wait(&barrier); \
  assert(!result || result == PTHREAD_BARRIER_SERIAL_THREAD); }


static struct {
    unsigned char *buffer;
    unsigned long  size;
} data;


typedef struct {
    unsigned long *counts;
    unsigned int thread, threads;
} thread_spec;


static pthread_barrier_t barrier;

static void *thread_loop(thread_spec*);



int main(int argc, char *argv[])
{
    if (argc < 3)
    {
    fprintf(stderr, "%s [threads] [filename(s)...]\n", argv[0]);
    return 1;
    }


    int threads = 0;
    char error = 0;
    ssize_t read_size = 0;

    int I, J, parity = 0;


    if (sscanf(argv[1], "%i%c", &threads, &error) != 1 || threads < 0)
    {
    fprintf(stderr, "%s: invalid thread count '%s'\n", argv[0], argv[1]);
    return 1;
    }


    unsigned char *buffer1 = NULL, *buffer2 = NULL;

    assert( buffer1 = (unsigned char*) malloc(BUFFER_SIZE) );
    assert( buffer2 = (unsigned char*) malloc(BUFFER_SIZE) );


    pthread_t      *all_threads = NULL;
    unsigned long **all_counts  = NULL;


    if (threads)
    {
    assert( pthread_barrier_init(&barrier, NULL, threads + 1) == 0 );
    assert( all_threads = (pthread_t*) malloc(threads * sizeof(pthread_t)) );
    assert( all_counts = (unsigned long**) malloc(threads * sizeof(unsigned long*)) );

    for (I = 0; I < threads; I++)
     {
    thread_spec *new_spec = (thread_spec*) malloc(sizeof(thread_spec));
    assert(new_spec);

    assert( all_counts[I] = new_spec->counts = (unsigned long*) calloc(256, sizeof(unsigned long)) );
    new_spec->thread  = I;
    new_spec->threads = threads;

    assert( pthread_create(all_threads + I, NULL, (void*(*)(void*)) &thread_loop, (void*) new_spec) == 0 );
     }
    }


    else
    {
    assert( all_counts = (unsigned long**) malloc(sizeof(unsigned long*)) );
    assert( all_counts[0] = (unsigned long*) calloc(256, sizeof(unsigned long)) );
    }


    for (I = 2; I < argc; I++)
    {
    int current_file = open(argv[I], O_RDONLY);
    if (current_file < 0)
     {
    fprintf(stderr, "%s: unable to open file '%s': %s\n", argv[0], argv[I], strerror(errno));
    continue;
     }


    while ( (read_size = read(current_file, parity? buffer1 : buffer2, BUFFER_SIZE)) > 0 )
     {
    if (threads)
      {
    BARRIER_WAIT

    data.buffer = parity? buffer1 : buffer2;
    data.size   = read_size;
    parity      = !parity;

    BARRIER_WAIT
      }

    else
      {
    for (J = 0; J < read_size; J++) ++all_counts[0][ (parity? buffer1 : buffer2)[J] ];
      }
     }

    close(current_file);
    }


    if (threads) BARRIER_WAIT


    for (I = 0; I < 256; I++)
    {
    for (J = 1; J < threads; J++)
    all_counts[0][I] += all_counts[J][I];

    fprintf(stdout, "%i\t%lu\n", I, all_counts[0][I]);
    }


    if (threads)
    {
    for (I = 0; I < threads; I++)
     {
    pthread_cancel(all_threads[I]);
    free(all_counts[I]);
     }

    free(all_counts);
    free(all_threads);
    pthread_barrier_destroy(&barrier);
    }

    else
    {
    free(all_counts[0]);
    free(all_counts);
    }


    return 0;
}



static void *thread_loop(thread_spec *spec)
{
    assert(spec);

    unsigned long *counts = spec->counts;
    unsigned int thread = spec->thread, threads = spec->threads;

    assert(counts);
    assert(threads);

    int I;

    BARRIER_WAIT


    while (1)
    {
    BARRIER_WAIT

    for (I = thread; I < data.size; I += threads) ++counts[ data.buffer[I] ];

    BARRIER_WAIT
    }


    //(we'll never get here)
    free(spec);

    return NULL;
}
Name it count.c and build it with gcc -Wall -g -O3 count.c -lpthread -o count. To run it:
Code:
$ ./count 0 filename #(no worker threads; just the main thread)
$ ./count n filename #(n worker threads)
I have a dual-core hyperthreaded i5, and for a 1GB file on the hard-drive I get the best performance with 2 threads: about 0.65s with 2 threads, 0.91s with 0 threads. With 1 thread or 4+ threads I get slightly-worse performance than with 0 threads. I get very different results without -O3, though.

edit: When piping 1GB using dd from either /dev/zero or /dev/urandom, it takes 1.3s with 2 threads and 1.8s with 0 threads.

Kevin Barry

Last edited by ta0kira; 08-06-2013 at 01:40 PM. Reason: fixed a few iteration bugs, added some NULL checks
 
1 members found this post helpful.
Old 08-06-2013, 01:52 PM   #41
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Original Poster
Rep: Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269
Now that is faster. I couldn't figure out how to use pthread_barrier_wait, so I resorted to join instead.

Your code is faster, especially with 3 threads, 0.315s versus 0.434s.

However, it is the same as fast when piping from a PRNG, both 142 MB/s.
 
Old 08-06-2013, 02:14 PM   #42
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by H_TeXMeX_H View Post
Now that is faster. I couldn't figure out how to use pthread_barrier_wait, so I resorted to join instead.
pthread_join waits for the thread to exit, whereas pthread_barrier_wait causes the calling thread to block until a specific number of other threads are blocked on the same barrier. In the case of my code (and template further above,) the barrier needs n+1 threads to block on it, at which point all of the threads can continue. It's like saying, "there are n+1 people in our hiking group; let's all wait here until the rest catch up."
Quote:
Originally Posted by H_TeXMeX_H View Post
However, it is the same as fast when piping from a PRNG, both 142 MB/s.
Keep in mind that the read speed is what dictates whether or not a particular number of threads is effective. It's possible that the PRNG can't generate data as quickly as the counter program can process it.

Kevin Barry

Last edited by ta0kira; 08-06-2013 at 02:24 PM.
 
1 members found this post helpful.
Old 08-06-2013, 03:01 PM   #43
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Original Poster
Rep: Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269
Alright, thanks again for the info, I'll have to practice using pthread_barrier_wait as it seems very useful and maybe I can optimize other code using it.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Need help with a simple If/Else code in C Exsiss Programming 4 10-02-2007 03:37 AM
optimize socket networking code powah Linux - Networking 0 09-26-2006 05:47 PM
Interpret this simple C code Chase_G Programming 4 04-29-2005 10:07 AM
to become giddy in simple code Hamid Moradmand Programming 2 05-18-2004 07:23 PM
How to optimize source code installation? Waldi Slackware 4 08-05-2003 01:47 PM


All times are GMT -5. The time now is 02:00 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration