Programming
This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
You should try your program with a huge file that's actually on the filesystem. If your process is running at less than 100% CPU, then there's a good chance you're needlessly optimizing.
Yeah, you're right, it uses less than 100%, so it may be needless optimization. However, I often pipe data from a fast PRNG to this program.
- Release all threads to count bytes for their assigned sections of the input buffer.
- While the threads are counting, read the next section into a separate buffer.
- Repeat until the input is exhausted.
- Sum up the counts from each of the threads and print the output.
This is something I did recently for a different sort of file analysis and it worked well. That involved looking up words from the file in a binary tree and tens-of-thousands of floating-point computations per byte, but the essential pattern is the same.
But, like I implied in my previous post, your bottleneck is probably going to be reading input fast enough, not the algorithm processing the data. I think you'd get maximum performance with one thread for reading and one thread for counting, with the read thread one iteration ahead and a pthread barrier to make sure they stay synchronized.
It's actually a little tricky to avoid race conditions when the reading and processing happen in different threads. Here's the basic model I used most recently. The main goal of this model is to make the total processing time identical to the time it takes to read the input (plus a constant for postprocessing and output). In your case, you will probably achieve that with 1 worker thread.
Global data:
Code:
//data that will be operated on, e.g. a buffer of input data
static struct { /*todo: stuff*/ } data;
static pthread_barrier_t barrier; //each thread blocks on this until all threads have done so
Main thread:
Code:
//todo: initialize 'barrier' (barrier count needs to be #worker-threads + 1)
//todo: start worker threads with 'pthread_create'
while (/*todo: read a single block of data*/)
{
//todo: preprocess the input if necessary
//first iteration: wait for the worker threads to initialize
//subsequent iterations: wait for the worker threads to finish the last iteration
pthread_barrier_wait(&barrier); //[A], [C]
//(worker threads immediately block again on the barrier at [B])
//todo: update 'data' as necessary for the next iteration, while the worker threads are stopped
//(in this case, you might have 2 buffers that you alternate assigning to 'data')
//wait for the worker threads to get to the "starting line"
pthread_barrier_wait(&barrier); //[B]
//(at this point the worker threads are working, and the main thread can read the next block)
}
//wait for the worker threads to finish the last iteration
pthread_barrier_wait(&barrier); //[C], or [A] if the input is empty
//todo: postprocessing, output, etc.
//todo: 'pthread_cancel' all threads, but don't 'pthread_join' since they don't actually return
Worker threads:
Code:
//todo: initialize thread data (e.g. buffers)
//wait for all other worker threads to initialize
pthread_barrier_wait(&barrier); //[A]
while (1)
{
//wait for the other worker threads to get to the "starting line"
pthread_barrier_wait(&barrier); //[B]
//todo: process the data here!
//wait for the other worker threads to finish this iteration
pthread_barrier_wait(&barrier); //[C]
}
I actually generalized this a bit more and used function pointers so I could use the worker threads for different tasks at different points in the code. For me, figuring out this sort of pattern was the hardest part about learning threads. I'm not sure if this is a common pattern, but it's the one I usually use and it seems to work well.
Kevin Barry
Last edited by ta0kira; 08-05-2013 at 09:49 AM.
Reason: added "todo" to make places in need of code more explicit
I have learned from it, though; it seems pthread is much easier to use and understand than OpenMP.
There isn't a single "100MB" scenario. Was it read from the filesystem? Was it piped? Was it a file that you've been repeatedly testing your program on (increasing the chances of it being entirely cached in RAM)? The linear scaling you demonstrated in the other post doesn't apply to data that must be read from the filesystem. If you were piping the data, I don't doubt that the threading slowed it down, since "reading" really meant copying bytes from one RAM location to another. The point is that when you read from a disk the read operations take long enough so that all of the thread coordination and data-processing happens during those periods.
That doesn't sound right. Threading alone doesn't make things 8x slower. Can you post your code? You must be doing a malloc every iteration or something.
There are several problems with your implementation:
- Creating a new set of threads every iteration will destroy your performance because you're effectively creating a new process that shares the same process image each time. You need the pthread_barrier_t because the threads are supposed to persist, and therefore must start and stop.
- The code you posted will miss the end of the block if the block size isn't a multiple of the number of threads, and it will process garbage if less than a full block is read.
- You haven't initialized your count arrays.
Here's my version:
Code:
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <assert.h>
#include <string.h>
#include <pthread.h>

#define BUFFER_SIZE (64*4096)

#define BARRIER_WAIT \
{ int result = pthread_barrier_wait(&barrier); \
  assert(!result || result == PTHREAD_BARRIER_SERIAL_THREAD); }

static struct {
    unsigned char *buffer;
    unsigned long size;
} data;

typedef struct {
    unsigned long *counts;
    unsigned int thread, threads;
} thread_spec;

static pthread_barrier_t barrier;

static void *thread_loop(thread_spec*);

int main(int argc, char *argv[])
{
    if (argc < 3)
    {
        fprintf(stderr, "%s [threads] [filename(s)...]\n", argv[0]);
        return 1;
    }

    int threads = 0;
    char error = 0;
    ssize_t read_size = 0;
    int I, J, parity = 0;

    if (sscanf(argv[1], "%i%c", &threads, &error) != 1 || threads < 0)
    {
        fprintf(stderr, "%s: invalid thread count '%s'\n", argv[0], argv[1]);
        return 1;
    }

    unsigned char *buffer1 = NULL, *buffer2 = NULL;
    assert( buffer1 = (unsigned char*) malloc(BUFFER_SIZE) );
    assert( buffer2 = (unsigned char*) malloc(BUFFER_SIZE) );

    pthread_t *all_threads = NULL;
    unsigned long **all_counts = NULL;

    if (threads)
    {
        assert( pthread_barrier_init(&barrier, NULL, threads + 1) == 0 );
        assert( all_threads = (pthread_t*) malloc(threads * sizeof(pthread_t)) );
        assert( all_counts = (unsigned long**) malloc(threads * sizeof(unsigned long*)) );
        for (I = 0; I < threads; I++)
        {
            thread_spec *new_spec = (thread_spec*) malloc(sizeof(thread_spec));
            assert(new_spec);
            assert( all_counts[I] = new_spec->counts = (unsigned long*) calloc(256, sizeof(unsigned long)) );
            new_spec->thread = I;
            new_spec->threads = threads;
            assert( pthread_create(all_threads + I, NULL, (void*(*)(void*)) &thread_loop, (void*) new_spec) == 0 );
        }
    }
    else
    {
        assert( all_counts = (unsigned long**) malloc(sizeof(unsigned long*)) );
        assert( all_counts[0] = (unsigned long*) calloc(256, sizeof(unsigned long)) );
    }

    for (I = 2; I < argc; I++)
    {
        int current_file = open(argv[I], O_RDONLY);
        if (current_file < 0)
        {
            fprintf(stderr, "%s: unable to open file '%s': %s\n", argv[0], argv[I], strerror(errno));
            continue;
        }
        while ( (read_size = read(current_file, parity? buffer1 : buffer2, BUFFER_SIZE)) > 0 )
        {
            if (threads)
            {
                BARRIER_WAIT
                data.buffer = parity? buffer1 : buffer2;
                data.size = read_size;
                parity = !parity;
                BARRIER_WAIT
            }
            else
            {
                for (J = 0; J < read_size; J++) ++all_counts[0][ (parity? buffer1 : buffer2)[J] ];
            }
        }
        close(current_file);
    }

    if (threads) BARRIER_WAIT

    for (I = 0; I < 256; I++)
    {
        for (J = 1; J < threads; J++)
            all_counts[0][I] += all_counts[J][I];
        fprintf(stdout, "%i\t%lu\n", I, all_counts[0][I]);
    }

    if (threads)
    {
        for (I = 0; I < threads; I++)
        {
            pthread_cancel(all_threads[I]);
            free(all_counts[I]);
        }
        free(all_counts);
        free(all_threads);
        pthread_barrier_destroy(&barrier);
    }
    else
    {
        free(all_counts[0]);
        free(all_counts);
    }

    return 0;
}

static void *thread_loop(thread_spec *spec)
{
    assert(spec);
    unsigned long *counts = spec->counts;
    unsigned int thread = spec->thread, threads = spec->threads;
    assert(counts);
    assert(threads);
    int I;

    BARRIER_WAIT

    while (1)
    {
        BARRIER_WAIT
        for (I = thread; I < data.size; I += threads) ++counts[ data.buffer[I] ];
        BARRIER_WAIT
    }

    //(we'll never get here)
    free(spec);
    return NULL;
}
Name it count.c and build it with gcc -Wall -g -O3 count.c -lpthread -o count. To run it:
Code:
$ ./count 0 filename    #(no worker threads; just the main thread)
$ ./count n filename    #(n worker threads)
I have a dual-core hyperthreaded i5, and for a 1GB file on the hard drive I get the best performance with 2 threads: about 0.65s with 2 threads versus 0.91s with 0 threads. With 1 thread or 4+ threads I get slightly worse performance than with 0 threads. I get very different results without -O3, though.
edit: When piping 1GB using dd from either /dev/zero or /dev/urandom, it takes 1.3s with 2 threads and 1.8s with 0 threads.
Kevin Barry
Last edited by ta0kira; 08-06-2013 at 12:40 PM.
Reason: fixed a few iteration bugs, added some NULL checks
Now that is faster. I couldn't figure out how to use pthread_barrier_wait, so I resorted to join instead.
pthread_join waits for the thread to exit, whereas pthread_barrier_wait causes the calling thread to block until a specific number of other threads are blocked on the same barrier. In the case of my code (and the template further above), the barrier needs n+1 threads to block on it, at which point all of the threads can continue. It's like saying, "there are n+1 people in our hiking group; let's all wait here until the rest catch up."
Quote:
Originally Posted by H_TeXMeX_H
However, it is the same as fast when piping from a PRNG, both 142 MB/s.
Keep in mind that the read speed is what dictates whether or not a particular number of threads is effective. It's possible that the PRNG can't generate data as quickly as the counter program can process it.
Alright, thanks again for the info, I'll have to practice using pthread_barrier_wait as it seems very useful and maybe I can optimize other code using it.