Is this a good sample program for testing performance using multiple threads vs single thread?
I am learning about threads and how to use them in my programs. I want a small program that does the same job with and without threads so I can compare their performance. I wrote a program that computes the factorial of 20 one hundred million times.
Without threads:
Code:
#include <stdio.h>
[rest of the listing not preserved]
With threads:
Code:
#include <pthread.h>
[rest of the listing not preserved]
Quote:
|
I would rather use exactly the same code and just enter the number of threads to use. Next I would try a more difficult calculation: the Mandelbrot set (really compute-heavy, but more interesting) or the Game of Life, which can be implemented relatively easily.
|
What do you mean by using the same code? For example, compute factorial of 20 one hundred million times in every thread separately instead of fifty million times in each?
|
No, I mean you will have only one executable and you will enter the number of threads, something like this:
Code:
time app 1    # app will use only one thread
time app 6    # app will use 6 threads
and time will report the running time. The goal is to complete exactly the same job (using one or more threads). |
No, it is not really right.
It doesn't do anything, and you don't test that it is correct and doing the same work. Also, you are passing the address of a local stack variable, which is AFAIK a no-no. (I think)
Code:
unsigned long long t = count / 2;
|
It makes no sense calling pthread_exit from your main. You should just return a small int from main (return 0).
You should be joining your threads; you're not properly testing anything if you don't. Joining the threads also removes any possible problem with passing a stack variable from main. |
I got interested. Here is an example you can look at if you like; I've changed it to just sum from 1..n to keep the numbers from overflowing.
It's a bit verbose, telling what each result is, and there is also a check at the end for the correct result. It's a bit of a mess (I am supposed to be working ;-). Example:
one thread:
Code:
#include <pthread.h>
[rest of the listing not preserved]
|
That looks much better. A plain sum (or simple multiplication) is quite fast; remember the CPU runs at about 2 GHz, which means summing 1 000 000 000 numbers will take about a second. So we need something else (and it should be able to run using 1 or more threads).
|
bigearsbilly has the right idea. You want to divide the workload among the threads and then combine the data once they are done. This is the most efficient way of doing it. Sometimes you have to use mutexes when the threads are working on the same data, but this has significant overhead.
Although pthreads is great and has low overhead, in some cases OpenMP is easier to implement, just as fast, and has special features. https://computing.llnl.gov/tutorials/openMP/ I have used both and they are both useful. You just need to decide which one is easier and better to use for a project. |
Interesting, on a crappy old dual core Athlon:
2 threads completes in about half the time of 1 thread (wall clock).
3 threads and we go back to the single-thread time.
19 threads and we go back to the 2-thread time again!! (wtf??)
But more interesting still are some other metrics: the user time is roughly the same for 2 threads as for 1 (using both CPUs, I guess), so you are effectively burning the same amount of CPU time, just spread out.
Interesting are the context switches:
1 thread is 190
2 is 1300
19 is ~5000!
It makes me wonder, then, whether on a loaded system (unlike mine) threads would prove inefficient and counterproductive, with all those context switches going on. More experiments to come; I have a quad core Celeron coming soon. Code:
$ cat sum.1420741073.log
[log output not preserved]
edit: notice the user time for 3 threads is really high, so very inefficient, but look at 19 threads: the user time is low again! I think maybe, with a lot of threads, the data is divided up small enough to do in fewer moves. Very odd. |
bigearsbilly, these are not really good measurements, because the OS runs a lot of other things (daemons, the GUI, whatever), and therefore the user and elapsed times depend heavily on the load of the box. You need to take this into account too.
|
This might serve as a benchmark:
http://sourceforge.net/projects/rand...?source=navbar With 4 threads versus 1 thread you get 3.46 times faster. So there is overhead no matter what; you'll never get 4 times faster. |
Yeah, I know that; it's the same on all systems. It's a guideline.
If you take enough samples it smooths out. I've been here all afternoon doing it; it doesn't vary much. |
I still won't use them at work. I prefer nice simple processes communicating via stdio and files,
as we are on virtual servers anyway. With my system at work I can stick everything on a single machine or distribute it over multiple servers connected by NFS. Want to share memory? /dev/shm. |
This is a purely CPU-bound workload: it doesn't do any I/O. It doesn't do much moving of data around, either.
Therefore, on a 4-core system which doesn't have anything else to do, you ought to see a slightly less than 4x speedup ... and you do. Each core will run a thread, regularly soaking up full time slices as the chip gradually overheats. ;) |