LinuxQuestions.org
Help answer threads with 0 replies.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 01-08-2015, 02:09 AM   #1
average_user
Member
 
Registered: Dec 2010
Location: Warsaw, Poland
Distribution: Slackware
Posts: 560

Rep: Reputation: 220Reputation: 220Reputation: 220
Is this a good sample program for testing performance using multiple threads vs singl


I am learning about threads and how to use them in my programs. I want to have a small program that would do the same without using threads and using threads and compare their performance. I wrote a program that computes factorial of 20 one hundred million times.

Without threads:

Code:
#include <stdio.h>

int main()
{
        int c, n = 1;
        unsigned long long fact = 1;

        int i;
        n = 20;

        for (i = 0; i <= 100000000; ++i)
                {
                        fact = 1;
                        for (c = 1; c <= n; c++)
                                {
                                        fact = fact * c;
                                }
                }

        return 0;
}
With threads:

Code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *first(void *arg)
{
        unsigned long long *count = (unsigned long long *)arg;
        unsigned long long fact = 1;
        int c = 1;
        int n = 20;
        unsigned i;;

        for (i = 0; i <= *count; ++i)
                {
                        fact = 1;
                        for (c = 1; c <= n; c++)
                                {
                                        fact = fact * c;
                                }
                }
}

void *second(void *arg)
{
        unsigned long long *count = (unsigned long long *)arg;
        unsigned long long fact = 1;
        int c = 1;
        int n = 20;
        unsigned int i;

        for (i = 0; i <= *count; ++i)
                {
                        fact = 1;
                        for (c = 1; c <= n; c++)
                                {
                                        fact = fact * c;
                                }
                }
}

int main (void)
{
        pthread_t one;
        pthread_t two;
        unsigned long long count = 100000000;

        unsigned long long t = count / 2;
        pthread_create(&one, NULL, first, (void *)&t);
        pthread_create(&two, NULL, second, (void *)&t);

        pthread_exit(NULL);
}
I compiled both these programs using gcc without -O. Execution time given by bash time utility on my very slow netbook with 2 core AMD C-50 CPU are as follows:

without threads:
Quote:
$ time ./main

real 0m34.708s
user 0m34.645s
sys 0m0.001s
with threads:

Quote:
$ time ../../pthreads

real 0m18.950s
user 0m34.570s
sys 0m0.046s
On Hostgator shared account results are 14s for a version without threads and 7s with threads. I wonder if this is a good example for testing performance using threads. These 2 programs are doing basically the same the same number of times, right? Did I miss something, some OS/compiler hidden optimization or is this a good example what threads can do? I am learning and I want to make sure that there are some places in real life uses when threads can make your program 2x faster. I realize that there will be many issues with locking access to resources, racing conditions etc. in serious programs but I don't want to worry about them at the moment.
 
Old 01-08-2015, 02:21 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,687

Rep: Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274
I would rather use exactly the same code and only enter the number of threads used. Next I would do a more difficult calculation (I would say the Mandelbrot set - that is really complex, but more interesting - or the life simulation, which can be implemented relatively easily)
 
Old 01-08-2015, 02:30 AM   #3
average_user
Member
 
Registered: Dec 2010
Location: Warsaw, Poland
Distribution: Slackware
Posts: 560

Original Poster
Rep: Reputation: 220Reputation: 220Reputation: 220
What do you mean by using the same code? For example, compute factorial of 20 one hundred million times in every thread separately instead of fifty million times in each?
 
Old 01-08-2015, 02:36 AM   #4
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,687

Rep: Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274
no, I mean you will have only one executable and you will enter the number of threads, something like this:
time app 1 # app will use only one thread
time app 6 # app will use 6 threads
and time will report the running time
the goal is to complete exactly the same job (using one or more threads)
 
Old 01-08-2015, 06:26 AM   #5
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239Reputation: 239Reputation: 239
No it is not really right.
It doesn't do anything and you do not test that it is correct and doing the same.

Also you are passing in a local address of a stack variable, which is AFAR a no-no. (I think)

unsigned long long t = count / 2;
 
Old 01-08-2015, 08:51 AM   #6
SoftSprocket
Member
 
Registered: Nov 2014
Posts: 399

Rep: Reputation: Disabled
It makes no sense calling pthread_exit from your main. You should still return a small int from (0)

You should be joining your threads. You're not properly testing anything if you don't.

Joining the threads removes any possible problem with passing a stack variable in the main.
 
Old 01-08-2015, 09:51 AM   #7
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239Reputation: 239Reputation: 239
I got interested, here is an example you can look at if you like, i've changed it to just sum from 1..n to keep the numbers from overflowing.
It's a bit verbose telling what each result is, there is also a check at the end for the correct result.

It's a bit of a mess (I am supposed to be working ;-)

example:
Code:
one thread
$ time ./sum -n 1  500010000 
start = 0 end = 500010000 result = 125005000300005000
total=125005000300005000
answer=125005000300005000
    0m2.39s real     0m2.36s user     0m0.00s system

multiple numbers
$ time ./sum -n 1  100 100  
start = 0 end = 100 result = 5050
total=5050
answer=5050
start = 0 end = 100 result = 5050
total=5050
answer=5050
    0m0.00s real     0m0.00s user     0m0.00s system

19 threads
$ time ./sum -n 19  500010000
start = 315795780 end = 342112094 result = 8656855426582155
start = 210530520 end = 236846834 result = 5886661685865255
start = 263163150 end = 289479464 result = 7271758556223705
start = 447377355 end = 473693669 result = 12119597602478280
start = 289479465 end = 315795779 result = 7964306991402930
start = 105265260 end = 131581574 result = 3116467945148355
start = 421061040 end = 447377354 result = 11427049167299055
start = 52632630 end = 78948944 result = 1731371074789905
start = 473693670 end = 500010000 result = 12812154037817385
start = 368428410 end = 394744724 result = 10041952296940605
start = 0 end = 26316314 result = 346274204431455
start = 394744725 end = 421061039 result = 10734500732119830
start = 26316315 end = 52632629 result = 1038822639610680
start = 157897890 end = 184214204 result = 4501564815506805
start = 342112095 end = 368428409 result = 9349403861761380
start = 236846835 end = 263163149 result = 6579210121044480
start = 78948945 end = 105265259 result = 2423919509969130
start = 184214205 end = 210530519 result = 5194113250686030
start = 131581575 end = 157897889 result = 3809016380327580
total=125005000300005000
answer=125005000300005000
    0m1.40s real     0m2.69s user     0m0.01s system



Code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef  unsigned long long ULL;

typedef struct {
                ULL start;
                ULL end;
                ULL result;
} fdata;

void data_dump(fdata * A)
{
        printf("start = %llu end = %llu result = %llu\n",
                A->start, A->end, A->result);
}

typedef void * (*thread)(void *);
void * sum(fdata *A)
{
        A->result = 0;
        ULL i;

        for (i = A->start; i <= A->end; ++i) { 
                A->result += i;
        }
        data_dump(A);
        return NULL;
}
int moin (int nr_threads, ULL number)
{
    int n;
    ULL total = 0;

    pthread_t threads[nr_threads];
    fdata A[nr_threads];


    for ( n=0; n < nr_threads; n++) {
        ULL gap = number / nr_threads;
        A[n].start = n * gap;
        A[n].end = A[n].start + gap -1;
        pthread_create(&threads[n], NULL, (thread) sum, A+n );
    }
    A[n-1].end = number;

    for ( n=0; n < nr_threads; n++) {
        pthread_join(threads[n], NULL);
    }

    for ( n=0; n < nr_threads; n++) {
        total += A[n].result;
    }


    printf("total=%llu\n", total);
    printf("answer=%llu\n", (1 + number) * number / 2);
    return 0;
}
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
static char * progname;

void usage(char * message)
{
    printf("\n\tDoh! %s\n\n", message);
    printf("\n\tcalculate sum from 1..n using varying amount of threads, default is 1\n");
    printf("\n\tusage: %s -n # n [n1 ...]\n", progname);
    printf("\n"
    "\t -n number of threads\n"
    "\n");

    exit(1);
}

int main (int argc, char ** argv, char ** envp) 
{

    int ch;
    char *p, ** pp = argv;
    progname = *pp;
    int nr_threads = 1;

    if (argc == 1) usage("no args");
    while((ch = getopt(argc, argv, "n:")) != -1) {

             switch(ch) {

             case 'n':
                     nr_threads = atoi(optarg);
                     break;


             default:
                    usage("Doh!");

             }
    }

    for ( pp += optind; (p = *pp); ++pp) {
        moin(nr_threads, atoll(p));
    }


     return 0;

}
Attached Files
File Type: txt sum.txt (1.9 KB, 9 views)

Last edited by bigearsbilly; 01-08-2015 at 10:36 AM.
 
1 members found this post helpful.
Old 01-08-2015, 10:51 AM   #8
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,687

Rep: Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274
that looks much better, just sum (or simple multiplication) is quite fast (remember CPU is at about 2 GHz, that means sum up 1 000 000 000 numbers will take about a second. So we need something else (and that should be able to run also using 1 or more threads).
 
Old 01-08-2015, 12:30 PM   #9
metaschima
Senior Member
 
Registered: Dec 2013
Distribution: Slackware
Posts: 1,982

Rep: Reputation: 492Reputation: 492Reputation: 492Reputation: 492Reputation: 492
bigearsbilly has the right idea. You want to divide the workload among the threads and then combine the data once they are done. This is the most efficient way of doing it. Sometimes you have to use mutexes when the threads are working on the same data, but this has significant overhead.

Although pthreads is great and has low overhead, in some cases using openmp is easier to implement and just as fast and it has special features.
https://computing.llnl.gov/tutorials/openMP/

I have used both and they are both useful. You just need to decide which one is easier and better to use for a project.
 
Old 01-08-2015, 12:35 PM   #10
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239Reputation: 239Reputation: 239
Interesting, on a crappy old dual core Athlon:

2 threads completes in about half the time of 1 thread (wall clock)
3 threads we go back to the single thread time. 19 threads we go back to the 2 thread time again!! (wtf??)

But more interesting still
are some other metrics:

The user time is roughly the same for 2 and 1 thread (using both CPUs i guess) so you are effectively burning the same amount of CPU time
just spread out.
Interestingly are the context switches 1 thread is 190 2 is 1300 19 is ~5000 !

It makes me wonder then if you have a loaded system (unlike mine) if threads would prove to be inefficient and counter productive
with all thos context switches going on.

More experiments to come, I have a quad core celeron coming soon.


Code:
$ cat sum.1420741073.log
        Command being timed: "./sum -n1 987654321 987654321 987654321"
        User time (seconds): 13.76
        System time (seconds): 0.00
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.79
        Minor (reclaiming a frame) page faults: 299
        Voluntary context switches: 4
        Involuntary context switches: 190

        Command being timed: "./sum -n2 987654321 987654321 987654321"
        User time (seconds): 14.23
        System time (seconds): 0.00
        Percent of CPU this job got: 181%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.84
        Maximum resident set size (kbytes): 1008
        Minor (reclaiming a frame) page faults: 301
        Voluntary context switches: 7
        Involuntary context switches: 1328

        Command being timed: "./sum -n3 987654321 987654321 987654321"
        User time (seconds): 26.26
        System time (seconds): 0.02
        Percent of CPU this job got: 190%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.77
        Maximum resident set size (kbytes): 1020
        Minor (reclaiming a frame) page faults: 304
        Voluntary context switches: 6
        Involuntary context switches: 3271


        Command being timed: "./sum -n19 987654321 987654321 987654321"
        User time (seconds): 14.63
        System time (seconds): 0.02
        Percent of CPU this job got: 194%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.53
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1100
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 398
        Voluntary context switches: 30
        Involuntary context switches: 4837

edit: notice the user time for 3 threads is really high, so very ineficient but look at 19 threads we see that the user time is low again! I think maybe as with a lot of threads the data is divided up small enough to do in fewer moves.
very odd.

Last edited by bigearsbilly; 01-08-2015 at 12:43 PM.
 
Old 01-08-2015, 12:41 PM   #11
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,687

Rep: Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274Reputation: 7274
bigearsbilly this is not really good measurements because the os runs a lot of other apps (daemons, gui, whatever), and therefore the user times/elapsed times highly depends on the load of the box. You need to take this into account too.
 
Old 01-08-2015, 12:47 PM   #12
metaschima
Senior Member
 
Registered: Dec 2013
Distribution: Slackware
Posts: 1,982

Rep: Reputation: 492Reputation: 492Reputation: 492Reputation: 492Reputation: 492
This might serve as a benchmark:
http://sourceforge.net/projects/rand...?source=navbar

With 4 threads versus 1 thread you get 3.46 times faster. So, there is overhead no matter what. You'll never get 4 times faster.

Last edited by metaschima; 01-08-2015 at 12:49 PM.
 
Old 01-08-2015, 12:55 PM   #13
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239Reputation: 239Reputation: 239
yeah I know that, same as all systems. Its a guideline.
If you take enough samples it smooths out.

I've been here all afternoon doing it, it doesn't vary much.
 
Old 01-08-2015, 01:01 PM   #14
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239Reputation: 239Reputation: 239
I still won't use them at work. I prefer nice simple processes communicating with stdio files.
As we are on virtual servers anyway. My system at work I can stick everything on a single machine or distribute
it over multiple servers connected by NFS. Want to share memory? /dev/shm.
 
Old 01-08-2015, 07:40 PM   #15
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,610
Blog Entries: 4

Rep: Reputation: 3905Reputation: 3905Reputation: 3905Reputation: 3905Reputation: 3905Reputation: 3905Reputation: 3905Reputation: 3905Reputation: 3905Reputation: 3905Reputation: 3905
This is a purely CPU-bound workload: it doesn't do any I/O. It doesn't do much moving of data around, either.

Therefore, on a 4-core system, which doesn't have anything else to do, you ought to see a slightly-less-than 4x speedup ... and you do. Each core will run a thread, regularly soaking-up full time slices as the chip gradually overheats.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
What would be a good network performance & monitoring program? jbruyet Linux - Software 3 10-14-2011 02:49 PM
Good network performance/stress test program??? dbrazeau Linux - Software 3 05-18-2010 05:56 PM
need a ioctl sample program supernaturalyasir Programming 1 09-09-2009 08:12 AM
testing iptables performance testing pavan.daemon Linux - Networking 2 09-28-2007 05:22 PM
Sample Multiple Distros on 1 CD hucphin Linux - Newbie 3 11-17-2006 11:50 AM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 10:42 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration