Is this a good sample program for testing performance using multiple threads vs single thread?
I am learning about threads and how to use them in my programs. I want a small program that does the same job with and without threads so I can compare their performance. I wrote a program that computes the factorial of 20 one hundred million times.
Without threads:
Code:
#include <stdio.h>
[rest of the listing not preserved]
With threads:
Code:
#include <pthread.h>
[rest of the listing not preserved]
Quote:
|
I would rather use exactly the same code and just enter the number of threads to use. Next I would try a more difficult calculation: the Mandelbrot set (really compute-heavy, but more interesting) or the Game of Life, which can be implemented relatively easily.
|
What do you mean by using the same code? For example, compute factorial of 20 one hundred million times in every thread separately instead of fifty million times in each?
|
No, I mean you will have only one executable and you will enter the number of threads, something like this:
Code:
time app 1    # app will use only one thread
time app 6    # app will use 6 threads
and time will report the running time. The goal is to complete exactly the same job (using one or more threads). |
No, it is not really right.
It doesn't do anything, and you don't test that it is correct and doing the same work. Also, you are passing the address of a local stack variable, which is AFAIK a no-no. (I think)
Code:
unsigned long long t = count / 2;
|
It makes no sense calling pthread_exit from your main. You should just return a small int from main (return 0).
You should be joining your threads; you're not properly testing anything if you don't. Joining the threads also removes any possible problem with passing a stack variable from main. |
I got interested. Here is an example you can look at if you like; I've changed it to just sum from 1..n to keep the numbers from overflowing.
It's a bit verbose, telling what each result is, and there is also a check at the end for the correct result. It's a bit of a mess (I am supposed to be working ;-). Example:
one thread:
Code:
#include <pthread.h>
[rest of the listing not preserved]
|
That looks much better. A plain sum (or simple multiplication) is quite fast; remember the CPU runs at about 2 GHz, which means summing 1 000 000 000 numbers will take about a second. So we need something else (and it should be able to run using 1 or more threads).
|
bigearsbilly has the right idea. You want to divide the workload among the threads and then combine the data once they are done. This is the most efficient way of doing it. Sometimes you have to use mutexes when the threads are working on the same data, but this has significant overhead.
Although pthreads is great and has low overhead, in some cases OpenMP is easier to implement, just as fast, and has special features. https://computing.llnl.gov/tutorials/openMP/ I have used both and they are both useful. You just need to decide which one is easier and better to use for a project. |
Interesting, on a crappy old dual core Athlon:
2 threads completes in about half the time of 1 thread (wall clock).
3 threads and we go back to the single-thread time.
19 threads and we go back to the 2-thread time again!! (wtf??)
But more interesting still are some other metrics: the user time is roughly the same for 2 threads as for 1 (using both CPUs, I guess), so you are effectively burning the same amount of CPU time, just spread out.
Interesting are the context switches:
1 thread is 190
2 is 1300
19 is ~5000!
It makes me wonder, then, whether on a loaded system (unlike mine) threads would prove inefficient and counterproductive, with all those context switches going on. More experiments to come; I have a quad core Celeron coming soon. Code:
$ cat sum.1420741073.log
[log output not preserved]
edit: notice the user time for 3 threads is really high, so very inefficient, but look at 19 threads: the user time is low again! I think maybe, with a lot of threads, the data is divided up small enough to do in fewer moves. Very odd. |
bigearsbilly, these are not really good measurements, because the OS runs a lot of other things (daemons, the GUI, whatever), and therefore the user and elapsed times depend heavily on the load of the box. You need to take this into account too.
|
This might serve as a benchmark:
http://sourceforge.net/projects/rand...?source=navbar With 4 threads versus 1 thread you get 3.46 times faster. So there is overhead no matter what; you'll never get 4 times faster. |
Yeah, I know that; it's the same on all systems. It's a guideline.
If you take enough samples it smooths out. I've been here all afternoon doing it; it doesn't vary much. |
I still won't use them at work. I prefer nice simple processes communicating via stdio and files,
as we are on virtual servers anyway. With my system at work I can stick everything on a single machine or distribute it over multiple servers connected by NFS. Want to share memory? /dev/shm. |
This is a purely CPU-bound workload: it doesn't do any I/O. It doesn't do much moving of data around, either.
Therefore, on a 4-core system which doesn't have anything else to do, you ought to see a slightly less than 4x speedup ... and you do. Each core will run a thread, regularly soaking up full time slices as the chip gradually overheats. ;) |