OProfile / OpenMP strange result
I want to know whether the strange results I am seeing from OProfile on a program using OpenMP are a flaw in OProfile (which is my best guess), in OpenMP (Intel 10 compiler), or in my program.
I am seeing wildly different numbers of OProfile samples between threads that ought to be the same (in one function, 100K samples vs. over 4 million).
The program has large sections that are single threaded and larger loops that are parallel with OpenMP. I set num_threads=6.
The major parallel loops each iterate at least 600K times when used and are each used at least 10K times. There is a moderate and fairly consistent amount of work in each step of each loop. So I would expect OpenMP to be able to distribute the work very evenly over six threads.
Across several very different loops, I see roughly the same distribution of samples across the six threads (roughly what I would expect if sampling were running at different rates on those six cores).
When I look with top, I often see the process taking 600% CPU time. If the sampling were correct, with two of those six threads hardly used, I would expect top never to show more than about 400% CPU time.
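The arithmetic behind that expectation can be sketched as follows. This is a rough model, not a measurement: it assumes each thread's busy time is proportional to its sample count, that the busiest thread sets the wall-clock time of a parallel loop, and that threads which finish early sleep rather than spin. The function name and the sample counts are hypothetical, chosen only to match the rough shape described above.

```python
def implied_cpu_percent(samples_per_thread):
    """Estimate the CPU% top would show if sample counts reflected real work.

    Assumption: busy time is proportional to samples, and the busiest
    thread determines how long the parallel region lasts.
    """
    busiest = max(samples_per_thread)
    if busiest == 0:
        return 0.0
    return 100.0 * sum(samples_per_thread) / busiest


# Hypothetical counts, not measured data: four equally busy threads
# and two that are hardly used.
hypothetical = [4_000_000, 4_000_000, 4_000_000, 4_000_000, 100_000, 100_000]
print(f"{implied_cpu_percent(hypothetical):.0f}%")
```

With six equally sampled threads the same function gives 600%, which is what top often reports for this process.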
The machine has 12 actual cores, with hyperthreading enabled (I can't easily change that) so it thinks it has 24 cores. The task in question runs faster with num_threads=6 than 12 or 24. I assume that is because of cache contention.
Is hyperthreading somehow distorting the sampling? I am trying to improve the performance of this program for users who would never have hyperthreading turned on. But the machine I have available to test on does have hyperthreading on.
In functions that are entirely parallel, the first thread (the one that runs all the non-parallel code) is the fourth most sampled of the six threads. Its sample count is over ten times the smallest and under a third of the largest.
My best guess is that the parallel loops are evenly distributed across the six threads and I don't need to fix that in my program or in OpenMP. That means OProfile is giving me wildly distorted info. So how do I use profiling to find the parts of the program to improve?
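One way to test that guess without relying on the profiler would be to read the kernel's own per-thread CPU accounting from /proc. The following is a minimal sketch, assuming Linux with /proc mounted; the function names are hypothetical and the script is not part of any profiling tool.

```python
import os


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/.../stat line."""
    # The comm field is parenthesised and may contain spaces, so split
    # after the last ')'. The next token is field 3 (state); utime and
    # stime are fields 14 and 15 of the full line.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[11]) + int(rest[12])


def thread_cpu_seconds(pid):
    """Map thread id -> CPU seconds consumed so far, for each thread of pid."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    task_dir = f"/proc/{pid}/task"
    result = {}
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                result[int(tid)] = parse_cpu_ticks(f.read()) / ticks_per_second
        except OSError:
            pass  # the thread exited between listdir() and open()
    return result
```

Calling thread_cpu_seconds(pid) twice, a few seconds apart, and subtracting gives the CPU time each thread used in that interval. Note that this counts any time a thread spends spinning at a barrier as CPU time, so it measures how busy each thread is, not how much useful work it did.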
Other profiling tools that instrument call points and give call graph timing are useless for this project. Only random sampling based profiling has any chance of giving useful results, but only if I understand whatever systematic distortions are present.
Edit: I got IT to turn off hyperthreading on an identical system, and the reporting of OProfile samples by thread is even more distorted on that system (still using num_threads=6).
Last edited by johnsfine; 02-27-2014 at 12:35 PM.