How can I get maximum performance on a multi-processor machine?
I have a multi-threaded app using pthreads that runs great under Windows on my 4-core machine - all four cores get maxed out processing parts of a large file. I recompiled the same code to run on Red Hat Linux on a 64-CPU machine - but from what I can tell when it runs, it gets stuck on one core. "mpstat -P ALL" shows the cores are barely loaded. I have tried sched_affinity, sched_priority and SCHED_FIFO - nothing has helped. Any ideas on getting more performance?
|
Hi -
It sounds like your processing is largely CPU bound on Windows, and the CPU workload is equally partitioned among your available cores. Fair enough. It *doesn't* sound like *any* of the CPUs are doing much work on Linux. My first guess is that maybe you're doing I/O inefficiently on Linux: the program is spending more time waiting for data to process than it is actually processing it. An alternate guess is that you've maxed out RAM and started swapping ... and the system is doing more work swapping pages in and out than it is doing any processing work. Either way, it sounds like you're somehow, for some reason, I/O bound on Linux. Unless you see one CPU at near 100% and the remaining CPUs idle, you should probably be looking for some kind of I/O or memory bottleneck. IMHO .. PSM |
Have you tried it on only 4 cores on Linux?
You can boot the machine to only use 4 cores, or (depending on kernel level) use cgroups to limit the main task and children to a limited set (e.g. 4) of available cores. |
During re-compile, did you give the "-j64" option for the number of kernels available? (Or is that the parameter for the compilation itself? Dunno, didn't do any compiling recently...).
|
Thanks - the machine has 64 GB of RAM, so I think I'm OK there - and if I were I/O bound, wouldn't I expect to see that in mpstat? There is a column for iowait, and it barely registers over 5% on each of the 64 processors. Hmmmm.... still looking
|
Thanks - I don't have direct access - this is running RHEL, so we will look into cgroups, that's a good idea
|
So I don't readily see any descriptions for the -j option. Any more info on that? I'll try anything - note that I cannot re-compile the kernel, this is running RHEL on a client box - Thanks!
|
No, not re-compiling the kernel, I meant this:
Quote:
|
hmmm... From what I read the -j option tells gcc to compile on more than one processor. I don't have any problem compiling the app - it's running the app that is the problem. I'm trying to get the app to run on all 64 processors at once, not the compiler. Or did I misunderstand something? Thanks for the reply anyway -
|
Q: Does "top" or any of your other tools show high CPU utilization for 1 CPU, and the others idle?
For a 64-CPU system and a truly parallelized application, CPU utilization *should* be allocated equally for each active thread.
Q: Exactly how much "work" is being allocated to the 64 CPUs? It sounds like the answer - for whatever reason - is "not much".
Q: Maybe this system is just such a screamer that all the work gets done without any CPU even breaking a sweat. Who knows - maybe this is the case. If so: Relax, Be Happy :)
Q: Or maybe there's some kind of bottleneck occurring that's *preventing* the CPUs from getting all the work in a timely manner. That's what I was suggesting with "memory" and "I/O".
Suggestion:
* Write a quick'n'dirty test program that's all calculation (CPU-bound; no I/O) and see how it behaves. |
Thanks - exactly my thinking. I am getting "top" results today. The app is taking hours to run so I know there's lots of work for each CPU. I am worried about a bottleneck, so I am planning a test app. What's the laziest way to tie up a CPU? I'm thinking compute Pi or something - just looking for a trick. I'll post more results -
|
Have a look at latencytop - not designed for this specifically but might help you find any blocking.
|
Could you post the compiler options and such that you've used building your program? Cheers, Tink |
I'd build a single purpose OS. Any distro is just too generic.
Build it from scratch to match your use and don't install anything you don't need. |
Try htop - it displays a bar for each CPU - see if they're all high or not. There is also iotop for I/O.
md5sum /dev/zero will keep a CPU busy indefinitely. Run up several and use htop to see if all the CPUs get going.
|