Old 03-17-2006, 08:05 AM   #1
kornelix
Member
 
Registered: Oct 2005
Location: Germany
Distribution: Ubuntu
Posts: 58

Rep: Reputation: 24
SMP kernel performance not good?


I constructed a trivial benchmark to measure SMP performance. Each thread of the benchmark does the following:

while (elapsed_time < 5 seconds)
    obtain the lock via pthread_mutex_lock()
    increment a counter
    release the lock via pthread_mutex_unlock()

I launched 1 to 9 threads contending for the same mutex and
measured the counter value, per thread and total. I did this
using a uniprocessor kernel which only uses one CPU, and an
SMP kernel which uses both CPUs in my dual-core processor.
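For reference, here is a minimal sketch of the benchmark (illustrative only, not my exact source; names like run_flag and counts[] are invented for this sketch):

Code:
/* build: gcc -O2 -pthread smpbench.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 2                 /* varied from 1 to 9 in the tests */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int run_flag = 1;  /* cleared after 5 seconds */
static long counts[NTHREADS];      /* per-thread iteration counts */

static void *worker(void *arg)
{
    long *my_count = &counts[(long) arg];
    while (run_flag) {
        pthread_mutex_lock(&lock);    /* contend for the shared mutex */
        (*my_count)++;                /* one iteration = one locked increment */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    long i, total = 0;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *) i);

    sleep(5);                         /* let the threads run for 5 seconds */
    run_flag = 0;

    for (i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        printf("thread %ld: %ld iterations\n", i, counts[i]);
        total += counts[i];
    }
    printf("total: %ld iterations in 5 seconds\n", total);
    return 0;
}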

Results (million iterations per second):

CPUs  threads  total rate  thread rates
----  -------  ----------  ------------
  1      1         27
  1      2         27
  1      4         27
  1      9         27       2.2 to 3.9
  2      1         27
  2      2          6
  2      4          7
  2      9          6       0.3 to 1.2

CONCLUSIONS:
- SMP performance << uniprocessor performance
- allocation of CPU time to threads is very uneven

Hardware: AMD64X2 3800+ (dual core)
Kernel versions (Red Hat Fedora Core 4):
1 CPU: uniprocessor kernel 2.6.15-1.1830_FC4
2 CPUs: SMP kernel 2.6.15-1.1830_FC4smp

Note that the SMP kernel performs the same as the uniprocessor kernel with one thread, but runs at 1/4 speed with two threads. Apparently the locking and thread switching overhead is large (hundreds of instructions). Why is SMP with 2 threads << uniprocessor with 2 threads?

Does anyone have any insights to offer about this?

If this is not an appropriate place for such a question, where is one?

thanks
 
Old 03-17-2006, 12:44 PM   #2
foo_bar_foo
Senior Member
 
Registered: Jun 2004
Posts: 2,553

Rep: Reputation: 53
is this with or without SMT?

a bunch of blocked threads all fighting at random over a mutex lock is a very odd benchmark, to say the least.
more threads running at once making things more blocked makes perfect sense.

Last edited by foo_bar_foo; 03-17-2006 at 12:48 PM.
 
Old 03-18-2006, 02:07 AM   #3
kornelix
Member
 
Registered: Oct 2005
Location: Germany
Distribution: Ubuntu
Posts: 58

Original Poster
Rep: Reputation: 24
Quote:
Originally Posted by foo_bar_foo
is this with or without SMT?

a bunch of blocked threads all fighting at random over a mutex lock is a very odd benchmark, to say the least.
more threads running at once making things more blocked makes perfect sense.
I wanted to benchmark the OS performance of thread switching.
How would you have done it?
What is SMT?

Here is a repeat of the results using a fixed font to make it legible.
1 CPU means uniprocessor kernel.
2 CPUs means SMP kernel with 2 CPUs running.
Units are millions of thread switches per second.

Code:
CPUs threads total rate thread rates
---- ------- ---------- ------------
 1      1        27
 1      2        27
 1      4        27
 1      9        27      2.2 to 3.9
 2      1        27
 2      2        6
 2      4        7
 2      9        6       0.3 to 1.2
 
Old 03-18-2006, 02:20 PM   #4
foo_bar_foo
Senior Member
 
Registered: Jun 2004
Posts: 2,553

Rep: Reputation: 53
SMT is a kernel extension for SMP designed for better scheduling on intel dual core as opposed to actual dual processors. (processor switching is severely limited)
SMT might not be appropriate for AMD ?? i think amd dual core is actually two processors with separate cache (separate bus ?) who knows.

not sure about the other.
what it seems you may be measuring -- not sure -- is non-realtime context switching overhead, or latency plus scheduling latency. obviously the latency will increase from the scheduler as the cue becomes larger.
2 cues instead of 1 ? it's an interesting idea.
rather than a mutex lock you might try reading and writing something like gettimeofday() output to and from a pipe ? just a thought.
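something like this maybe, just a rough sketch i have not tested: two threads bouncing a byte through a pair of pipes and counting round trips (the names and the 5 second window are made up to match the original test):

Code:
/* build: gcc -O2 -pthread pingpong.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static int to_b[2], to_a[2];            /* pipe a->b and pipe b->a */
static volatile int run_flag = 1;

static void *side_b(void *arg)
{
    char c;
    (void) arg;
    while (read(to_b[0], &c, 1) == 1) { /* sleep until A sends a byte */
        if (!run_flag)
            break;
        if (write(to_a[1], &c, 1) != 1) /* bounce it straight back */
            break;
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    char c = 'x';
    long round_trips = 0;
    time_t stop;

    if (pipe(to_b) < 0 || pipe(to_a) < 0)
        return 1;
    pthread_create(&tid, NULL, side_b, NULL);

    /* each round trip forces at least two sleep/wake transitions */
    stop = time(NULL) + 5;
    while (time(NULL) < stop) {
        write(to_b[1], &c, 1);          /* ping */
        read(to_a[0], &c, 1);           /* wait for the pong */
        round_trips++;
    }
    run_flag = 0;
    write(to_b[1], &c, 1);              /* wake B so it can exit */
    pthread_join(tid, NULL);

    printf("%ld round trips in 5 seconds\n", round_trips);
    return 0;
}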

i would also try fork instead of thread to see the difference.
like i said before, because dual core chips are generally (at least the intel ones) using the same memory cache, and since threads as opposed to processes do not actually get a new copy of the address space, what you might be seeing also is cpu affinity ?
if you are using SMT scheduling in the kernel to prevent thrashing, the scheduler might be scheduling all threads on the same cpu, basically.

it is an interesting line of inquiry, that's for sure.
 
Old 03-18-2006, 02:51 PM   #5
foo_bar_foo
Senior Member
 
Registered: Jun 2004
Posts: 2,553

Rep: Reputation: 53
i thought a second more on this.
when you add multiple processors to the equation the lock itself becomes more complex because there are 2 run queues. in general i would imagine (i have not investigated this) that most likely the lock itself becomes 2 locks at that point. it is quite possible that when two concurrent threads are running and one finds the lock holder currently running, the thread will not block and sleep but will spin and eat up cpu cycles waiting, whereas with one cpu this will not happen. all threads that find a held lock with just one cpu will simply block and go back to sleep, because the lock holder is not even running at the time (no need to wait).
the way this will show up depends on the overhead of the context switching now being avoided vs. the length of time the lock is held.
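one way to poke at that spin vs block theory: glibc has an adaptive mutex type that spins a short while in user space before sleeping, so swapping it in for the default type and rerunning the benchmark might show whether spinning matters. just a sketch (the constant needs _GNU_SOURCE):

Code:
/* sketch: initialize the benchmark mutex with the adaptive (spin-then-sleep)
   type instead of the default, then run the same counter loop with it.
   build: gcc -O2 -pthread adaptive.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock;

int main(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);

    /* ... same worker threads as before, using this lock ... */
    printf("lock initialized with the adaptive mutex type\n");
    return 0;
}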

you might also for fun try it with the RT scheduler
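for the RT scheduler idea, calling something like this at the top of each benchmark thread should do it (sketch only; SCHED_FIFO needs root, and a spinning FIFO thread can hog a cpu, so be careful):

Code:
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* switch the calling thread to the realtime FIFO scheduling class */
static void make_realtime(void)
{
    struct sched_param sp;
    int err;

    memset(&sp, 0, sizeof sp);
    sp.sched_priority = 10;        /* any valid SCHED_FIFO priority */
    err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (err != 0)
        fprintf(stderr, "SCHED_FIFO failed: %s (are you root?)\n", strerror(err));
}

int main(void)
{
    make_realtime();               /* in the benchmark, each worker would call this */
    return 0;
}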
 
Old 03-18-2006, 08:02 PM   #6
KimVette
Senior Member
 
Registered: Dec 2004
Location: Lee, NH
Distribution: OpenSUSE, CentOS, RHEL
Posts: 1,794

Rep: Reputation: 46
Quote:
Originally Posted by foo_bar_foo
SMT is a kernel extension for SMP designed for better scheduling on intel dual core as opposed to actual dual processors. (processor switching is severely limited)
No, SMT is hyperthreading.

Dual core CPUs do REAL SMP. They are "actual dual processors" - just on one die. The drawback cited with dual cores is a bottleneck to the memory bus when both processors need to fetch data from RAM.

Quote:
SMT might not be appropriate for AMD ?? i think amd dual core is actually two processors with separate cache (separate bus ?) who knows.
Let's straighten this out:

Intel dual core processors = two actual processors on one die
AMD dual core processors = two actual processors on one die
Intel quad core processors (announced) = four actual processors on one die

Hyperthreading (HT, or SMT) = two "virtual" processors handled by one actual processor (and yes, some Intel dual cores also do hyperthreading, which gives you two "actual" processors but four "virtual" processors, or with the upcoming quad-core Xeons, eight "virtual" processors)

As far as I know, no AMD processors do hyperthreading, which is arguably not a disadvantage. The last time I checked was a month ago; I haven't looked at AMD's roadmap since. Honestly, SMT was designed to work around the Pentium 4's inherent design flaws, and since Intel has been going back to Pentium III technology with its latest Centrino family (Pentium M), I don't think hyperthreading will be around much longer; either that, or it will be significantly different in implementation if it makes it to the Pentium M line.

Quote:
not sure about the other.

what it seems you may be measuring -- not sure -- is non-realtime context switching overhead, or latency plus scheduling latency. obviously the latency will increase from the scheduler as the cue becomes larger.
I don't mean to be crass, but you need to look up some terms before you start using them. Look up latency, thread scheduling, and context switching before you use those terms. Also look up cue vs. queue.

Quote:
2 cues instead of 1 ? it's an interesting idea.
I think your post is a cue that you should check out wikipedia or howstuffworks, and there won't be much of a queue there, so you should be able to get right into those sites. (just demonstrating cue vs. queue here, in an attempt to be funny, don't read this as a flame please!)

Quote:
rather than a mutex lock you might try reading and writing something like gettimeofday() output to and from a pipe ? just a thought.

i would also try fork instead of thread to see the difference.
like i said before, because dual core chips are generally (at least the intel ones) using the same memory cache
Again, flat-out wrong. Go read ANY Intel specs, including marketing slicks intended for laypersons. You will see that the dual-core processors have independent L1 caches and independent L2 caches, just like any halfway-intelligent multiple-core processor implementation should have. What they DO share is a common bus to system memory, which is an inherent weakness of nearly any multicore chip. Still, it's a vast improvement due to a literal doubling of processor power, because it means that in a dual-socket (or dual-slot) system you can now achieve true quad-processing.

Quote:
and since threads as opposed to processes do not actually get a new copy of the address space, what you might be seeing also is cpu affinity ?
Another term for you to look up: affinity.

Affinity has nothing to do with address space. Affinity = which processor a thread is assigned to - and that assignment might not even be static, FYI. The OS's thread scheduler may decide that thread x8abe62 is on CPU1 for one cycle, and might move it to another processor for the next cycle, and you will never know it, because there is no reason for you to know unless you're the kernel or the MMU. Transparency is the whole point of SMP (and SMT), so you can focus on coding your application and managing your own threads rather than worrying about optimizing the OS's management of the actual scheduling.

Memory/address space = where in system RAM your program's data (executable, variable/pointer data, etc.) is located.
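If you want to see whether affinity is a factor, pin each benchmark thread to a CPU explicitly and rerun the test. Here is a minimal sketch using the glibc extension pthread_setaffinity_np() (the CPU numbers are just examples):

Code:
/* build: gcc -O2 -pthread pin.c   (pthread_setaffinity_np is a GNU extension) */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* pin the calling thread to the given cpu (0 or 1 on a dual core) */
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    int err;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    err = pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    if (err != 0)
        fprintf(stderr, "setaffinity(%d): %s\n", cpu, strerror(err));
}

int main(void)
{
    pin_to_cpu(0);   /* e.g. each worker could call pin_to_cpu(thread_number % 2) */
    printf("pinned to cpu 0\n");
    return 0;
}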

Quote:
if you are using SMT scheduling in the kernel to prevent thrashing, the scheduler might be scheduling all threads on the same cpu, basically.

it is an interesting line of inquiry, that's for sure.
If you're trying to drive the thread management and override the kernel's thread manager, you're heading for a race condition at best, or data corruption and/or kernel panic at worst.
 
Old 03-19-2006, 02:39 AM   #7
foo_bar_foo
Senior Member
 
Registered: Jun 2004
Posts: 2,553

Rep: Reputation: 53
Quote:
No, SMT is hyperthreading
yea so i got the terminology wrong and didn't use the ad buzzword.
i still said check if this is on in your kernel cause it ain't right for AMD.
this is a totally valid point since he is asking about thread switching and not CPU differences. After all this is the software section and not the hardware section.
you wrote 16 lines over a use of the wrong marketing phrase -- roll eyes
basically your entire post is about these chip differences when the original post is about kernel thread scheduling. Do you have any insight into kernel thread scheduling in relation to mutex locks ? must not.
Quote:
Originally Posted by KimVette
I don't mean to be crass, but you need to look up some terms before you start using them. Look up latency, thread scheduling, and context switching before you use those terms.
yea i think you forgot to point out how you thought this was in error. it's fun to mount a big offense when you feel threatened (look stuff up) but substance would help as well.
is there some way that context switching, as the threads get woken up and have to make the transition to runnable, is not latency ? plus scheduling latency. that would be, for you, (context switching + scheduling latency), which would go up as the list of threads gets longer. please show how that is a wrong assumption. this would be the time it takes for a thread to respond when the mutex lock is released, right ??
that would eat into the things he is counting while (elapsed_time < 5 seconds).
right. please, if i am wrong, point out exactly how, or keep quiet !
please be detailed about how this is not correct, because you have wasted our time otherwise with useless sarcasm. you would do well to look up the terms sarcasm, and wasting the time of others, before you make further posts.


Quote:
Originally Posted by KimVette
I think your post is a cue that you should check out wikipedia or howstuffworks, and there won't be much of a queue there, so you should be able to get right into those sites. (just demonstrating cue vs. queue here, in an attempt to be funny, don't read this as a flame please!)


Another term for you to look up: affinity.

Affinity has nothing to do with address space. Affinity = which processor a thread is assigned to - and that assignment might not even be static, FYI. The OS's thread scheduler may decide that thread x8abe62 is on CPU1 for one cycle, and might move it to another processor for the next cycle, and you will never know it, because there is no reason for you to know unless you're the kernel or the MMU. Transparency is the whole point of SMP (and SMT), so you can focus on coding your application and managing your own threads rather than worrying about optimizing the OS's management of the actual scheduling.

Memory/address space = where in system RAM your program's data (executable, variable/pointer data, etc.) is located.


If you're trying to drive the thread management and override the kernel's thread manager, you're heading for a race condition at best, or data corruption and/or kernel panic at worst.
ummmmmmm that's all real cute and all, but the guy was trying to benchmark thread switching.
is there anything at all in what you wrote about that -- NO.
not one single word did you write about thread switching. You actually just said, and i have trouble believing anyone could actually be this stupid, that which CPU a new thread or process gets scheduled on has nothing to do with whether it gets its own memory space or shares a memory space.
CPU affinity has everything to do with address space. The basic idea behind CPU affinity is to keep CPU-specific data from needing to be copied to another CPU, thus displacing its cache and wasting overhead. Perhaps this would be where you could explain exactly how this is not true in the Linux kernel, and please provide us with relevant code snippets, because we are all on the edge of our chairs.
perhaps you could spend a little time reading the code that came with your linux kernel so you can understand what happens when threads contend for a mutex lock, or when the scheduler decides whether or not to switch a process to another CPU. Then you could actually contribute something useful to the discussion rather than being an ass. Perhaps you could post us some code you wrote to benchmark thread switching.

Quote:
The OS's thread scheduler may decide that thread x8abe62 is on CPU1 for one cycle, and might move it to another processor for the next cycle, and you will never know it,
because there is no reason for you to know unless you're the kernel or the MMU. Transparency is the whole point of SMP (and SMT)
well yes, you have shown very clearly that the entire situation is hidden from your view, that's for sure. but some of us are interested in how the kernel actually works, not in how we don't know how it works, like you.

Last edited by foo_bar_foo; 03-19-2006 at 04:28 AM.
 
Old 03-19-2006, 02:49 AM   #8
foo_bar_foo
Senior Member
 
Registered: Jun 2004
Posts: 2,553

Rep: Reputation: 53
before KimVette launches another bizarre post on this thread, may i remind everyone that the original question was

Quote:
Why is SMP with 2 threads << uniprocessor with 2 threads?

Does anyone have any insights to offer about this?
if you have nothing to say on this point then you should move on to a thread you can actually comment on.
 
Old 03-19-2006, 12:20 PM   #9
foo_bar_foo
Senior Member
 
Registered: Jun 2004
Posts: 2,553

Rep: Reputation: 53
hey, i wanted to share my final conclusion on this, and thanks for asking such an interesting question.
after more thought i think the benchmark is quite valid.
and as you point out the results are very odd.
this is what i think is going on.
when two concurrent threads are running on SMP (i know this is how the Solaris kernel works but i am not sure about Linux), in order to avoid the overhead of the context switch, the thread without the lock spins instead of blocking. The spinning generates a constant stream of requests across the bus to check the mutually exclusive lock state. the state of course can't be held in cache, for obvious reasons. These requests across the bus are interrupting the updates to the counter in the running thread. does that seem reasonable ? do the updates to the counter have to go across the bus ?
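one way to check whether the counter updates themselves are part of that bus traffic would be to give each thread its own counter padded out to its own cache line, so the only shared line left is the lock itself. rough sketch (the 64 byte line size and the gcc aligned attribute are assumptions):

Code:
/* build: gcc -O2 -pthread padded.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define CACHE_LINE 64              /* assumed line size for this CPU */
#define NTHREADS   2

/* each thread gets its own counter on its own cache line */
struct padded_counter {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE)));

static struct padded_counter counts[NTHREADS];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int run_flag = 1;

static void *worker(void *arg)
{
    struct padded_counter *c = &counts[(long) arg];
    while (run_flag) {
        pthread_mutex_lock(&lock);
        c->count++;                /* lands on this thread's own line now */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    long i, total = 0;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *) i);
    sleep(5);
    run_flag = 0;
    for (i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += counts[i].count;
    }
    printf("total: %ld increments in 5 seconds\n", total);
    return 0;
}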

this is all i can think of that would cause this.

conclusion: dual core processors just like their hyperthreading cousins kind of suck to the point of being useless.

it may be that in the future the kernel people will create a workaround for this issue, and it is not outrageous for you to tell them it exists in case they have not noticed.
 
Old 03-20-2006, 08:41 AM   #10
Wells
Member
 
Registered: Nov 2004
Location: Florida, USA
Distribution: Debian, Redhat
Posts: 417

Rep: Reputation: 53
Quote:
Originally Posted by foo_bar_foo
conclusion: dual core processors just like their hyperthreading cousins kind of suck to the point of being useless.

it may be that in the future the kernel people will create a workaround for this issue, and it is not outrageous for you to tell them it exists in case they have not noticed.
I do have a question concerning the makeup of the kernel being used...

Is NUMA on or off in the kernel? Also, is NUMA turned on or off in the BIOS? I have had some experience with performance with NUMA either on or off on multi-core systems, and my experience has been that the handling of NUMA in the Linux kernel leaves much to be desired at this time.
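A quick way to check what the kernel itself reports, from C (this assumes the numactl/libnuma development package is installed; link with -lnuma):

Code:
/* build: gcc numacheck.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        /* the kernel has no NUMA support compiled in, or it is disabled */
        printf("NUMA is not available on this kernel\n");
        return 0;
    }
    /* numa_max_node() returns the highest node number; 0 means one node */
    printf("NUMA available, highest node = %d\n", numa_max_node());
    return 0;
}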
 
Old 03-20-2006, 06:12 PM   #11
foo_bar_foo
Senior Member
 
Registered: Jun 2004
Posts: 2,553

Rep: Reputation: 53
this is interesting
certainly this system is not a NUMA architecture, is it ?
NUMA certainly would begin to explain the bus clog in a different way.
 
Old 03-21-2006, 09:34 AM   #12
Wells
Member
 
Registered: Nov 2004
Location: Florida, USA
Distribution: Debian, Redhat
Posts: 417

Rep: Reputation: 53
I know that the dual core dual processor machines we have seem to support NUMA architectures, or at least some form of it.
 
Old 03-21-2006, 06:15 PM   #13
foo_bar_foo
Senior Member
 
Registered: Jun 2004
Posts: 2,553

Rep: Reputation: 53
fascinating.
one processor writing to the same node but from further away than the other, and the closer one will win the battle every time.
 
  

