Notes regarding whatever is catching my interest, which is normally technology, more often than not dealing with virtualization or performance testing. Of course in all likelihood, anything useful I might have to say will get buried in the useless things I find funny or irritating and since it's a blog I have no shame about inflicting any of that on you.

NUMA, Linus and the Ancient Conundrum

Posted 03-27-2015 at 07:06 PM by dijetlo
Updated 03-28-2015 at 07:01 PM by dijetlo

Tags linux kernel, numa, virtualization

Non-Uniform Memory Architecture is kind of a dry subject for most. Where a bit of code or data sits in processor cache is of little importance to the average user, unless you become interested in things like virtualization (which seeks to maximize the utilization of that cache and slams face first into the issue) or in performance testing, in which case squeezing every extra productive cycle out of the hardware becomes a pressing concern. I happen to be interested in both, so when Linus Torvalds addressed the question in announcing release candidate 5 of his 4.0 kernel, I took notice.

Quote:

I'm still trying to think about the NUMA balancing performance regression. It may not be a show-stopper, but it's annoying, and I want it fixed. We'll get it, I'm sure.

What irks our hero is a question as old as recorded history. The area surrounding the center of power (the Capital of an Empire or the core of a processor) becomes very crowded, very quickly. Beyond the assorted hangers-on and sycophants (veritable pages waiting to be flushed) there are the useful folk who make the Empire (or processor) function (hewers of wood, drawers of water, mutex variables, shared memory pages and whatnot).
Getting rid of the hangers-on is a matter of constant pruning. However, no matter how you prune, you can't get all the useful folk close enough to the Emperor (or processor) for efficient communication to be possible. As a result, delays and inefficiencies creep into the system (latency).
So you build roads, or cross-chip interconnect technology (kinda up to you), to speed the transfer of instructions and data between the Capital and the provinces (or the processor and non-contiguous cache memory). However, you then have to hire horsemen (or build on-chip controllers) to manage the roads. The horsemen themselves are an inefficiency, cross-chip interconnect technology even more so, since nothing beats having the guy you need right there beside the throne at the exact moment you need him (optimal cache allocation).
My guess would be that Dr. Torvalds is looking to strike a balance in the short term as he waits for the chip manufacturers to design a provincial Seneschal for the outer reaches of the cache memory. After all, that's what all the cool Emperors eventually did, more or less.

Posted in Virtualization Platforms, NUMA, Kernel

Views 3048 Comments 6

Comments

I thought the issue surrounded the idea of numerous empires (CPUs), each with workers (data in RAM), that need to perform work at a given capital (CPU cache). Everything is happy when the workers are close to the Capitals that need their services. However, when a remote capital has need of wood choppers, and all of the wood choppers take days to get there, it is a performance hit. My understanding is that Linus designed the Linux kernel to relocate workers by predicting which capitals would need them, but no prediction algorithm is 100% accurate.

Or perhaps Linus was talking about the deeper problem that NUMA attempts to solve, that being multi-CPU RAM access. If all CPUs are on a common bus, only one can effectively use the bus at a given time, potentially starving all other CPUs. NUMA gives each processor its own memory, but when CPUs need to access each other's memory there are hits to access times. A workload such as a database tends to suffer performance hits on NUMA because each CPU could possibly be working on a given section of RAM at a given time. Workloads like virtualization, where it is possible to lock VMs to CPUs, could benefit from NUMA more so than massive databases (unless the databases are specifically designed to take advantage of a NUMA architecture).
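To put a rough number on that access-time hit, here is a minimal sketch. It assumes a two-node machine whose firmware reports a relative distance of 10 for local memory and 21 for the other node's memory. Those two figures are made up for illustration (on Linux a row like this can usually be read from /sys/devices/system/node/node0/distance), and the function names are hypothetical.

```python
def parse_distances(row):
    """Parse one row of a NUMA distance table, e.g. "10 21" -> [10, 21]."""
    return [int(field) for field in row.split()]


def average_access_cost(row, access_share):
    """Weighted average relative cost of memory access for one node.

    row          -- distance row as text; by convention 10 means local memory
    access_share -- how the accesses split across nodes (any positive weights)
    """
    distances = parse_distances(row)
    if len(distances) != len(access_share):
        raise ValueError("need one access share per node")
    total = sum(access_share)
    if total <= 0:
        raise ValueError("access shares must sum to a positive value")
    weighted = sum(d * s for d, s in zip(distances, access_share))
    return weighted / total


if __name__ == "__main__":
    # Illustrative values only: all-local access versus an even split.
    print(average_access_cost("10 21", [1, 0]))
    print(average_access_cost("10 21", [1, 1]))
```

The distances are relative units rather than nanoseconds, so this only says how much worse a mixed workload is than an all-local one under the assumed table, not what the real latency is.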

I could be completely off my rocker (and it wouldn't surprise me if I were), but that was my understanding of NUMA last I dealt with it.

Posted 03-27-2015 at 07:38 PM by rocket357

That's true, Rocket, except we're talking about shared pages (shmem) when we're talking about NUMA. That's cache memory (L1/L2), not RAM. The issues surrounding NUMA and scheduling deal more with the migration of shared pages between individual processors' caches and with the threads attempting to access them, which don't always migrate with them, creating the latency issues.
Your second paragraph is dead on, btw, except we're just talking about cache. When you access pages in excess of the cache capacity, you're past anything that can be fixed through scheduling (it becomes entirely a bus issue since all locations in RAM are equally distant). Other than that, kudos.
A kinda detailed discussion of this problem

You can NUMA pin a process to a memory location and/or processor; however, that makes the Linux scheduler less efficient, not more so (giving one set of subjects special access to the Emperor is how revolts and resource choking occur).
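For the curious, the processor half of that pinning can be sketched in a few lines of Python. This is only a sketch under assumptions: it assumes a Linux box that exposes each node's CPUs in /sys/devices/system/node/nodeN/cpulist, and the helper names parse_cpulist and pin_to_node are mine, not anything from the kernel or a library. It does nothing about where the memory lives; binding memory to a node needs something like numactl (--cpunodebind and --membind) or libnuma.

```python
import os


def parse_cpulist(text):
    """Turn a kernel-style CPU list such as "0-3,8-11" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a node with no CPUs
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node, pid=0):
    """Restrict a process (0 = the caller) to the CPUs of one NUMA node.

    Linux only: os.sched_setaffinity is not available on other platforms,
    and the sysfs path below is assumed to exist on the target machine.
    """
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as handle:
        cpus = parse_cpulist(handle.read())
    os.sched_setaffinity(pid, cpus)
    return cpus
```

Calling pin_to_node(0) would keep the current process on node 0's CPUs, which is exactly the kind of special access to the Emperor that ties the scheduler's hands.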
I picked up an ancient dual Xeon processor PowerEdge server for some magic beans. I'm a long way from getting it ready for this kind of testing (it's just sitting there looking at me with that "fix me" expression on its face), but I hope to make some progress on it over the weekend. I'll be asking a lot of questions on the threads (no doubt) and maybe posting some testing results on here. I hope we get a chance to speak again on this subject.
Thanks for the comment, always a pleasure talking with you.

Posted 03-27-2015 at 08:23 PM by dijetlo

Updated 03-27-2015 at 09:04 PM by dijetlo (Sorry guys, I keep finding my mistakes... Maybe I should stop?)

Ahh, I see. I hadn't read much past the email you linked with regards to Linus' commentary on the subject. Thanks for the clarification!

Good luck with the PowerEdge. I'd kill for something like that right about now, and I hope to have sufficient funding shortly to be able to do a few virtualization-related projects. Would I be too forward if I asked where you managed to pick up the PowerEdge? Some place common, like eBay, perhaps? Forgive me if that's too forward =)

Posted 03-28-2015 at 12:24 AM by rocket357

Not at all, Rocket, I got them on Craigslist for $150 (there are two, however one only has a single processor in it so it's doomed to be a spare parts container). It's circa 2004-ish so no support for kernel level virtualization (vt), but it's got a couple of sockets and enough hyperthreads, cache and hard drives for some serious perf'ing fun. Added bonus, they were about to be thrown out by a local charity that provides jobs for the developmentally disabled, so it was a serious "two-fer" as far as I was concerned.
I got my case plates from the Slackware store yesterday, so it's ready to begin its new life. One drawback, the wife says it sounds like the vacuum cleaner when I fire it up so I may have to do something about that (the noise... not the wife).

Posted 03-28-2015 at 02:18 AM by dijetlo

Updated 03-28-2015 at 02:32 AM by dijetlo

My wife feels your wife's pain. The Cisco and the firewall in our office make it difficult to maintain a conversation =)

Posted 03-28-2015 at 04:19 AM by rocket357

Is that a feature or a bug?

Posted 03-28-2015 at 05:17 AM by dijetlo