How is that possible? CPU & memory
:confused: with a simple question...
My RHEL58 box has two dual core Xeon CPUs with a total of 16 processors, which is confirmed by Code:
grep processor /proc/cpuinfo
I thought that each processor has its own physical chunk of memory, that is 96/16 = 6 GB, which is less than the 8 GB per memory bank. In other words, how are the 12 memory banks divided between the 16 cores? |
Each CPU (package) has direct access to some of the memory slots. But each CPU can access the rest of the memory indirectly through the CPU that has direct access.
That indirect access is completely transparent to the program. Each byte of RAM has a system-wide unique physical address, and the software simply accesses the memory (through the usual virtual to physical translation, of course); the access reaches the correct memory directly or indirectly.

My newer (obviously not very new) system at work (a Dell T5500) has 9 memory slots for two physical CPUs. One CPU has direct access to 6 memory slots and the other CPU has direct access to 3. That CPU type is designed to access three memory slots in parallel, so one CPU has just the 3 it can access in parallel, while the other has two sets of 3. Quote:
Did you mean two quad core with hyperthreading? Or did you mean two eight core packages? Or what? Quote:
The cores or hyperthread processors within a CPU package all have equal access to whatever RAM that package can access. I don't know whether your twelve sticks of RAM are evenly divided, unevenly divided, or undivided between the two CPU packages. But all that has only secondary impact on how memory is used by each process in Linux.

An OS that is aware of direct vs. indirect access to RAM might try to assign a process to the CPU package with better access to most of that process's RAM. It also (and more easily) might try to use RAM better accessed from the package where the process is running to fill any new requests from that process. But to the extent such efforts fail, all the RAM is still accessible from processes in either package. That access just might be a little slower. |
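The direct-vs-indirect (NUMA) layout described in the post above can be inspected on Linux. This is a sketch assuming a reasonably modern kernel; the `numactl` tool may not be installed by default, but the sysfs view is always present:

```shell
# NUMA nodes visible to the kernel (typically one per CPU package
# on this class of two-socket hardware)
ls /sys/devices/system/node/ | grep '^node'

# How much RAM is attached to each node
grep MemTotal /sys/devices/system/node/node*/meminfo

# If the numactl package is installed, this prints a friendlier
# summary, including inter-node access distances:
# numactl --hardware
```

If the sticks are unevenly divided between the packages (as in the 6-slot/3-slot T5500 example above), the per-node MemTotal values will show it.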
Can you share result of:-
Code:
~$ grep 'model name' /proc/cpuinfo |
Quote:
Quote:
Code:
Intel Xeon CPU E5420 @ 2.4 GHz
I guess the 12 banks are divided between 8 physical cores, 6 banks per quad core processor, or per each of the 2 CPU packages. Still a puzzling relationship.

The reason I got interested in this is that I am looking for ways to speed up computation using programs that split a task to run it in parallel on multiple cores. The question is how much memory each core will have. Or should this not be my concern at all, as the OS will take care of memory distribution, and knowing that 96/16 = 6 GB is sufficient? Or is it in fact 96/8 = 12 GB? |
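The TotalMemory/NumberOfCores arithmetic from the post above is easy to script. Note the result is only a planning figure for sizing parallel jobs, not a limit the OS enforces; a process on any core can use all the RAM:

```shell
# Rough per-logical-CPU memory budget. This is NOT enforced by the
# kernel; it just estimates how many equal-sized parallel jobs fit
# in RAM at once.
mem_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
cpus=$(nproc)
echo "$((mem_kb / 1024 / cpus)) MiB per logical CPU"
```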
Quote:
Rather than have us guess what info to extract from /proc/cpuinfo, it wouldn't be excessive to just copy/paste the entire /proc/cpuinfo into a reply here. Quote:
Quote:
|
In my experience, everything "depends." You might think you have lots-n-lots of "CPUs," especially with hyper-threading, and you might think that you've got lots-n-lots of accessible memory, but when you start to dig deeper into how your motherboard is laid out, you find out what the difference between cheap mobos and expensive ones actually is. :) Sometimes, having "all those cores" banging away actually runs slower than it otherwise would, because what's actually happening is that they're competing with one another.
|
Quote:
It is absolutely true that many algorithms can be divided up among multiple threads with no significant extra work nor synchronization overhead (so you would expect linear speedup with the number of cores), but when you actually try it you find two threads take much more than half as long as one thread. Then as you increase the number of threads further, the elapsed time goes up rather than down. With enough threads (even though you have enough cores and RAM for that many threads) the elapsed time may be worse than with just one thread.

This effect depends on the size and structure of the CPU caches and on the memory access patterns of the algorithm. (The motherboard quality may also have some impact, but typically that is small.) Contention between the threads can eliminate the benefits of having more than one thread.

You might want to use oprofile or a similar tool to investigate a key performance measure of your code before going to the trouble to multi-thread it. (Lots of effort has been wasted by some people I work with who would not follow my advice to do that step.) Non-intrusive low level profilers work on a different principle than the more common profiling tools, and it is information from that low level profiling that matters here.

If you see an unusually high number of CPU cycles per instruction completed and you see a high cache miss rate, it is clear that splitting the work across two cores that share cache would cause a net increase in elapsed time. Even splitting across cores that don't share cache is likely to have little benefit. But if you see a low number of CPU cycles per instruction, and/or you see a high level of branch misprediction (sufficient to explain the high CPU cycles per instruction), then multi-threading is likely to have near linear benefits.
A high number of CPU cycles per instruction with low cache misses and low branch mispredictions could indicate a large concentration of divides and/or square roots in your algorithm, which could benefit a lot from multi-threading. But you really should know your algorithm before taking that view. Why do you have a large concentration of divides and/or square roots? If the reason is not fundamental to the job being done, then you might be misinterpreting the profile data and have a situation that would not benefit from multi-threading, and/or you might have a performance flaw in your implementation and have better potential benefits from fixing the implementation than from multi-threading. |
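The counters discussed in the post above (cycles per instruction, cache misses, branch mispredictions) can be read with `perf` on reasonably modern kernels; oprofile plays the same role on older systems. A sketch, with `sleep 1` standing in for the real workload:

```shell
# Count the events that distinguish cache-bound from compute-bound
# code. Replace 'sleep 1' with the actual program and arguments; the
# instructions-per-cycle ratio and the miss counts printed by perf
# are what the post above interprets.
perf stat -e cycles,instructions,cache-misses,branch-misses sleep 1
```

Running this may require permissions (e.g. a suitable `perf_event_paranoid` setting) on some systems.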
For algorithms implemented by others it is often not possible to examine how they were done, as compiled code is all you get. For example, I got to run a program that was said to be capable of multithreading, but people who tried it did not see sufficient speedup with large datasets (dozens of GB). When I ran the code on my box, I saw that at some point the CPU got occupied at 100%, but in Irix mode it was only about 6% per core (100/16 ~ 6.3%). As I did not know how to load the CPU at a higher rate, I split the dataset into 5 equal size chunks and launched 5 instances of the code, each with its own chunk. CPU load eventually got up to ~500%, or about 30% per core, and the job was effectively done 5 times faster.

I guess this is one way to speed up processing, but not all datasets can be split like that. But if it is possible, how big can the chunks that are processed in parallel be? Would a rough estimate like TotalMemory/NumberOfCores make usable guidance? If the dataset cannot be split, is there another way to load the CPU at a higher rate with one instance, and would that really be helpful? |
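The manual split-and-run approach from the post above can be scripted. The file name, chunk count, and `./process` program here are illustrative stand-ins, not anything from the original thread:

```shell
# Split the input into 5 roughly equal pieces without breaking lines
# (GNU split; l/5 gives each chunk whole lines only).
split -n l/5 big_dataset.txt chunk_

# Launch one background instance per chunk, then wait for all of
# them to finish before using the results.
for f in chunk_*; do
    ./process "$f" > "$f.out" &
done
wait
```

This only works when records in the dataset are independent, which is exactly the caveat raised above.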
I'm still curious why grep found lines from 16 processors in your /proc/cpuinfo when you have clearly described a system with 8 cores and without hyperthreading.
Please post the full /proc/cpuinfo so we can see what is really there. Quote:
|
Quote:
Code:
[yaximik@G5NNJN1 ~]$ cat /proc/cpuinfo |
Quote:
Anyway, it is absolutely clear that you have hyperthreading enabled, so that each of your 8 cores pretends to be two, for a total of 16. All the online info I found says the E5420 does not have hyperthreading. Maybe that online info is wrong. Maybe you don't have E5420 CPUs. But my interpretation of the /proc/cpuinfo data you just posted is not wrong: you posted info listing 8 real cores doubled by hyperthreading into 16 apparent processors.

Depending on the mix of work you run on that computer, you would probably get slightly better performance if you rebooted into the BIOS menu, found the BIOS option for hyperthreading, and turned it off. But for other workloads, turning hyperthreading off would reduce the total work the system can do.

I do a LOT of compiling very large projects with single threaded compilers launched by a build system that is very flexible about running multiple compilers in parallel. On a system with 8 real cores, a hyperthreading option, and enough RAM for 16 compiles at once, I have found:
1) Running 16 compiles in parallel with hyperthreading disabled is slightly better throughput than running 8 compiles in parallel with hyperthreading disabled.
2) Running 16 compiles in parallel with hyperthreading enabled is slightly better throughput than running 16 compiles in parallel with hyperthreading disabled.

But I also do some sophisticated large simulations on the same hardware (performance dominated by cache misses) with configurable thread count. With or without hyperthreading enabled, selecting 2 threads gives me better performance than selecting 1 or selecting more than 2. When selecting 2 threads, the performance is slightly better if hyperthreading was disabled than if it was enabled. That pattern will not be true of large simulation activities in general. But it is a common pattern. It matches what many other people have seen with other simulation jobs.
That performance behavior of multiple single threaded compilers in building very large projects is more general. It is true across many different compilers, across many different projects, across different build systems and across Windows vs. Linux. If you have a compiler that internally makes good use of multi-threading, performance issues may be very different. But for heavy use of single threaded compilers, those performance effects are quite reproducible, including the benefits of hyperthreading. |
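Whether hyperthreading is doubling the apparent processor count can be read straight from /proc/cpuinfo: if "siblings" is twice "cpu cores" for a package, each core presents two hardware threads. A quick check (lscpu comes with util-linux and may be absent on very old distributions):

```shell
# 'cpu cores' = real cores per package, 'siblings' = logical CPUs
# per package; identical values mean hyperthreading is off.
grep -E 'cpu cores|siblings' /proc/cpuinfo | sort -u

# lscpu summarizes the same topology more directly:
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'
```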
Intel® Xeon® Processor E5620 http://ark.intel.com/products/47925/...-GTs-Intel-QPI >>> # of Threads : 8 |
Edit: Oops! I had a stupid post here because of browser issues that didn't let me see part of the /proc/cpuinfo posted above.
I saw Code:
model name : Intel(R) Xeon(R) CPU
instead of Code:
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
I looked at some Dell T610 documentation when writing post #5. If I had looked at the E5420 documentation more closely, I would have realized it is based on a totally different relationship between the CPU and RAM than that in the T610, so I could be sure there was no E5420 in a T610 even without seeing the list of compatible processors that I saw later. |
The OP's /proc/cpuinfo shows the E5620 as the processor model, so I'd imagine that's the processor he really has.
|
In the above discussion, I left out the occasional situation in which it is most important to turn hyperthreading off:
Some basic libraries of support software for large numerical algorithms are internally multithreaded and configure themselves automatically based on the apparent number of processors. You could be using such software with a problem whose performance is dominated by cache misses. In that case, use of 16 threads via hyperthreading would be horribly worse than using 8 threads. If you are in control of the number of threads in such a situation, you could select 8 (or fewer) threads even though the system seems to have 16 processors, and that reduction to 8 threads from 16 would have almost as much benefit with hyperthreading left on as it would with hyperthreading off. But with some programs you are stuck with the automatic configuration, and so leaving hyperthreading on may devastate the total throughput.

If the OS is aware of hyperthreading, it will use one thread of every real core before using the second thread of any core. The hardware is designed so that when one thread of a core is unused, the other thread runs almost as fast as the undivided core would have run with hyperthreading disabled. To a first approximation, using both threads of a hyperthreaded core makes each of them half as fast as the undivided core. So hyperthreading gives you twice as many processing units, each half as fast.

But when your code stalls a lot on things like mispredicted branches (as many compilers tend to do), each thread will be much better than 50% as fast as an undivided core, so total throughput is improved by hyperthreading. The same is true if the two processes tend to have very different kinds of stalls from each other, such as one stalling on something like excess floating point divides and sqrts while the other isn't using floating point at all. But if both threads are stalling mainly on cache misses, then each thread will be much worse than 50% of an undivided core.

There is hardly any limit on how much worse, because the raw cache miss rate is increased in addition to the two threads contending for the same resource. Each thread could easily be slower than 10% the speed of an undivided core. |
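When a library sizes its thread pool from the apparent processor count, the count can usually be capped without touching the BIOS. These are common mechanisms, not specific to any program in this thread; `./solver` is a hypothetical example binary:

```shell
# OpenMP-based numeric libraries honor this environment variable:
OMP_NUM_THREADS=8 ./solver input.dat

# Or pin the process to one hardware thread per core; the CPU list
# 0-7 is an example -- check your topology with lscpu first:
taskset -c 0-7 ./solver input.dat

# On newer kernels SMT can also be toggled at runtime (as root),
# without a reboot into the BIOS:
# echo off > /sys/devices/system/cpu/smt/control
```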