How is that possible? CPU & memory
:confused: with a simple question...
My RHEL58 box has two dual core Xeon CPUs with a total of 16 processors, which is confirmed by Code:
grep processor /proc/cpuinfo
I thought that each processor has its own physical chunk of memory, that is 96/16 = 6 GB, which is less than the 8 GB per memory bank. In other words, how are the 12 memory banks divided between the 16 cores? |
Each CPU (package) has direct access to some of the memory slots. But each CPU can access the rest of the memory indirectly through the CPU that has direct access.
That indirect access is completely transparent to the program. Each byte of RAM has a system-wide unique physical address, and the software simply accesses the memory (through the usual virtual to physical translation, of course); the access reaches the correct memory directly or indirectly.

My newer (obviously not very new) system at work (a Dell T5500) has 9 memory slots for two physical CPUs. One CPU has direct access to 6 memory slots and the other CPU has direct access to 3. That CPU type is designed to access three memory slots in parallel, so one CPU has just the 3 it can access in parallel, while the other has two sets of 3. Quote:
Did you mean two quad core with hyperthreading? Or did you mean two eight core packages? Or what? Quote:
The cores or hyperthread processors within a CPU package all have equal access to whatever RAM that package can access. I don't know whether your twelve sticks of RAM are evenly divided, unevenly divided, or undivided between the two CPU packages. But all that has only secondary impact on how memory is used by each process in Linux.

An OS that is aware of direct vs. indirect access to RAM might try to assign a process to the CPU package with better access to most of that process's RAM. It also (and more easily) might try to use RAM better accessed from the package where the process is running to fill any new requests from that process. But to the extent such efforts fail, all the RAM is still accessible from processes in either package. That access just might be a little slower. |
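The direct-vs-indirect (NUMA) layout described in the post above can be inspected on Linux. This is a sketch assuming a reasonably modern kernel; the `numactl` tool may not be installed by default, but the sysfs view is always present:

```shell
# NUMA nodes visible to the kernel (typically one per CPU package
# on this class of two-socket hardware)
ls /sys/devices/system/node/ | grep '^node'

# How much RAM is attached to each node
grep MemTotal /sys/devices/system/node/node*/meminfo

# If the numactl package is installed, this prints a friendlier
# summary, including inter-node access distances:
# numactl --hardware
```

If the sticks are unevenly divided between the packages (as in the 6-slot/3-slot T5500 example above), the per-node MemTotal values will show it.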
Can you share result of:-
Code:
~$ grep 'model name' /proc/cpuinfo |
Quote:
Quote:
Code:
Intel Xeon CPU E5420 @ 2.4 GHz
I guess the 12 banks are divided between 8 physical cores, 6 banks per quad core processor, or per each of the 2 CPU packages. Still a puzzling relationship.

The reason I got interested in this is that I am looking for ways to speed up computation using programs that split a task to run it in parallel on multiple cores. The question is how much memory each core will have. Or should this not be my concern at all, as the OS will take care of memory distribution, and knowing that 96/16 = 6 GB is sufficient? Or is it in fact 96/8 = 12 GB? |
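The TotalMemory/NumberOfCores arithmetic from the post above is easy to script. Note the result is only a planning figure for sizing parallel jobs, not a limit the OS enforces; a process on any core can use all the RAM:

```shell
# Rough per-logical-CPU memory budget. This is NOT enforced by the
# kernel; it just estimates how many equal-sized parallel jobs fit
# in RAM at once.
mem_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
cpus=$(nproc)
echo "$((mem_kb / 1024 / cpus)) MiB per logical CPU"
```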
Quote:
Rather than have us guess what info to extract from /proc/cpuinfo, it wouldn't be excessive to just copy/paste the entire /proc/cpuinfo into a reply here. Quote:
Quote:
|
In my experience, everything "depends." You might think you have lots-n-lots of "CPUs," especially with hyper-threading, and you might think that you've got lots-n-lots of accessible memory, but when you start to dig deeper into how your motherboard is laid out, you find out what the difference between cheap mobos and expensive ones actually is. :) Sometimes, having "all those cores" banging away actually runs slower than it otherwise would, because what's actually happening is that they're competing with one another.
|
Quote:
It is absolutely true that many algorithms can be divided up among multiple threads with no significant extra work nor synchronization overhead (so you would expect linear speedup with the number of cores), but when you actually try it you find two threads take much more than half as long as one thread. Then as you increase the number of threads further, the elapsed time goes up rather than down. With enough threads (even though you have enough cores and RAM for that many threads) the elapsed time may be worse than with just one thread.

This effect depends on the size and structure of the CPU caches and on the memory access patterns of the algorithm. (The motherboard quality may also have some impact, but typically that is small.) Contention between the threads can eliminate the benefits of having more than one thread.

You might want to use oprofile or a similar tool to investigate a key performance measure of your code before going to the trouble to multi-thread it. (Lots of effort has been wasted by some people I work with who would not follow my advice to do that step.) Non-intrusive low level profilers work on a different principle than the more common profiling tools, and it is information from that low level profiling that matters here.

If you see an unusually high number of CPU cycles per instruction completed and you see a high cache miss rate, it is clear that splitting the work across two cores that share cache would cause a net increase in elapsed time. Even splitting across cores that don't share cache is likely to have little benefit. But if you see a low number of CPU cycles per instruction, and/or you see a high level of branch misprediction (sufficient to explain the high CPU cycles per instruction), then multi-threading is likely to have near linear benefits.
A high number of CPU cycles per instruction with low cache misses and low branch mispredictions could indicate a large concentration of divides and/or square roots in your algorithm, which could benefit a lot from multi-threading. But you really should know your algorithm before taking that view. Why do you have a large concentration of divides and/or square roots? If the reason is not fundamental to the job being done, then you might be misinterpreting the profile data and have a situation that would not benefit from multi-threading, and/or you might have a performance flaw in your implementation and have better potential benefits from fixing the implementation than from multi-threading. |
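The counters discussed in the post above (cycles per instruction, cache misses, branch mispredictions) can be read with `perf` on reasonably modern kernels; oprofile plays the same role on older systems. A sketch, with `sleep 1` standing in for the real workload:

```shell
# Count the events that distinguish cache-bound from compute-bound
# code. Replace 'sleep 1' with the actual program and arguments; the
# instructions-per-cycle ratio and the miss counts printed by perf
# are what the post above interprets.
perf stat -e cycles,instructions,cache-misses,branch-misses sleep 1
```

Running this may require permissions (e.g. a suitable `perf_event_paranoid` setting) on some systems.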
For algorithms implemented by others it is often not possible to examine how they were done, as compiled code is all you get. For example, I got to run a program that was said to be capable of multithreading, but people who tried it did not see sufficient speedup with large datasets (dozens of GB). When I ran the code on my box, I saw that at some point the CPU got occupied at 100%, but in Irix mode it was only about 6% per core (100/16 ~ 6.3%). As I did not know how to load the CPU at a higher rate, I split the dataset into 5 equal size chunks and launched 5 instances of the code, each with its own chunk. CPU load eventually got up to ~500%, or about 30% per core, and the job was effectively done 5 times faster.

I guess this is one way to speed up processing, but not all datasets can be split like that. But if it is possible, how big can the chunks that are processed in parallel be? Would a rough estimate like TotalMemory/NumberOfCores make usable guidance? If the dataset cannot be split, is there another way to load the CPU at a higher rate with one instance, and would that really be helpful? |
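The manual split-and-run approach from the post above can be scripted. The file name, chunk count, and `./process` program here are illustrative stand-ins, not anything from the original thread:

```shell
# Split the input into 5 roughly equal pieces without breaking lines
# (GNU split; l/5 gives each chunk whole lines only).
split -n l/5 big_dataset.txt chunk_

# Launch one background instance per chunk, then wait for all of
# them to finish before using the results.
for f in chunk_*; do
    ./process "$f" > "$f.out" &
done
wait
```

This only works when records in the dataset are independent, which is exactly the caveat raised above.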
I'm still curious why grep found lines from 16 processors in your /proc/cpuinfo when you have clearly described a system with 8 cores and without hyperthreading.
Please post the full /proc/cpuinfo so we can see what is really there. Quote:
|
Quote:
Code:
[yaximik@G5NNJN1 ~]$ cat /proc/cpuinfo |
Quote:
Anyway, it is absolutely clear that you have hyperthreading enabled, so that each of your 8 cores pretends to be two, for a total of 16. All the online info I found says the E5420 does not have hyperthreading. Maybe that online info is wrong. Maybe you don't have E5420 CPUs. But my interpretation of the /proc/cpuinfo data you just posted is not wrong: you posted info listing 8 real cores doubled by hyperthreading into 16 apparent processors.

Depending on the mix of work you run on that computer, you would probably get slightly better performance if you rebooted into the BIOS menu, found the BIOS option for hyperthreading, and turned it off. But for other workloads, turning hyperthreading off would reduce the total work the system can do.

I do a LOT of compiling very large projects with single threaded compilers launched by a build system that is very flexible about running multiple compilers in parallel. On a system with 8 real cores, a hyperthreading option, and enough RAM for 16 compiles at once, I have found:
1) Running 16 compiles in parallel with hyperthreading disabled is slightly better throughput than running 8 compiles in parallel with hyperthreading disabled.
2) Running 16 compiles in parallel with hyperthreading enabled is slightly better throughput than running 16 compiles in parallel with hyperthreading disabled.

But I also do some sophisticated large simulations on the same hardware (performance dominated by cache misses) with configurable thread count. With or without hyperthreading enabled, selecting 2 threads gives me better performance than selecting 1 or selecting more than 2. When selecting 2 threads, the performance is slightly better if hyperthreading was disabled than if it was enabled. That pattern will not be true of large simulation activities in general. But it is a common pattern. It matches what many other people have seen with other simulation jobs.
That performance behavior of multiple single threaded compilers in building very large projects is more general. It is true across many different compilers, across many different projects, across different build systems and across Windows vs. Linux. If you have a compiler that internally makes good use of multi-threading, performance issues may be very different. But for heavy use of single threaded compilers, those performance effects are quite reproducible, including the benefits of hyperthreading. |
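Whether hyperthreading is doubling the apparent processor count can be read straight from /proc/cpuinfo: if "siblings" is twice "cpu cores" for a package, each core presents two hardware threads. A quick check (lscpu comes with util-linux and may be absent on very old distributions):

```shell
# 'cpu cores' = real cores per package, 'siblings' = logical CPUs
# per package; identical values mean hyperthreading is off.
grep -E 'cpu cores|siblings' /proc/cpuinfo | sort -u

# lscpu summarizes the same topology more directly:
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'
```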
Intel® Xeon® Processor E5620 http://ark.intel.com/products/47925/...-GTs-Intel-QPI >>> # of Threads : 8 |
Edit: Oops! I had a stupid post here because of browser issues that didn't let me see part of the /proc/cpuinfo posted above.
I saw Code:
model name : Intel(R) Xeon(R) CPU
instead of Code:
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
I looked at some Dell T610 documentation when writing post #5. If I had looked at the E5420 documentation more closely, I would have realized it is based on a totally different relationship between the CPU and RAM than that in the T610, so I could be sure there was no E5420 in a T610 even without seeing the list of compatible processors that I saw later. |
The OP's /proc/cpuinfo shows the E5620 as the processor model, so I'd imagine that's the processor he really has.
|
In the above discussion, I left out the occasional situation in which it is most important to turn hyperthreading off:
Some basic libraries of support software for large numerical algorithms are internally multithreaded and configure themselves automatically based on the apparent number of processors. You could be using such software with a problem whose performance is dominated by cache misses. In that case, use of 16 threads via hyperthreading would be horribly worse than using 8 threads. If you are in control of the number of threads in such a situation, you could select 8 (or fewer) threads even though the system seems to have 16 processors, and that reduction to 8 threads from 16 would have almost as much benefit with hyperthreading left on as it would with hyperthreading off. But with some programs you are stuck with the automatic configuration, and so leaving hyperthreading on may devastate the total throughput.

If the OS is aware of hyperthreading, it will use one thread of every real core before using the second thread of any core. The hardware is designed so that when one thread of a core is unused, the other thread runs almost as fast as the undivided core would have run with hyperthreading disabled. To a first approximation, using both threads of a hyperthreaded core makes each of them half as fast as the undivided core. So hyperthreading gives you twice as many processing units, each half as fast.

But when your code stalls a lot on things like mispredicted branches (as many compilers tend to do), each thread will be much better than 50% as fast as an undivided core, so total throughput is improved by hyperthreading. The same is true if the two processes tend to have very different kinds of stalls from each other, such as one stalling on something like excess floating point divides and sqrts while the other isn't using floating point at all. But if both threads are stalling mainly on cache misses, then each thread will be much worse than 50% of an undivided core.

There is hardly any limit on how much worse, because the raw cache miss rate is increased in addition to the two threads contending for the same resource. Each thread could easily be slower than 10% the speed of an undivided core. |
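When a library sizes its thread pool from the apparent processor count, the count can usually be capped without touching the BIOS. These are common mechanisms, not specific to any program in this thread; `./solver` is a hypothetical example binary:

```shell
# OpenMP-based numeric libraries honor this environment variable:
OMP_NUM_THREADS=8 ./solver input.dat

# Or pin the process to one hardware thread per core; the CPU list
# 0-7 is an example -- check your topology with lscpu first:
taskset -c 0-7 ./solver input.dat

# On newer kernels SMT can also be toggled at runtime (as root),
# without a reboot into the BIOS:
# echo off > /sys/devices/system/cpu/smt/control
```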