LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   hwto bypass CPU memory cache? (https://www.linuxquestions.org/questions/programming-9/hwto-bypass-cpu-memory-cache-894820/)

conconga 07-31-2011 11:12 PM

How to bypass CPU memory cache?
 
Hi! I have an application which creates several threads that run on different cores of a CPU (set via affinity). All threads use data from a common variable (I'll call it 'V').

AFAIK, when one of these threads changes the contents of V, the change is first stored in the cache (L1?) of the core where that thread runs. If another thread demands a read, the data is moved from that cache to RAM, and then into the cache of the other core.

Is there a way to bypass the cache mechanism and write the contents of V directly to RAM, speeding up the threads' performance? Is 'mmap' a solution?
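
To be concrete, the setup looks roughly like the sketch below (a simplified illustration, not my real code; the core count and the work loop are made up):

Code:

// Simplified illustration of the setup described above (not the real application).
// Each thread is pinned to one core; all of them update one shared variable V.
// Build with: g++ -O2 -pthread shared_v.cpp
#include <atomic>
#include <cstdio>
#include <pthread.h>   // pthread_setaffinity_np (GNU extension, enabled by default under g++)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <thread>
#include <vector>

static std::atomic<long> V(0);          // the shared variable every thread touches

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

static void worker(int core)
{
    pin_to_core(core);
    for (long i = 0; i < 10000000L; ++i)
        V.fetch_add(1);                 // every write drags V's cache line to this core
}

int main()
{
    const int cores = 4;                // assumed core count, for illustration only
    std::vector<std::thread> threads;
    for (int c = 0; c < cores; ++c)
        threads.emplace_back(worker, c);
    for (std::thread &t : threads)
        t.join();
    std::printf("V = %ld\n", V.load());
    return 0;
}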

Thanks in advance!

Nominal Animal 08-01-2011 12:32 AM

In my experience, it does not happen that way.

When V is hot, it is in the L1 cache of one of the cores. In my experience there is not that much difference between the other cache levels and being completely cold (but that might be because the CPUs I use share those other caches between cores).

When V needs to be accessed by more than one core, problems ensue. You will get cacheline ping-pong (where the cores contend for ownership of the cache line) -- cache lines being the native 16-, 32-, 64- or 128-byte chunks the CPU uses in its L1 cache. Even if you manage to evict it from the cache -- I'm unsure how that could be done reliably without evicting everything -- you will likely just slow your program down, because CPU cores can share cachelines much more efficiently than they can read from RAM; they have specialized interconnects for that sort of stuff.

The correct solution is to modify your program logic, so that each core has its own private copy of the variable(s). (The speedup this brings is sometimes radical. This is also why certain benchmarks imply that using processes instead of threads is faster: processes have private memory areas, whereas naive thread implementations share variables that really should be thread-local, causing cacheline ping-pong et cetera, so threaded implementations sometimes end up slower than process-based ones. The real lesson is simply that each CPU core or thread should have its own copy of the hot data it uses.)

Usually you can just synchronize the thread-local values now and then, especially if they are, for example, flags. If you find that different threads have to access a variable very often, with each update visible to all threads, you have most likely divided the work inefficiently among the threads. Some other division of work is likely to perform much better.
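
A minimal sketch of what I mean, with hypothetical names and counts (not tied to your program): each thread accumulates into a private local value and only adds it to the shared total now and then.

Code:

// Rough sketch of the idea above (hypothetical; not tied to any particular program).
// Each thread works on a private local value and only synchronizes with the shared
// total now and then, so the shared cache line is touched rarely.
// Build with: g++ -O2 -pthread private_copy.cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

static std::atomic<long> shared_total(0);    // shared, but touched only occasionally

static void worker(long iterations)
{
    long local = 0;                          // private: stays in this core's cache
    for (long i = 0; i < iterations; ++i) {
        ++local;                             // the hot work uses only the local copy
        if ((i & 0xFFFFFL) == 0xFFFFFL) {    // synchronize "now and then"
            shared_total.fetch_add(local);
            local = 0;
        }
    }
    shared_total.fetch_add(local);           // final synchronization
}

int main()
{
    const int nthreads = 4;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back(worker, 10000000L);
    for (std::thread &t : threads)
        t.join();
    std::printf("total = %ld\n", shared_total.load());
    return 0;
}

The hot loop never touches shared state, so each core keeps its working data in its own cache; the shared cache line is only pulled over at the occasional synchronization points.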

syg00 08-01-2011 12:49 AM

So you want to toss out the collective wisdom of thousands (?) of system designers and testers.
Wrong!!!
Quote:

Originally Posted by Nominal Animal (Post 4430365)
The correct solution is to modify your program logic,

Right!!!

bigearsbilly 08-01-2011 02:10 AM

threads, are not, the answer.

As for messing about with the CPU's memory cache: give yourself enough rope.

Even if you do bypass the cache, you've still only got a single memory bus, so you still have a bottleneck (and the bottleneck is usually in I/O anyway). All these super-duper threads will still have to queue up to use the variable. That's why CPUs have a cache.

Threads really aren't worth the hassle in my experience.
All you will do is make your program unintelligible, untestable and unmaintainable.

conconga 08-04-2011 10:35 PM

Hi Nominal Animal, thanks for the answer. The threads are a must for this application, but that is neither the crux nor the goal of this post. I still have doubts about the cache, cacheline and prefetch behavior of the hardware and the compiler. I ran a test which shows different performance between standard allocation (new/delete[]) and mmap.

I have a quad-core here running this: each core has three threads accessing a common buffer. Each thread also keeps in the buffer a counter of how many loops it has done accessing the buffer. In one case the buffer was allocated by mmap, and in the other by standard new/delete[]. There is no dynamic allocation during the run; it happens only once at the beginning.

After 3 seconds the application stops and the counters are summed up. With mmap there were 24 million accesses, but with new/delete[] only 20 million. What happened here? Cache write-through?
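
The test is roughly of this shape (heavily simplified; the names, constants and work loop here are illustrative, not the real code):

Code:

// Roughly the shape of the test described above (heavily simplified; the constants,
// names and work loop are illustrative, not the real application).
// Build with: g++ -O2 -pthread counters.cpp        (add -DUSE_MMAP for the mmap case)
#include <atomic>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>
#include <thread>
#include <unistd.h>
#include <vector>

static const int kThreads = 12;              // 3 threads per core on a quad-core
static std::atomic<int> g_stop(0);           // set after ~3 seconds

static void worker(volatile long *counter)
{
    // (Pinning three threads to each core with pthread_setaffinity_np is omitted here.)
    while (!g_stop.load())
        ++*counter;                          // each thread bumps its own slot in the buffer
}

int main()
{
    const size_t bytes = kThreads * sizeof(long);

#ifdef USE_MMAP
    // Page-aligned buffer straight from the kernel.
    long *buf = (long *) mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
#else
    // Heap buffer from operator new[]; usually only 8- or 16-byte aligned.
    long *buf = new long[kThreads];
#endif
    std::memset(buf, 0, bytes);

    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; ++t)
        threads.emplace_back(worker, &buf[t]);

    sleep(3);                                // let the counters run for about 3 seconds
    g_stop.store(1);
    for (std::thread &t : threads)
        t.join();

    long total = 0;
    for (int t = 0; t < kThreads; ++t)
        total += buf[t];
    std::printf("total loop iterations: %ld\n", total);
    return 0;
}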

Nominal Animal 08-05-2011 02:00 AM

Quote:

Originally Posted by conconga (Post 4434163)
Hi Nominal Animal, thanks for the answer. The threads are a must of this application.

Of course. I did not mean you should abandon threads. I just pointed out the performance issues that hurt a lot of threaded code.

Quote:

Originally Posted by conconga (Post 4434163)
After 3 seconds the application stops and the counters are summed up. With mmap there were 24 million accesses, but with new/delete[] only 20 million. What happened here? Cache write-through?

No. If you trace your program, you will see it uses mmap() in both cases. I mean, new and malloc() internally use mmap() anyway.

What you are seeing is a cacheline alignment effect. mmap() gives you a page-aligned pointer. malloc() and new store internal library data, and return a pointer usually only aligned to 8 or 16 bytes.
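
You can check this yourself with something along these lines (illustrative only, not from your program; 16 longs is an arbitrary size):

Code:

// Quick check of the alignment difference between new[] and mmap() (illustrative only).
#include <cstdio>
#include <stdint.h>
#include <sys/mman.h>

int main()
{
    long *heap = new long[16];
    long *page = (long *) mmap(NULL, 16 * sizeof(long), PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // The low bits of each address show the alignment actually obtained.
    std::printf("new[] pointer: %p  (offset within a 64-byte line: %lu)\n",
                (void *) heap, (unsigned long) ((uintptr_t) heap % 64));
    std::printf("mmap  pointer: %p  (offset within a 4096-byte page: %lu)\n",
                (void *) page, (unsigned long) ((uintptr_t) page % 4096));

    munmap(page, 16 * sizeof(long));
    delete[] heap;
    return 0;
}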

In this specific case, you should make sure each counter is on a separate cache line. The counters are very hot, but private to each thread (at least most of the time). If the counters are on separate cache lines, each thread gets to keep its own counter in the cache of the CPU core it is running on. If a single cache line holds more than one counter, or a counter plus data accessed by another thread, that line will bounce between cores, causing the performance degradation you see.

If you can arrange your data so that data accessed (mostly) by one thread is on a separate cache line from data accessed by any other thread, you will see a speedup. The 20% difference you measured is not that large, so I suspect your program still suffers heavily from cache lines bouncing between cores in both cases.
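
For example, something like the sketch below, assuming 64-byte cache lines (the struct and array names are made up):

Code:

// Sketch of keeping each per-thread counter on its own cache line
// (assumes 64-byte cache lines, which is typical on current x86 CPUs).
#include <cstdio>

#define CACHELINE 64

struct PaddedCounter {
    volatile long value;                     // the hot per-thread counter
    char pad[CACHELINE - sizeof(long)];      // padding so neighbours never share the line
} __attribute__((aligned(CACHELINE)));       // GCC extension: align each element to a line

// One slot per thread; each slot now occupies a full cache line of its own.
static PaddedCounter counters[12];

int main()
{
    long total = 0;
    for (int t = 0; t < 12; ++t)
        total += counters[t].value;          // summing the counters as in your test
    std::printf("sizeof(PaddedCounter) = %zu, total = %ld\n",
                sizeof(PaddedCounter), total);
    return 0;
}

With a layout like that, each thread's hot counter lives on its own line, so its writes never invalidate the line another core is using for its own counter.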

