Old 05-13-2011, 01:54 PM   #1
mbrose1994
LQ Newbie
 
Registered: May 2010
Location: Huntsville, AL
Distribution: Fedora, RHEL 5
Posts: 16

Rep: Reputation: 0
Critical code needs atomic execution


I need to ensure that the "writer" to a named shared memory segment (small, 160 bytes) completes its update in such a way that the many "readers" always get the latest data.

The machine is a multi-processor (8 CPUs, 24 GB RAM), non-real-time system. The system runs multiple processes that are themselves multi-threaded. Thankfully, there is only one "writer" and many "readers". There is no semaphore or mutex locking. "Readers" must not block each other, and especially must not block the "writer".

By design, it is expected that from time to time a "reader" will be in the middle of reading when the "writer" begins to update the data.

I want to protect against the case where the "writer" is interrupted and a "reader" completes its read before the "writer" wakes up and completes its changes. In that case the "reader" would get corrupt data, some new and some old.

Any ideas or experiences addressing this problem will be appreciated.
 
Old 05-13-2011, 07:29 PM   #2
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Use the atomic builtins that recent C compilers provide -- I know at least GCC and the Intel compiler (ICC) have them. These allow the writer to modify one-, two-, four-, or eight-byte values atomically.
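For example, a minimal sketch (the counter name is mine, not from any particular API):

Code:
#include <stdint.h>

static volatile uint32_t counter;  /* lives in the shared memory segment */

/* Atomically add one to the counter. In GCC this builtin also
   acts as a full memory barrier. Returns the new value. */
uint32_t bump_counter(void)
{
    return __sync_add_and_fetch(&counter, 1);
}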

If you need to provide larger structures atomically, add two words, a mutex, and a condition variable to the shared memory segment. The two words are generation counters; the mutex and the condition variable are for reader wakeup.

When the writer starts a new generation:
  1. Increment the first generation counter using __sync_add_and_fetch(&generation1, 1).
  2. Update the data normally. There is no need to use atomic ops here.
  3. Increment the second generation counter using __sync_add_and_fetch(&generation2, 1).
  4. Broadcast on the condition variable using pthread_cond_broadcast(&cond).
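
A minimal sketch of the writer side, assuming the counters, mutex, and condition variable all live in the shared segment and the mutex and condition variable were initialized with the PTHREAD_PROCESS_SHARED attribute (the struct layout and names are illustrative):

Code:
#include <pthread.h>
#include <string.h>

struct shared {
    volatile unsigned int generation1;  /* incremented before each update */
    volatile unsigned int generation2;  /* incremented after each update  */
    pthread_mutex_t mutex;              /* for reader wakeup only         */
    pthread_cond_t  cond;               /* for reader wakeup only         */
    char data[160];
};

void writer_update(struct shared *s, const char newdata[160])
{
    __sync_add_and_fetch(&s->generation1, 1);  /* step 1: atomic + barrier */
    memcpy(s->data, newdata, sizeof s->data);  /* step 2: plain writes     */
    __sync_add_and_fetch(&s->generation2, 1);  /* step 3: atomic + barrier */
    pthread_cond_broadcast(&s->cond);          /* step 4: wake any readers */
}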

When a reader wants a snapshot of the data:
  1. Read the generation counters into two temporary variables.
    If they differ, take the mutex via pthread_mutex_lock(&mutex), wait on the condition variable using pthread_cond_timedwait(&cond, &mutex, &timeout), and restart from the beginning.
  2. Copy the data (or the needed parts of it) to local storage.
    There is no need for any kind of atomic operations here.
  3. Read the generation counters into two further temporary variables.
    If all four match, the reader has a valid snapshot and is done.
    If the latter pair matches but differs from the first pair, the reader can swap the pairs and continue from step 2.
    Otherwise the reader must discard the data, take the mutex via pthread_mutex_lock(&mutex), wait on the condition variable using pthread_cond_timedwait(&cond, &mutex, &timeout), and restart from the beginning.
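
And the corresponding reader side, continuing the sketch above (restarting from the top after a completed write is equivalent to swapping the pairs; the give-up counter is omitted):

Code:
#include <time.h>

int reader_snapshot(struct shared *s, char out[160], const struct timespec *wait)
{
    unsigned int g1a, g2a, g1b, g2b;
    struct timespec until;

    for (;;) {
        g1a = s->generation1;                      /* step 1 */
        g2a = s->generation2;
        if (g1a == g2a) {
            memcpy(out, s->data, sizeof s->data);  /* step 2 */
            g1b = s->generation1;                  /* step 3 */
            g2b = s->generation2;
            if (g1b == g1a && g2b == g2a)
                return 0;                          /* valid snapshot */
            if (g1b == g2b)
                continue;                          /* a write completed; re-copy */
        }
        /* A writer seems to be in progress: back off briefly. */
        clock_gettime(CLOCK_REALTIME, &until);
        until.tv_sec  += wait->tv_sec;
        until.tv_nsec += wait->tv_nsec;
        if (until.tv_nsec >= 1000000000L) {
            until.tv_nsec -= 1000000000L;
            until.tv_sec++;
        }
        pthread_mutex_lock(&s->mutex);
        pthread_cond_timedwait(&s->cond, &s->mutex, &until);
        pthread_mutex_unlock(&s->mutex);
    }
}

(pthread_cond_timedwait() returns ETIMEDOUT if the writer never signals; either way the reader simply retries.)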

The timeout should be relatively short. It is possible that the writer finishes the data generation between the time a reader reads the generation counters and starts waiting on the condition variable, or there may be no writer at all. In both cases it is best if the reader waits only a short time before retrying (or giving up after failing too many times in a row without any change in the generation counters). I'd let the user configure or tweak the timeout, but I'd default to 1 ms.

If you need example code, start a thread in the Programming section.
 
1 member found this post helpful.
Old 05-14-2011, 08:49 AM   #3
mbrose1994
LQ Newbie
 
Registered: May 2010
Location: Huntsville, AL
Distribution: Fedora, RHEL 5
Posts: 16

Original Poster
Rep: Reputation: 0
Thanks for your reply, Nominal Animal.

Yes, since the writer updates 160 bytes, I can't use atomic ops.

Your answer is in principle very close to what I am testing now. Let me describe it.

In shared memory, I added two unsigned int vars, writer_In and writer_Out. These are simple sequential counters. In is initialized to 0, Out to 1. This difference of 1 is the key.

Every write first increments In (by 1), then updates the data; the writer's last action is to increment Out (by 1). So a completed write leaves In and Out differing by 1. This gives every reader, no matter how many, and no matter how fast or slow they read, simple tests to verify that the read is good. And as a bonus, each reader can save the last In value it saw and short-circuit reads of data it already has.

Every reader first tests the shared memory In value against its saved In value. If they are the same, the reader can quit; its data is still current. Otherwise, it saves the In value and does the read. At the end of the read, the reader fetches the shared memory In value a second time. If it differs from the saved In, the writer has gotten in there and the reader must re-read. Otherwise the reader fetches the shared memory Out value and does the final test against the In value.

These cases are handled:
1. reader reading, writer starts:
reader sees In difference == 1, knows to re-read
2. writer writing, reader completes read:
reader sees In,Out difference == 0, knows to re-read
3. writer interrupted, reader completes:
reader sees In,Out difference == 0, knows to re-read
4. reader interrupted, writer completes:
reader sees In difference == 1, knows to re-read
5. reader reads x% data, interrupted, writer writes x+y%, interrupted:
reader sees In difference == 1, knows to re-read

Unless I have missed something, there is no need for locking, blocking, timeout waiting, or unlocking by any thread or separate process. My worst case is that a reader occasionally has to re-read.
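
In code, the reader side looks roughly like this (a simplified sketch; the names are illustrative and error handling is omitted):

Code:
#include <string.h>

struct shm_region {
    volatile unsigned int writer_In;   /* starts at 0 */
    volatile unsigned int writer_Out;  /* starts at 1 */
    char data[160];
};

/* Returns 1 if a fresh snapshot was copied, 0 if the saved data is still current. */
int read_snapshot(struct shm_region *shm, char local[160], unsigned int *saved_In)
{
    unsigned int in_before, in_after, out_after;

    if (shm->writer_In == *saved_In)
        return 0;                       /* short circuit: data unchanged  */
    do {
        in_before = shm->writer_In;     /* save In before the copy        */
        memcpy(local, shm->data, 160);
        in_after  = shm->writer_In;     /* fetch In again after the copy  */
        out_after = shm->writer_Out;    /* final test: Out - In must be 1 */
    } while (in_after != in_before || out_after - in_after != 1);
    *saved_In = in_after;
    return 1;
}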

Let me know what problems or issues you see with this methodology.
 
Old 05-14-2011, 11:06 AM   #4
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by mbrose1994 View Post
Every write first increments In (by 1), then updates the data; the writer's last action is to increment Out (by 1). So a completed write leaves In and Out differing by 1. This gives every reader, no matter how many, and no matter how fast or slow they read, simple tests to verify that the read is good. And as a bonus, each reader can save the last In value it saw and short-circuit reads of data it already has.
Note that you need a memory barrier after the first increment and before the last increment. Processors may do out-of-order writes; the barrier makes sure the counters and the data stay in sync in all cases. Otherwise a reader may see equal counters before all of the data writes have completed. __sync_add_and_fetch() not only does the increment atomically, it also provides a full memory barrier.

(For the Intel compiler, you should add a third parameter, a pointer to the 160-byte structure, so that ICC will make sure the barrier covers that data. GCC ignores the extra parameters and always does a full barrier.)
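
Concretely, your writer could use the builtin for both counter updates and get the barriers for free (a sketch reusing the shm_region layout sketched in your post above):

Code:
#include <string.h>

void shm_write(struct shm_region *shm, const char newdata[160])
{
    __sync_add_and_fetch(&shm->writer_In, 1);   /* increment + full barrier */
    memcpy(shm->data, newdata, 160);            /* plain data writes        */
    __sync_add_and_fetch(&shm->writer_Out, 1);  /* increment + full barrier */
}

Plain increments separated by __sync_synchronize() would give the same ordering; the builtin just gets you the atomicity and the barrier in one call.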

I'm also assuming your code always compares for equality, not for a difference of one. Whether the counters are equal or differ by one initially does not really matter; it just inverts the interpretation of the test.

Quote:
Originally Posted by mbrose1994 View Post
Every reader first tests the shared memory In value against its saved In value. If they are the same, the reader can quit; its data is still current. Otherwise, it saves the In value and does the read. At the end of the read, the reader fetches the shared memory In value a second time. If it differs from the saved In, the writer has gotten in there and the reader must re-read. Otherwise the reader fetches the shared memory Out value and does the final test against the In value.
This should work fine, except that it may (and under load, will) cause unnecessary cacheline ping-pong. If the writer is in progress while a reader starts, and the reader and the writer are on different CPU cores, the shared memory cache lines will start to bounce between the two. The more readers you have, the more likely this is to happen, and the longer the writer takes to update the data.

My suggestion tries very, very hard to avoid that. Readers initially check whether a writer is in progress and, if so, back off for a bit, allowing the writer to proceed without "stealing" the cache lines. The larger the shared data segment, the more important this becomes.

However, 160 bytes is not much at all. It is just twenty 64-bit copies, or forty 32-bit copies, never taking more than fifty clock cycles or so. For the ping-pong to matter, the writer must be updating very often, the readers must be doing a lot of reads (either frequent reads, or many readers), and there must be at least two CPU cores on the machine. The ping-pong probability increases linearly with the number of readers.

All in all, I think you are right: using the condition variable and having the readers wait on it to avoid the ping-pong is way overkill.

I would still add a reader mutex, so that only one reader is in progress at any time. This keeps the probability of cacheline ping-pong with the writer constant, regardless of the number of CPU cores or readers. Consider it future-proofing. The memory barriers I mentioned above are essential on multi-core machines. (On single-core machines the kernel does a memory barrier when scheduling processes.)

A couple of decades ago, Borland had a popular Pascal compiler, Turbo Pascal, whose runtime library contained a timing loop. When processors became fast enough, the scaling factor computed from that loop became zero. Because the library used it as a divisor, old code stopped working with a division-by-zero error. Not good. Also, 640k is enough for everyone, right?
 
1 member found this post helpful.
Old 05-16-2011, 10:56 AM   #5
mbrose1994
LQ Newbie
 
Registered: May 2010
Location: Huntsville, AL
Distribution: Fedora, RHEL 5
Posts: 16

Original Poster
Rep: Reputation: 0
Many thanks for your reply, Nominal Animal.

A memory barrier I had not thought of, nor cacheline ping-pong. Thanks for being thorough, for specifying a function that accomplishes the memory barrier, and for the compiler explanation.

I am concerned about the cacheline ping-pong. With 8 CPUs and probably 50 readers, I should expect the problem. However, I am equally concerned about the performance hit of blocked readers. Most readers are separate processes, some are child threads, and they absolutely cannot block.

So I am not sure how to approach this issue. Is there a way to force a sync of all the caches when the writer does its job? I know this would delay and/or interrupt active readers, but it would be a universal, systematic method that seems acceptable and safe.

I appreciate the future-proofing story. I enjoy learning more about the roots of our science. Our history may be short compared to other sciences, but it still adds important, needed depth and context to our problem solving today.
 
Old 05-16-2011, 05:11 PM   #6
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by mbrose1994 View Post
Many thanks for your reply, Nominal Animal.
You're more than welcome. I'm very happy to try to help anyone actually willing to do this right!

Quote:
Originally Posted by mbrose1994 View Post
I am concerned about the cacheline ping-pong. With 8 CPUs and probably 50 readers, I should expect the problem. However, I am equally concerned about the performance hit of blocked readers. Most readers are separate processes, some are child threads, and they absolutely cannot block.
Another option is to create a larger shared memory block, arranged as a number of 192- or 256-byte slots (for the 160 bytes plus the generation counters, with very good cacheline alignment), plus a common index naming the latest slot.

The writer would pick slots in round-robin fashion, modifying each slot the same way you're writing now, but also updating the index (to name that slot) when complete, using an atomic swap. The reader, of course, starts by looking up the current latest slot.

This should minimize any contention on the slot the writer is writing to. There should be enough slots that a reader has time to copy the data before the writer modifies that slot again. The reader must still use the generation counters to check that the data remained stable, though.

For extensibility, you could use an initial chunk of the shared memory to name not only the current latest slot, but also the size of each slot, the number of slots, and a data version (major and minor -- minor for additions, major for changes). That way you can also develop the readers separately.
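
A sketch of what I mean (the slot count, sizes, and names are illustrative; __sync_lock_test_and_set() is GCC's atomic-swap builtin, and the generation2 increment just before it already provides the full barrier):

Code:
#include <string.h>

#define NSLOTS   8
#define SLOTSIZE 192  /* 160 data bytes plus counters, padded for cacheline alignment */

struct slot {
    volatile unsigned int generation1;
    volatile unsigned int generation2;
    char data[160];
    char pad[SLOTSIZE - 2 * sizeof (unsigned int) - 160];
};

struct region {
    volatile unsigned int latest;  /* index of the newest complete slot */
    struct slot slot[NSLOTS];
};

void writer_publish(struct region *r, const char newdata[160])
{
    unsigned int next = (r->latest + 1) % NSLOTS;  /* round robin */
    struct slot *s = &r->slot[next];

    __sync_add_and_fetch(&s->generation1, 1);      /* update the slot as before */
    memcpy(s->data, newdata, sizeof s->data);
    __sync_add_and_fetch(&s->generation2, 1);
    __sync_lock_test_and_set(&r->latest, next);    /* atomic swap of the index  */
}

In a real layout the initial chunk (latest index, slot size, slot count, version) would be padded out to a full cacheline as well.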

But first, let me check whether opening the mapping read-only in the readers will help with the cacheline ping-pong. I'll report my findings in this same thread.

You probably should ask a moderator to move this thread to the Programming section, so you'll get better coverage.
 
1 member found this post helpful.
Old 05-16-2011, 09:12 PM   #7
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
The attached mmaptest.tar.bz2 contains a test writer and a test reader. Both take two parameters: the usleep duration (0 for no sleeping) and the loop count between outputs.

I ran some tests on a four-core Opteron 275 with busy looping (zero sleeps). (The sources are attached in case you're interested in testing on your machines. I consider the code to be in the public domain; feel free to use it any way you like. No guarantees from me, though.)

The typical update and copy time was consistently about 4µs, regardless of the number of readers. For ten readers, the average update and copy times were about 11µs. For a hundred readers, the average was about 110µs. (The machine was running a lot of other software at the same time.)

The average grows because processes get scheduled at longer intervals as the number of processes grows; the occasional "very slow" read or update happens because the CPU ran other processes in between. These delays get longer the more processes there are.

I did not see any indication of cacheline ping-pong. My readers do open and map the shared memory segment explicitly read-only, and my writer does explicit msync() calls. Either the CPU cores can cache the mapping simultaneously, or the kernel gives each reader process a private copy of the shared memory segment. In both cases the cached copies are invalidated (updated) at msync(), making sure the changes propagate to all processes. Therefore this scheme should work well regardless of the number of readers and cores. Specifically, expect a linear increase in average running time when the number of readers increases past the number of CPU cores.

As I only ran this on a 4-core machine, the above is far from reliable. It'd be better to go crawling in the kernel sources and check. Running the same test on a 16- or 24-core machine would tell a lot more.
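
For reference, the relevant pattern is the following (a minimal sketch, not the attached sources verbatim; the use of shm_open() is an assumption, as a file-backed mmap() works the same way):

Code:
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reader: open and map the segment explicitly read-only. */
void *map_readonly(const char *name, size_t length)
{
    int fd = shm_open(name, O_RDONLY, 0);
    if (fd == -1)
        return NULL;
    void *map = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    return (map == MAP_FAILED) ? NULL : map;
}

/* Writer: after updating the data, flush the mapping explicitly. */
int flush_segment(void *map, size_t length)
{
    return msync(map, length, MS_SYNC | MS_INVALIDATE);
}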
Attached Files
File Type: txt mmaptest.tar.bz2.txt (2.7 KB, 31 views)

Last edited by Nominal Animal; 05-16-2011 at 09:14 PM.
 
1 member found this post helpful.
Old 05-17-2011, 03:44 PM   #8
mbrose1994
LQ Newbie
 
Registered: May 2010
Location: Huntsville, AL
Distribution: Fedora, RHEL 5
Posts: 16

Original Poster
Rep: Reputation: 0
Thumbs up

Wow, excellent, and thank you. I will give it a look and a try.

And, I ended up using __sync_synchronize() for my memory barrier. I tried __sync_add_and_fetch(), but it did NOT work. The mb(), rmb(), wmb() and smp_* versions did not compile; I did not have the right headers on my system (they are kernel-internal macros). Since __sync_synchronize() seems to be working, this is not an issue at this time.

Meanwhile, I ran a slightly different trial with some decent results. I wanted to test how often re-reads occur, and how many re-read cycles might be needed.

The machine specs:
     OS: Linux 2.6.18 #1 SMP x86_64
    CPU: 8, Intel Xeon X5667 3.07GHz
  cache: 12288 KB
clflush: 64
clalign: 64
  cores: 4
    RAM: 24.68 Gig

Test cfg: 1 writer at 1000 Hz, run 12000 seconds
          40 readers, 29 at 1000 Hz, 11 at 15 Hz

Results so far:
writer: 3.67 million writes
  67 Hz readers (11): 70 re-reads, all got good data on the second read
1000 Hz readers (29): 2100 re-reads, 98% good on the second read
Interestingly, there were some rare, much longer re-read runs, for the 1000 Hz readers only:
some readers needed 50 or 60 re-reads, and 3 needed 101.

But no blow-outs, i.e., readers that could not get a good read.
The very occasional long re-read run is a mystery. They are quite spread out, not in each other's write-count neighborhood. The re-read does not sleep; it jumps back to the top immediately.

At this point I think I can declare success and lock this down as version 1.0.

Of course all comments are welcomed. Software can always be improved.
 
  

