Old 04-04-2012, 11:13 AM   #1
chcarver
LQ Newbie
 
Registered: Nov 2007
Posts: 5

Rep: Reputation: 0
msync() and NFS aren't working as I thought


I'm having difficulties sharing information between two processes that have mmap()'ed a file on an NFS file system.

Scenario:
Two processes run in parallel, updating the same file. One process increments the value in the file when it's even. The other process updates the value when it's odd. So they basically toggle each other back and forth. This is a proof of concept; the code has no real business value.

This works perfectly if the processes are using a local filesystem like EXT2 or EXT3. However, that requires both processes to be on the same system. Also (and this is important), it works over NFS if both processes are on the same system.

Problem:
When the processes are on different systems and the file is on an NFS share mounted by both systems, it no longer works.

The Linux NFS FAQ (http://nfs.sourceforge.net/) says to use msync() with the MS_SYNC option to ensure a write happens. And it does, but the mmap()'ed portion of the file is NOT updated. I use the mmap() option MAP_SHARED to ensure writes from one process update the contents seen by other processes with the same mapped contents.

I have confirmed that the contents of the file are updated by the msync() call, even when checked from the opposite system from the one where the file was last updated. The process just isn't being made aware of the change.

There seems to be a missing link between the file being updated on the filesystem and the process's user-space mapping being updated; the gap appears with NFS but not with a local filesystem.

I'll take any ideas, suggestions, or answers.

Thing1
Code:
/*
 ============================================================================
 Name        : Thing1.c
 ============================================================================
 */

#include <stdio.h>
#include <stdlib.h>

#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <limits.h>
#include <stdint.h>
#include <inttypes.h>

#define handle_error(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

int
main(int argc, char *argv[])
{
  char *addr;
  int fd;
  volatile uint32_t *inc;  /* volatile: the spin loops must re-read memory */

  if (argc < 2)
  {
    fprintf(stderr, "Usage: %s FILE\n", argv[0]);
    exit(EXIT_FAILURE);
  }

  fd = open(argv[1], O_RDWR | O_SYNC);
  if (fd == -1)
      handle_error("open");

  addr = mmap(NULL, sizeof(uint32_t), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (addr == MAP_FAILED)
      handle_error("mmap");

  inc = (volatile uint32_t *)addr;

  while( *inc < UINT32_MAX )
  {
    while( ( *inc % 2 ) == 0 )  /* spin while even: this process increments odd values */
        ;
    if( *inc < UINT32_MAX )
    {
      printf( "value = %" PRIu32 "\n", *inc );
      *inc += 1;
    }
    if( msync( addr, sizeof(uint32_t), MS_SYNC ) == -1 )
        handle_error("msync");
  }
  exit(EXIT_SUCCESS);
} /* main */
Thing2
Code:
/*
 ============================================================================
 Name        : Thing2.c
 ============================================================================
 */

#include <stdio.h>
#include <stdlib.h>

#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <limits.h>
#include <stdint.h>
#include <inttypes.h>

#define handle_error(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

int
main(int argc, char *argv[])
{
  char *addr;
  int fd;
  volatile uint32_t *inc;  /* volatile: the spin loops must re-read memory */

  if (argc < 2)
  {
    fprintf(stderr, "Usage: %s FILE\n", argv[0]);
    exit(EXIT_FAILURE);
  }

  fd = open(argv[1], O_RDWR);
  if (fd == -1)
      handle_error("open");

  addr = mmap(NULL, sizeof(uint32_t), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (addr == MAP_FAILED)
      handle_error("mmap");

  inc = (volatile uint32_t *)addr;

  while( *inc < UINT32_MAX )
  {
    while( ( *inc % 2 ) != 0 )  /* spin while odd: this process increments even values */
        ;
    if( *inc < UINT32_MAX )
    {
      printf( "value = %" PRIu32 "\n", *inc );
      *inc += 1;
    }
    if( msync( addr, sizeof(uint32_t), MS_SYNC ) == -1 )
        handle_error("msync");
  }
  exit(EXIT_SUCCESS);
} /* main */
 
Old 04-04-2012, 09:23 PM   #2
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by chcarver View Post
I have confirmed that the contents of the file are updated by the msync() call, even when checked from the opposite system from the one where the file was last updated. The process just isn't being made aware of the change.
In other words, the underlying file is modified when msync() is called, but the page cache on the reading client is not updated, and therefore the other client, which has the same file mapped, does not see the updated contents in its map (unless you call msync() on the reader, too).

I don't have a suitable setup to test right now, but I think using MS_SYNC|MS_INVALIDATE tells the NFS server to tell all other processes having the file mapped to invalidate their caches. After invalidating their caches, the first read of the affected pages will cause the pages to be read from the NFS server, at which point they will see the updated data.
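
On the reader's side that would look something like the following. A minimal sketch only, assuming the addr and inc from your Thing1.c; whether MS_INVALIDATE actually crosses NFS client boundaries is exactly what needs testing.
Code:
/* Sketch: discard cached pages for this mapping so that the next
 * access re-fetches them. Assumes addr/inc from Thing1.c above.
 * Whether the invalidation propagates between NFS clients is the
 * open question. */
if (msync(addr, sizeof(uint32_t), MS_SYNC | MS_INVALIDATE) == -1)
    perror("msync");
/* ... now re-read *inc ... */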

Other possibilities are
  • invalidate the local mapping every time before reading it, or
  • explicitly read the bytes from the file using e.g. pread() or lseek()+read().
    (I am not certain if the mapping gets updated in this case, however; you should check. See the sketch below.)
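
For the second possibility, a minimal sketch (assuming the fd and inc from your Thing1.c, which also has the needed unistd.h include) that compares the file contents against the mapping:
Code:
/* Sketch: read the counter straight from the file with pread(),
 * bypassing the mapping, then compare it against the mapped value.
 * Assumes fd and inc from Thing1.c above. */
uint32_t value;
if (pread(fd, &value, sizeof value, 0) != (ssize_t)sizeof value)
    perror("pread");
else if (value != *inc)
    printf("file holds %u, mapping still shows %u\n",
           (unsigned)value, (unsigned)*inc);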

I have grave concerns with the scheme you are using.

All updates to a memory-mapped file (through the memory map) are page granular. Effectively, each page is an atomic unit which gets updated, or not. A surprise MS_INVALIDATE made by another process has to be considered a write in this case, because it reverts any local modifications a process has made to the affected page, back to the file contents on the NFS server. It is very difficult to detect when that happens. (I think all reliable schemes involve a local copy of the updated data, and two separate msync() calls per update -- very inefficient and slow.) Because of this, I do not believe your scheme, as it stands, can really work over NFS.

Because of the page cache, memory-mapping a remote file for reading will always "suffer" from cache effects, because the mappings cannot be physically shared. In most real-world use cases this is beneficial rather than a problem: the reader is satisfied with a snapshot of the file. While I personally use memory maps a lot, I only use them with local files.

In all, I hope you don't find this too frustrating. NFS and remote filesystems are tricky beasts, and they are often configured wrong. (For NFS, the lock daemon is often misconfigured or disabled, in which case advisory file locks do not work across hosts; they're local to each host. This is very common with web hosting services.) That memory maps shared between remote hosts over NFS turn out to be a very complicated situation is par for the course.

Fortunately, it turns out that socket-based message passing is a lot more efficient for data exchange between remote processes. If you don't want to roll your own, use MPI. If your messages are self-contained and do not need strict ordering, you can use UDP sockets for maximum performance. If you have a very large dataset, promote one process on one host to a master or data broker which maps the dataset into memory, then exchanges the data between itself and the remote processes via sockets.
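
For the toy counter above, the exchange could be as small as one UDP datagram per update. A minimal sketch of the sending side only; the peer host and port are placeholders, and a real version needs a receive loop and error recovery:
Code:
/* Sketch: send the counter to the peer host as a single UDP datagram
 * instead of sharing it via an NFS-backed mapping. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static int send_counter(const char *peer_host, uint16_t peer_port,
                        uint32_t value)
{
    struct sockaddr_in peer;
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s == -1)
        return -1;

    memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(peer_port);
    if (inet_pton(AF_INET, peer_host, &peer.sin_addr) != 1) {
        close(s);
        return -1;
    }

    uint32_t wire = htonl(value);   /* network byte order */
    ssize_t  sent = sendto(s, &wire, sizeof wire, 0,
                           (struct sockaddr *)&peer, sizeof peer);
    close(s);
    return (sent == (ssize_t)sizeof wire) ? 0 : -1;
}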

In molecular dynamics simulations, large systems are spread between multiple hosts (computing nodes in a cluster, usually) connected via a network (usually MPI over InfiniBand). There, each process "owns" its own region of the simulated system, and communicates with the processes owning the neighbouring regions to decide how the boundary area is handled. Usually the regions are spatially divided, so there is no ambiguity. If your use case has similar properties, you really should not try to map all the data in all the processes; let each process worry only about its own task, making sure only that the boundary regions are handled correctly. For example, each process could have a separate file containing its data, while also communicating with any or all of its peers about issues near the boundaries.

Hope this helps,
 
Old 04-05-2012, 12:09 PM   #3
chcarver
LQ Newbie
 
Registered: Nov 2007
Posts: 5

Original Poster
Rep: Reputation: 0
Thank you for such a well thought out response. I appreciate that greatly.

I tried the MS_INVALIDATE flag and had no success. Try as I might, the process's user-space mmap()'ed region isn't being updated to reflect the file change. My next course of action is to try inotify to detect when the file has been changed. I'll see if that translates over NFS.
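
The experiment I have in mind looks roughly like this (a sketch; whether NFS delivers IN_MODIFY events for changes made on another host is exactly what I want to find out):
Code:
/* Sketch: block until the mapped file is modified, then re-check the
 * mapping. handle_error() and the other includes are as in Thing1.c.
 * Whether these events fire for writes made on another NFS client is
 * the question. */
#include <sys/inotify.h>
#include <unistd.h>

int in_fd = inotify_init();
if (in_fd == -1)
    handle_error("inotify_init");
if (inotify_add_watch(in_fd, argv[1], IN_MODIFY) == -1)
    handle_error("inotify_add_watch");

char evbuf[4096];
if (read(in_fd, evbuf, sizeof evbuf) == -1)  /* blocks until an event */
    handle_error("read");
/* ... then re-read *inc to see if the mapping caught up ... */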

What the code isn't showing is the use of semaphores in the file, one per mmap() page. (I can't post the 'real' code online, just the important bits I'm referring to.) So everything was fine until it was decided to offload larger and more resource-intensive processing between multiple systems. There is a lot of cross-sharing between the processes as they chew on data and then check on what the other processes are doing. (It's a bit like wolves all sharing a carcass, tearing into it.) Dirty memory is something that was factored in long ago and coded for. And so far mmap() has worked flawlessly on a single system.

I can make a proxy bridge, using a message transport as you suggest, that mmap()s files between systems. A bit more code, but doable without adding complexity or overhead to the existing code.

I'm going to mark this post as resolved, as I do not see any direct solution involving open(), mmap(), or msync(). Thank you so much for confirming my suspicions.
 
Old 04-06-2012, 04:18 AM   #4
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by chcarver View Post
I tried the MS_INVALIDATE flag and had no success.
I was afraid that might be the case. Apparently, cache invalidation is local to one host only.

Quote:
Originally Posted by chcarver View Post
I do not see any direct solution involving open(), mmap(), or msync().
Neither do I.

Quote:
Originally Posted by chcarver View Post
What the code isn't showing is the use of semaphores in the file, one per mmap() page.
You could replace the semaphores with one master page broker (containing a full copy of the data set), and one page broker per host. You won't need NFS for this at all, since the shared mappings on each host are synchronized by the page brokers.

In short, each local page broker maintains, for each page it owns, a list of outstanding local requests. The master page broker has the full data set, a list of local page broker requests for each page, and the last local page broker owner of each free page.

When a client starts, it connects to the local page broker (using e.g. a unix domain datagram socket), and asks for the mapping. You can either use a local file, or transfer the file descriptor using an SCM_RIGHTS ancillary message.
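
The descriptor transfer is the standard SCM_RIGHTS dance. A minimal sketch, with the socket setup omitted and the function name my own invention:
Code:
/* Sketch: pass an open file descriptor to a connected peer over a
 * unix domain socket via an SCM_RIGHTS ancillary message. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int sock, int fd)
{
    char dummy = 0;  /* at least one byte of real data must go along */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg;

    memset(&u, 0, sizeof u);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov        = &iov;
    msg.msg_iovlen     = 1;
    msg.msg_control    = u.buf;
    msg.msg_controllen = sizeof u.buf;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return (sendmsg(sock, &msg, 0) == 1) ? 0 : -1;
}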

Instead of waiting on a semaphore, the clients request the ownership of each page from the local page broker. (Correspondingly, instead of posting a semaphore, they just tell the page broker they're done with the page.)

Each page broker may serve all outstanding local requests for a page before releasing it. This breaks the strict ordering you would get with semaphores, but it drastically cuts the network transfers for contended pages, giving a huge speed boost.

When a local page broker first gets a request for a page, it requests it from the master broker. The master broker may either tell the requester to go ahead with the copy it has, or send the current contents of the page. Once it has no outstanding requests for the page, the local page broker sends the modified page back to the master. (This way the master always retains a full copy of the data, instead of it being scattered across the hosts.)
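
The wire protocol itself can be a single fixed-size header per message. A sketch only; the names are mine, purely illustrative:
Code:
/* Sketch of a fixed-size broker message header; illustrative names. */
#include <stdint.h>

enum msg_type {
    MSG_REQUEST = 1,  /* ask for ownership of a page                */
    MSG_GRANT   = 2,  /* ownership granted; page data may follow    */
    MSG_RELEASE = 3,  /* owner is done; modified page data follows  */
};

struct broker_msg {
    uint32_t type;         /* one of enum msg_type                  */
    uint32_t page;         /* page index within the data set        */
    uint32_t payload_len;  /* 0, or the page size                   */
    /* payload_len bytes of page data follow on the wire            */
};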

One might think that sending the page contents is slower than arranging for peers to transfer the page on request, but that turns out to be true only for huge pages, and usually false for 4096-byte pages. Most transfers are limited not by bandwidth but by packet count. Each transfer has a significant latency, i.e. takes a significant amount of time, whereas sending a larger packet takes only very slightly longer. In practice it is best to minimize the number of transfers or messages, even if some or even most of the packets contain superfluous data.
 
  

