The best approach depends heavily on your I/O access patterns.
If each thread does random accesses in its window, you might try using
mmap() instead of file I/O. You would use
msync() to commit changes to disk if some other program needs to see the changes in real time, and
madvise() to provide hints to the kernel if you know what data you'll need sometime in the future. Note that you cannot use stdio here; you'll need to use
open() (declared in <fcntl.h>) and the
<unistd.h> low-level I/O routines instead. stdio tends to be slow for this kind of workload anyway.
If the data files are small enough to fit in RAM, use mmap().
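For the random-access case, here is a minimal sketch of the mmap() approach. The helper names map_window() and unmap_window() are made up for illustration, and it assumes Linux, an existing file, and an offset that is a multiple of the page size:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map `length` bytes of `path`, starting at `offset`
 * (which must be a multiple of the page size), for shared read/write access.
 * Returns the mapping, or MAP_FAILED on error. */
void *map_window(const char *path, off_t offset, size_t length)
{
    int fd = open(path, O_RDWR);
    if (fd == -1)
        return MAP_FAILED;

    void *map = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (map == MAP_FAILED)
        return MAP_FAILED;

    /* Only a hint: tell the kernel to expect random accesses in this range.
     * Failure is harmless, so the result is ignored. */
    (void)madvise(map, length, MADV_RANDOM);
    return map;
}

/* Hypothetical helper: flush changes to the file, then drop the mapping.
 * Returns 0 on success, -1 if either step failed. */
int unmap_window(void *map, size_t length)
{
    int result = msync(map, length, MS_SYNC);
    if (munmap(map, length) == -1)
        result = -1;
    return result;
}
```

Each thread would call map_window() for its own window, work on the returned pointer like ordinary memory, and call unmap_window() when done. If other programs need to see changes while you are still working, call msync() on the affected range at those points as well.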
If the per-thread windows into the data are not aligned to page boundaries, you may see serious cacheline ping-pong in the boundary regions. The resulting delays can be surprisingly long on a machine with this many cores. You can use
sysconf(_SC_PAGESIZE) to query the page size at runtime.
If you cannot align the access windows, consider doing odd and even segments in two different passes (so that each segment worked on is surrounded by fallow segments). Or consider splitting the data into multiple separate files first, at the desired boundaries.
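As a rough sketch of the aligned case, this is one way to split a file into per-thread windows that all start on a page boundary. The function name window_bounds() and its parameters are hypothetical; only the last window may end off a page boundary, because it ends at the end of the file:

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: compute the byte range [*start, *end) of window
 * number `index` (0-based) out of `count` windows covering `file_size` bytes.
 * Every window starts on a page boundary. Returns 0 on success, -1 on bad
 * arguments. */
int window_bounds(off_t file_size, size_t count, size_t index,
                  off_t *start, off_t *end)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || count == 0 || index >= count || file_size < 0)
        return -1;

    off_t n = (off_t)count;
    off_t pages = (file_size + page - 1) / page;  /* total pages, rounded up */
    off_t per_window = (pages + n - 1) / n;       /* pages per window, rounded up */

    off_t s = (off_t)index * per_window * page;
    off_t e = s + per_window * page;

    /* Clamp to the file size; trailing windows may be short or empty. */
    if (s > file_size)
        s = file_size;
    if (e > file_size)
        e = file_size;

    *start = s;
    *end = e;
    return 0;
}
```

For example, with 4096-byte pages, a file of 10 pages plus 123 bytes split four ways gives three windows of 3 pages each and a final window of 1 page plus 123 bytes.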
If each thread does linear read and write passes over the data window (even simultaneously, as long as the write trails the read), and the datasets do not fit into memory,
read() and
write() often surpass mmap() in speed. Use very large blocks, though; I'd recommend 2 MiB (2097152 bytes) if possible. Use
posix_fadvise() to inform the kernel about the blocks you won't use for a while, and about the blocks that are going to be read next, so that the kernel can handle the real I/O while you process the data. You might consider handing off the writes to a dedicated thread (all threads hand off writes to a single thread, using a queue). The writer thread will consume very little CPU time, so it's OK if it gets scheduled on a core used in the computations, but your worker threads are free to work on the next block.
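The write hand-off could look roughly like the sketch below. All the names (write_job, write_queue, queue_push() and so on) are invented for this example, the queue is unbounded for brevity, and it uses pwrite() rather than write() so that each job carries its own file offset:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* One pending write. `data` is malloc()ed by the worker, freed by the writer. */
struct write_job {
    struct write_job *next;
    int fd;
    off_t offset;
    size_t len;
    void *data;
};

struct write_queue {
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
    struct write_job *head, *tail;
    int closed;   /* set once no more jobs will be pushed */
    int error;    /* first errno seen by the writer, or 0 */
};

void queue_init(struct write_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    q->head = q->tail = NULL;
    q->closed = 0;
    q->error = 0;
}

/* Worker side: hand off a buffer. The queue takes ownership of `data`. */
int queue_push(struct write_queue *q, int fd, off_t offset, void *data, size_t len)
{
    struct write_job *job = malloc(sizeof *job);
    if (!job)
        return -1;
    job->next = NULL;
    job->fd = fd;
    job->offset = offset;
    job->len = len;
    job->data = data;

    pthread_mutex_lock(&q->lock);
    if (q->tail)
        q->tail->next = job;
    else
        q->head = job;
    q->tail = job;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* Tell the writer that no more jobs are coming. */
void queue_close(struct write_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->closed = 1;
    pthread_cond_broadcast(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Writer thread: drain jobs until the queue is closed and empty. */
void *writer_main(void *arg)
{
    struct write_queue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (!q->head && !q->closed)
            pthread_cond_wait(&q->nonempty, &q->lock);
        struct write_job *job = q->head;
        if (job) {
            q->head = job->next;
            if (!q->head)
                q->tail = NULL;
        }
        pthread_mutex_unlock(&q->lock);
        if (!job)
            return NULL;  /* closed and fully drained */

        const char *p = job->data;
        size_t left = job->len;
        off_t at = job->offset;
        while (left > 0) {
            ssize_t n = pwrite(job->fd, p, left, at);
            if (n == -1) {
                if (errno == EINTR)
                    continue;
                int saved = errno;
                pthread_mutex_lock(&q->lock);
                if (!q->error)
                    q->error = saved;
                pthread_mutex_unlock(&q->lock);
                break;
            }
            p += n;
            left -= (size_t)n;
            at += n;
        }
        free(job->data);
        free(job);
    }
}
```

Workers call queue_push() and move straight on to their next block. After the last push, call queue_close(), pthread_join() the writer, and check the error field. In a real program you would probably want to bound the queue so that fast workers cannot pile up unwritten buffers without limit.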
If you use reasonably large blocks, say 65536 bytes or larger powers of two, and you can tell the kernel about a block far enough in advance of actually needing it, you should be able to keep processing data at the same time the kernel is loading new data into memory. (The page cache in the Linux kernel is pretty efficient. Amazing, if you ask me.)
Most disk systems cannot provide maximum throughput when small data blocks are used. You may need chunks as large as two megabytes for reading and writing your data to get optimum performance. This applies mostly to read() and write(), but to a lesser extent to msync() too.
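To make the linear-pass case concrete, here is a sketch of a read loop that uses 2 MiB blocks and posix_fadvise() hints. The names process_window(), block_fn and sum_bytes() are hypothetical, and it uses pread() instead of read() on the assumption that several threads share one descriptor and each needs its own offset:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 2097152  /* 2 MiB */

/* Hypothetical per-block callback supplied by the caller. */
typedef void (*block_fn)(const unsigned char *data, size_t len, void *ctx);

/* Example callback: adds up all byte values into an unsigned long. */
void sum_bytes(const unsigned char *data, size_t len, void *ctx)
{
    unsigned long *sum = ctx;
    for (size_t i = 0; i < len; i++)
        *sum += data[i];
}

/* Read [start, end) of `fd` front to back in BLOCK_SIZE chunks, calling
 * `fn` on each chunk. Returns the number of bytes processed, or -1 on error.
 * The posix_fadvise() calls are only hints, so their results are ignored. */
off_t process_window(int fd, off_t start, off_t end, block_fn fn, void *ctx)
{
    unsigned char *buf = malloc(BLOCK_SIZE);
    if (!buf)
        return -1;

    /* The whole window will be read sequentially. */
    (void)posix_fadvise(fd, start, end - start, POSIX_FADV_SEQUENTIAL);

    off_t pos = start;
    while (pos < end) {
        size_t want = (end - pos < BLOCK_SIZE) ? (size_t)(end - pos) : BLOCK_SIZE;

        /* Ask the kernel to start loading the next block while we work. */
        if (pos + (off_t)want < end)
            (void)posix_fadvise(fd, pos + (off_t)want, BLOCK_SIZE,
                                POSIX_FADV_WILLNEED);

        ssize_t got = pread(fd, buf, want, pos);
        if (got == -1) {
            if (errno == EINTR)
                continue;
            free(buf);
            return -1;
        }
        if (got == 0)
            break;  /* end of file reached before `end` */

        fn(buf, (size_t)got, ctx);

        /* We will not come back to this block for a while. */
        (void)posix_fadvise(fd, pos, got, POSIX_FADV_DONTNEED);
        pos += got;
    }

    free(buf);
    return pos - start;
}
```

Whether the WILLNEED and DONTNEED hints actually help depends on the kernel, the filesystem, and the storage underneath, so measure with and without them on your own data.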
If you can tell us how your program walks the data while processing it, whether it updates the data, or writes results to somewhere else, and whether the worker threads access the same areas of the file(s), we might be able to suggest something more specific to try. As it is, we're talking in only very general terms here.
If you are unfamiliar with mmap(), check my example program
in this post. It uses an ephemeral sparse data file (created at startup and deleted when done) to modify a very sparse data structure spanning a terabyte, using basic mmap() techniques.