LinuxQuestions.org (/questions/)
-   Linux - Software (https://www.linuxquestions.org/questions/linux-software-2/)
-   -   How can I get maximum performance on a multi-processor machine? (https://www.linuxquestions.org/questions/linux-software-2/how-can-i-get-maximum-performance-on-a-multi-processor-machine-889296/)

JZL240I-U 07-04-2011 01:54 AM

Quote:

Originally Posted by michael@actrix (Post 4402982)

md5sum /dev/zero

Hehe. Nice trick. :D.

imperialbeachdude 07-04-2011 03:46 PM

Sure - pretty vanilla, I think:

CFLAGS   = -v -m64 -Wall
CPPFLAGS = -m64 -Wall
LDFLAGS  = -m64 -lpthread

Linker: Linker.o AStore.o DataLayer.o
	g++ $(LDFLAGS) -o Linker Linker.o AStore.o DataLayer.o


Quote:

Originally Posted by Tinkster (Post 4402120)
Could you post the compiler options and such that you've used building
your program?


Cheers,
Tink


imperialbeachdude 07-04-2011 03:49 PM

Thanks, htop was very informative. I think I may have another problem that I did not describe in the original question. This is a multithreaded app that reads & writes to one very large file. Each thread reads & writes to a different area of the file, but perhaps there is a common lock in Linux? That could explain the performance problem, I think. Hmm... more to work on -

Quote:

Originally Posted by michael@actrix (Post 4402982)
Try htop - it displays a bar for each CPU - see if they're all high or not. There is also iotop for I/O.
md5sum /dev/zero
will keep a CPU busy indefinitely. Run up several and use htop to see if all the CPUs get going.


syg00 07-04-2011 06:38 PM

That was why I suggested you try 4 cores on Linux - to see if you get comparable performance to Windoze, and (perhaps) provide evidence that your design won't scale properly.

There is no "common" lock for I/O that I'm aware of, but different filesystems may use similar mechanisms. Striping the file across multiple devices may help.

Nominal Animal 07-04-2011 07:29 PM

The I/O patterns are very important.

If each thread does random accesses in its window, you might try using mmap() instead of file I/O. You would use msync() to commit changes to disk if some other program needs to see the changes in real time, and madvise() to provide hints to the kernel if you know what data you'll need sometime in the future. Note that you cannot use stdio here; you'll need open() and the other low-level I/O routines from <unistd.h> instead. stdio is terribly slow anyway.

If the data files are small enough to fit in RAM, use mmap().

If the per-thread windows into the data are not aligned into page-sized units, you may have serious cacheline ping-pong at the boundary regions. The delays may be surprisingly long for a machine with so many cores. You can use sysconf(_SC_PAGESIZE) to query the page size at runtime.

If you cannot align the access windows, consider doing odd and even segments in two different passes (so that each segment worked on is surrounded by fallow segments). Or consider splitting the data into multiple separate files first, at the desired boundaries.

If each thread does linear read and write passes over the data window (even simultaneously, as long as the write trails the read), and the datasets do not fit into memory, read() and write() often surpass mmap() in speed. Use very large blocks, though; I'd recommend 2MB (2097152 bytes) if possible. Use posix_fadvise() to inform the kernel about the blocks you won't use for a while, and about the blocks that are going to be read next, so that the kernel can handle the real I/O while you process the data. You might consider handing off the writes to a dedicated thread (all threads hand off writes to a single thread, using a queue). The writer thread will consume very little CPU time, so it's OK if it gets scheduled on a core used in the computations, but your worker threads are free to work on the next block.

If you do reasonably large blocks, say 65536 bytes, or larger powers of two, and you can inform the kernel sufficient time prior to actually needing the block, you should be able to keep processing data at the same time the kernel is loading new data into memory. (The page cache in the Linux kernel is pretty efficient. Amazing, if you ask me.)

Most disk systems cannot provide maximum throughput when small data blocks are used. You may need to read and write your data in chunks as large as two megabytes to get optimum performance. This applies mostly to read() and write(), but to a lesser degree to msync() too.

If you can tell us how your program walks the data while processing it, whether it updates the data, or writes results to somewhere else, and whether the worker threads access the same areas of the file(s), we might be able to suggest something more specific to try. As it is, we're talking in only very general terms here.

If you are unfamiliar with mmap(), check my example program in this post. It uses an ephemeral sparse data file (creates it, and deletes it after done), to modify a very sparse data structure spanning a terabyte, using basic mmap() techniques.

michael@actrix 07-05-2011 03:37 AM

I don't think locking would be an issue if you're using a native Linux filesystem. Calling things like fflush() too frequently might cause problems, though. I would be tempted to use gprof or some other profiler to find out where the app is spending its time. Perhaps strace might give some quick clues at the system call level (man strace); for example
strace -c -S time myProgram
If "very large" is not too large, then /dev/shm might be a fast place to build the file, copying it somewhere permanent once it's finished.

Quote:

Originally Posted by imperialbeachdude (Post 4404790)
Thanks, htop was very informative. I think I may have another problem that I did not describe in the original question. This is a multithreaded app that reads & writes to one very large file. Each thread reads & writes to a different area of the file, but perhaps there is a common lock in Linux? That could explain the performance problem, I think. Hmm... more to work on -


