Old 04-07-2012, 07:14 PM   #1
mcastellana
LQ Newbie
 
Registered: Apr 2012
Posts: 5

Rep: Reputation: Disabled
Memory map slow-down


Hello,
I wrote a program (a.cpp) which allocates (with malloc()) very large vectors of doubles and performs algebraic operations on them. The vectors are so large that they cannot easily be kept in RAM. So I wrote a second version (b.cpp) doing exactly the same thing, but using mmap() to allocate the pointers to double. Both programs run on several cores with MPI.

I measured the running time on ONE core for both a.cpp and b.cpp, and there is no significant slow-down.

However, a significant slow-down occurs when I run the code on n >> 1 cores with MPI. Both a.cpp and b.cpp are MPI programs, and during the run the different cores do not communicate at all. Only at the end do all the cores other than 0 send their results to core 0, which collects them and writes them to a .dat file. In a.cpp, every core allocates a very large array of doubles with malloc(), while in b.cpp, every core memory-maps its array to a different file (one file per CPU, all the files in the same folder).

My problem is that when I run a.cpp and b.cpp on n >> 1 cores, b.cpp is much slower than a.cpp. This is not the case when the number of cores is 1.

Do you have any idea why? Is it that all the cores are trying to write to files in the same folder on the hard disk, and this slows the code down when n >> 1?

Thanks!
 
Old 04-07-2012, 07:59 PM   #2
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
You're probably making the disk continually seek between very different locations. With only one core running, the associated process stays in the same area of the disk for a while, which is far more efficient than making the disk try to be in several places at once.
Kevin Barry
 
Old 04-08-2012, 04:13 AM   #3
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
I concur with ta0kira above. Your working set is larger than the RAM you have, so you are limited by your storage I/O speed.

If you know your access patterns, you can try to mitigate the issue by using madvise(). Mark the entire mapping MADV_RANDOM, then issue MADV_WILLNEED for the regions you will be accessing soon. You can issue msync(addr,length,MS_ASYNC) for each region you won't be accessing for a while -- although it may not work that well with anonymous (swap-backed) mappings.

Remember that mappings are handled in page-sized (sysconf(_SC_PAGE_SIZE)-byte) blocks. Each block is either in RAM, or on disk.

A warning: MADV_DONTNEED would unconditionally discard the pages. If you use it, you will have to make sure the data is msync()ed to disk first, or you risk losing the data! (I seem to recall that msync(MS_ASYNC) followed by an immediate madvise(MADV_DONTNEED) is not sufficient, but I might be wrong.)

If you use a shared mapping on a file (mmap(NULL,length,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_NORESERVE,fd,0)), you can use a separate thread for evicting the pages from memory. That thread will issue msync(addr,length,MS_SYNC) (which will block until the data has been written) followed by madvise(addr,length,MADV_DONTNEED) for each area, causing it to be saved back to the file. Be careful that the threads will not try to access that area while that command pair is being run, though; you might need a separate (not in the mapping!) per-region semaphore. If the evicter thread follows the worker threads, and the worker threads will issue the MADV_WILLNEED suitably in advance, and your calculation bandwidth is smaller than the I/O bandwidth, you won't see any slowdown.
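To make the evicter idea concrete, here is a minimal sketch, assuming a file-backed MAP_SHARED mapping and that no worker thread touches the region while it runs (the function name is just illustrative):
Code:
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Flush one region of a MAP_SHARED file-backed mapping to disk and drop
 * its pages from RAM.  addr must be page-aligned, and no other thread
 * may access the region while this runs. */
static int evict_region(void *addr, size_t length)
{
    /* Block until the dirty pages have been written back to the file. */
    if (msync(addr, length, MS_SYNC) == -1) {
        perror("msync");
        return -1;
    }

    /* The pages are now clean, so discarding them cannot lose data; the
     * next access simply faults them back in from the file. */
    if (madvise(addr, length, MADV_DONTNEED) == -1) {
        perror("madvise");
        return -1;
    }

    return 0;
}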

If your calculation bandwidth is greater than your I/O bandwidth -- the calculations go through more data than you can read and write from disk -- then you're I/O limited anyway.
 
Old 04-08-2012, 02:00 PM   #4
mcastellana
LQ Newbie
 
Registered: Apr 2012
Posts: 5

Original Poster
Rep: Reputation: Disabled
Thank you guys for your answers. It seems that you are right. Nominal Animal, unfortunately I am very unfamiliar with most of the terms that you have been using in your reply (madvise(), MADV_RANDOM, ...), but I am trying to figure out what they exactly mean.

I was wondering if there is an easy way out of the problem. Let me explain. Let us say that my code wants to multiply an N x N matrix A by an N x N matrix B, and wants to do this for M different samples of A and B. The final result of the code is the average of A x B = C over all the samples. The way I parallelized this is the simplest one possible: give a bunch of samples to each CPU, and average at the end. In a.cpp each CPU allocates A, B, and the result C in RAM, while in b.cpp A, B and C are memory-mapped to a file (a different file for each CPU). When the number of CPUs n becomes large, b.cpp slows down compared to a.cpp, and it seems that this is because of I/O bandwidth.
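To make the structure concrete, the sample-parallel scheme I am using looks roughly like this (a rough sketch only; multiply_sample() stands for the real per-sample work, and MPI_Reduce stands in for the final collection on core 0):
Code:
#include <stdlib.h>
#include <mpi.h>

/* Rough sketch of the sample-parallel scheme: every rank multiplies its own
 * share of the M samples, accumulates the products locally, and rank 0 sums
 * the partial results and divides by M.  multiply_sample() is a placeholder
 * for the real A*B work (malloc- or mmap-based). */
static void average_products(int N, int M,
                             void (*multiply_sample)(int m, double *sum))
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *local_sum  = calloc((size_t)N * N, sizeof(double));
    double *global_sum = (rank == 0) ? calloc((size_t)N * N, sizeof(double)) : NULL;

    /* Each rank handles samples rank, rank + nprocs, rank + 2*nprocs, ... */
    for (int m = rank; m < M; m += nprocs)
        multiply_sample(m, local_sum);        /* local_sum += A_m * B_m */

    /* Collect the per-rank partial sums on rank 0 and average them. */
    MPI_Reduce(local_sum, global_sum, N * N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        for (long i = 0; i < (long)N * N; i++)
            global_sum[i] /= M;               /* then rank 0 writes the .dat file */

    free(local_sum);
    free(global_sum);
}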

What if I tried to overcome this problem by parallelizing the code in a different way? For instance, I could parallelize the matrix-matrix product: CPU#0 computes the matrix-matrix product by distributing the calculation amongst all the other CPUs and then does this for all the M samples. I could do this by switching from the BLAS function for matrix-matrix product dgemm() to its parallelized version pdgemm().

But will I have other I/O issues this way?
Thank you so much!
Best
Michele
 
Old 04-08-2012, 04:14 PM   #5
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
You might think of it in terms of the number of I/O operations required. Each new element of C requires summing N products formed from 2N elements of A and B. Ideally, you would read each element of A and B exactly once, and write each element of C exactly once. I don't think that's possible without reading A and B into memory in their entirety, however. The next-best option might be to read the elements of A and B exactly once and minimize the number of writes to the elements of C. Unfortunately, it seems to me that the only real way to parallelize matrix multiplication is to parallelize access to the elements of the output matrix.

See if the suggestion below gives you any ideas. It only makes sense to run it on a single core.

You can decompose the matrix multiplication into a sum of outer products of two N-element vectors. In other words, you can read column k from A and row k from B, compute an outer product, and sum those products over k=1..N to get C. If you have M pairs of A and B whose products need to be summed, you can just think of this as a single A that's NxNM and a single B that's NMxN. Choose a parameter P between 1 and NM to indicate the number of A-column/B-row pairs you can afford to store in memory at once. Below is some pseudo-code explaining the rest, since it's difficult to verbalize.
Code:
A = NxNM mmaped matrix (reading might be more efficient if you store A^T, rather than A, in the file)
B = NMxN mmaped matrix
C = zeroed NxN mmaped matrix

P = number of vectors to read

for i = 0 to NM-1, step P
	Pi = min(P,NM-i)

	A_temp = columns i:(i+Pi-1) of A (NxPi, copied to RAM)
	B_temp = rows    i:(i+Pi-1) of B (PixN, copied to RAM)

	for j = 0 to N-1
		C_temp = zeroed N-element row vector (1xN, stored in RAM)

		for k = 0 to Pi-1
			C_temp += A_temp[j,k]*B_temp[k,*]

		C[j,*] += C_temp (partial sum of row j of C)
Here, every input element is read exactly once and every output element is written ⌈NM/P⌉ times (plus a read each time for +=.) I don't think I/O can be optimized more than that (letting P=NM would be completely optimal wrt number of I/O operations.) The memory requirement for the code above is N(2P+1) elements. You should also optimize matrix storage such that each row occupies contiguous space in the respective file, minimizing "seek" operations. The only alternatives I can think of that would allow multithreading to improve performance would require that multiple NxN matrices be stored in RAM.
Kevin Barry

Last edited by ta0kira; 04-08-2012 at 06:55 PM. Reason: clarifications
 
Old 04-08-2012, 04:53 PM   #6
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by mcastellana View Post
Nominal Animal, unfortunately I am very unfamiliar with most of the terms that you have been using in your reply (madvise(), MADV_RANDOM, ...), but I am trying to figure out what they exactly mean.
Oh, sorry about that. They are functions and flags related to memory maps created with the mmap() function. They are needed when your dataset exceeds the available RAM and the kernel does not do a good enough job automatically. (Surprisingly often, it does, actually.)

Quote:
Originally Posted by mcastellana View Post
Let us say that my code wants to multiply an N x N matrix A by an N x N matrix B, and wants to do this for M different samples of A and B.
Ah, that obviously opens up a large number of new possibilities.

If you had -- in theory, this is just for illustration and understanding --
Code:
double A[M][N][N];
double B[M][N][N];
double C[N][N];
double t;
int    row, col, m, n;
then you could compute the result using
Code:
for (row = 0; row < N; row++) {
    for (col = 0; col < N; col++) {

        C[row][col] = 0.0;

        for (m = 0; m < M; m++)
            for (n = 0; n < N; n++)
                C[row][col] += A[m][row][n] * B[m][n][col];

        C[row][col] = C[row][col] / M;
    }
}
Now, assume that instead of B, you had D=transpose(B). Then, the innermost statement would be
Code:
C[row][col] += A[m][row][n] * D[m][col][n];
For simplicity, let us assume we clear and divide the result matrices in separate passes:
Code:
for (row = 0; row < N; row++)
    for (col = 0; col < N; col++)
        C[row][col] = 0.0;

for (row = 0; row < N; row++)
    for (col = 0; col < N; col++)
        for (m = 0; m < M; m++)
            for (n = 0; n < N; n++)
                C[row][col] += A[m][row][n] * D[m][col][n];

for (row = 0; row < N; row++)
    for (col = 0; col < N; col++)
        C[row][col] /= M;
As you can see, now the summation loops can be done in any order. You can put the four for loops in the middle group in any order, and the result will stay the same, except for rounding errors.

This is the slow, naïve matrix multiplication algorithm, which scales as O(N^3) and is therefore a lot slower than the asymptotically best known algorithms (which scale at about O(N^2.37)); for N = 100000 they differ by a factor of about 1400. It may sound like the naïve algorithm should always lose, but because disk I/O is so much slower than RAM access, the naïve approach may still be faster if it leads to less total I/O.

At this point, I'd like to ask whether you can tell us anything specific about how the data is generated: is it already in a file, or do you compute the elements? Do you need to compute them in a specific order? Are there internal dependencies on the order in which the elements are computed?

If you can save or compute the same element for different A and B matrices consecutively in memory, i.e. logically
Code:
double A[N][N][M];
double D[N][N][M]; /* B transposed */
double C[N][N];
then the innermost loop could be
Code:
for (row = 0; row < N; row++)
    for (col = 0; col < N; col++)
        for (n = 0; n < N; n++)
            for (m = 0; m < M; m++)
                C[row][col] += A[row][n][m] * D[col][n][m];
and each resulting element would be a dot product of two N*M vectors that are consecutive in memory. Current processors can vectorize that, cutting the computation time down to roughly a half or a quarter, depending on the CPU and the code. Even better, each thread could independently calculate one element of the result, requiring only 2*N*M doubles in memory.

If you have enough RAM to keep D[N][N][M] fully in RAM, plus one row of A, A[row][N][M], and one row of C, C[row][N] -- i.e. each machine has 8*N*N*M + 8*N*M + 8*N bytes of RAM available for this task -- then you could do this linearly and with minimal I/O. Each machine creates a pool of threads, and mmaps its row of A and the entire D read-only. Different machines map different rows. Each thread in the pool picks the next available column and calculates the result for that row and column, until all columns are done. Analogously, each machine picks the next available row and calculates all resulting elements for that row. Only the D matrix needs to be broadcast to all machines (but remember to save it to a file and use a read-only mapping, to avoid a problem called "cacheline ping-pong", where CPU cores fight over who "owns" the data). Each machine only needs one row of A at a time, so the rows can be transferred from the master node. To avoid network delays, have each machine keep the next row of A already available, receiving it asynchronously while computing the current row using MPI's asynchronous receive functions.
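A rough sketch of that double-buffered receive, assuming the master (rank 0) sends row r of A with tag r, and process_row() stands for the per-row computation (all names here are illustrative, not from any particular library):
Code:
#include <mpi.h>

/* Double-buffered receive of rows of A from the master: while one row is
 * being processed, the next one is already in flight.  row_doubles would be
 * N*M doubles for one row of A in the layout above; buf[0] and buf[1] are
 * two preallocated row buffers. */
static void process_assigned_rows(double *buf[2], int row_doubles,
                                  int first_row, int last_row, int step,
                                  void (*process_row)(int row, const double *a_row))
{
    MPI_Request req;
    int cur = 0;

    /* Post the receive for the first row before doing any work. */
    MPI_Irecv(buf[cur], row_doubles, MPI_DOUBLE, 0, first_row,
              MPI_COMM_WORLD, &req);

    for (int row = first_row; row < last_row; row += step) {
        /* Wait for the row we are about to use... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        /* ...and immediately post the receive for the next one, so it is
         * transferred while we compute. */
        if (row + step < last_row)
            MPI_Irecv(buf[1 - cur], row_doubles, MPI_DOUBLE, 0, row + step,
                      MPI_COMM_WORLD, &req);

        process_row(row, buf[cur]);   /* compute C[row][*] from this row and D */
        cur = 1 - cur;
    }
}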
 
Old 04-11-2012, 08:57 AM   #7
mcastellana
LQ Newbie
 
Registered: Apr 2012
Posts: 5

Original Poster
Rep: Reputation: Disabled
Dear Nominal Animal and ta0kira,
Thank you very much for your suggestions. I am currently trying to follow them and see if I can find a good compromise between memory consumption and speed. I hope that I will come up with something that works soon and that I will post it here!

Best and thanks again
 
Old 04-19-2012, 06:16 PM   #8
mcastellana
LQ Newbie
 
Registered: Apr 2012
Posts: 5

Original Poster
Rep: Reputation: Disabled
Dear Nominal Animal,
I tried your suggestion to use madvise(). First, I mmap() only the arrays that are too large to be kept in RAM. Whenever I mmap() a pointer p, I do madvise(p, filesize+1, MADV_RANDOM);. Then, say that I want to use
p+M, p+M+1, ..., p+M+N. I set p_temp = p+M; and do madvise(p_temp, N*sizeof(double), MADV_WILLNEED);. After this I do msync(p_temp, N*sizeof(double), MS_ASYNC);. Unfortunately there is no speed-up compared to the code without the madvise calls. Am I doing anything wrong?
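In code, what I am doing is roughly this (a sketch only; fd, filesize, M and N stand for my actual descriptor, sizes and offsets):
Code:
#include <stddef.h>
#include <sys/mman.h>

/* Roughly what my code does now: map the whole file, mark the mapping as
 * random-access, then hint the window p+M .. p+M+N-1 that I am about to use.
 * fd, filesize, M and N are placeholders for my actual values. */
static double *map_and_hint(int fd, size_t filesize, size_t M, size_t N)
{
    double *p = mmap(NULL, filesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return NULL;

    madvise(p, filesize, MADV_RANDOM);

    double *p_temp = p + M;
    madvise(p_temp, N * sizeof(double), MADV_WILLNEED);  /* prefetch hint    */
    msync(p_temp, N * sizeof(double), MS_ASYNC);         /* async write-back */

    return p;
}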

Thank you for your help!
Best
Michele
 
Old 04-20-2012, 08:20 AM   #9
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3939
The computery sciencey term for it is thrashing.

I still jokingly call it Maytag Mode, because back in the day I worked on a machine where the disk drives (with removable disk packs) for some reason looked to me rather like small Maytag(R) washing-machines. There wasn't much memory to be had, "back in the day," so disk swapping was a fact of life and sometimes those drives would really "go to town." They would literally slide across the floor.

So much so, in fact, that one fateful day the main drive that was used to hold swap-space slowly edged its way ... to the edge of ... the raised floor ... and-d-d-d ... crash!

It certainly gave new meaning to the word, "disk crash."

Virtual Memory is based on the notion of locality of reference, which is an idea that would have gotten you a PhD at one time. The idea is that "the next piece of data which the CPU is going to ask for, is likely to be nearby to one of the most-recently requested ones." That probability is unknown, and constantly changing, but nevertheless exploitable. What does this mean to you? It means that you need to design programs to exhibit locality of reference. Any "memory" access that you do might turn into a "disk file" reference without warning. Design the algorithm so that it concentrates its consecutive memory accesses into a few small groups of address-ranges instead of spraying them all over the memory-space. (And, keep your PC well away from the edge of that raised floor.)
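A trivial illustration of the difference, assuming a large row-major n-by-n array of doubles:
Code:
/* Both functions touch every element of the same n-by-n row-major array,
 * but the first walks memory consecutively (good locality: each page is
 * faulted in once and used completely), while the second jumps n*8 bytes
 * between accesses (poor locality: roughly one page touched per access
 * once n is large). */
double sum_by_rows(const double *a, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++)
        for (long j = 0; j < n; j++)
            s += a[i * n + j];        /* consecutive addresses */
    return s;
}

double sum_by_cols(const double *a, long n)
{
    double s = 0.0;
    for (long j = 0; j < n; j++)
        for (long i = 0; i < n; i++)
            s += a[i * n + j];        /* stride of n doubles between accesses */
    return s;
}
Both compute the same sum; on an array that does not fit in RAM, the second one is the kind of access pattern that sends the drives into "Maytag Mode."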

Last edited by sundialsvcs; 04-20-2012 at 08:26 AM.
 
Old 04-20-2012, 09:57 AM   #10
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
I completely agree with what sundialsvcs wrote above, and particularly this:

Quote:
Originally Posted by sundialsvcs View Post
Design the algorithm so that it concentrates its consecutive memory accesses into a few small groups of address-ranges instead of spraying them all over the memory-space.
This is exactly what I was trying to convey in my post #6 above. Your use case, mcastellana, is such that you should be able to not only group the accesses, but also parallelize the work very efficiently to multiple threads. It will be a bit more work, but it certainly is doable.

If it happens that you don't have enough RAM to apply the scheme in that post, you can convert your huge matrix to a number of smaller matrices (by rearranging the data), operate on the smaller matrices, then reconstruct the result matrix. The Block matrix Wikipedia article describes the details pretty well. There are further points to note if you decide to go this way (that affect the computation speed), so do let us know what you decide on.

Based on my experience, I'd say that the first approach is easier to implement, but the latter will yield a more efficient implementation.
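The loop structure of the block approach is simple; here is a minimal sketch, where each block is an S-by-S row-major array, K = N/S is the number of blocks per side, and in the real thing each block would live in its own file and be mmap()ed (or loaded) on demand:
Code:
/* c += a * b for one S-by-S block triplet (naive kernel; a real
 * implementation would call dgemm() here instead). */
static void multiply_block(const double *a, const double *b, double *c, int S)
{
    for (int i = 0; i < S; i++)
        for (int k = 0; k < S; k++) {
            const double aik = a[i * S + k];
            for (int j = 0; j < S; j++)
                c[i * S + j] += aik * b[k * S + j];
        }
}

/* C = A * B in K x K blocks of S x S elements each.  Ablk[i*K + k] points
 * to block (i,k) of A; the C blocks are assumed to start zeroed. */
static void blocked_multiply(double **Ablk, double **Bblk, double **Cblk,
                             int K, int S)
{
    for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++)
            for (int k = 0; k < K; k++)
                multiply_block(Ablk[i * K + k], Bblk[k * K + j],
                               Cblk[i * K + j], S);
}
Reordering the three outer loops (and the threads working on them) is where the I/O savings come from.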
 
Old 04-20-2012, 11:11 AM   #11
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by Nominal Animal View Post
If it happens that you don't have enough RAM to apply the scheme in that post, you can convert your huge matrix to a number of smaller matrices (by rearranging the data), operate on the smaller matrices, then reconstruct the result matrix. The Block matrix Wikipedia article describes the details pretty well. There are further points to note if you decide to go this way (that affect the computation speed), so do let us know what you decide on.
I described something like that, but instead of block matrices I broke the input matrices down by rows/columns. Using rows/columns allows you to read data from contiguous parts of the files (provided one matrix is stored as a transpose,) whereas using block-matrix representations will still cause you to read from discontiguous parts of the file. Additionally, the procedure I described only reads the input data once, and output is also done in contiguous file segments.
Kevin Barry
 
Old 04-20-2012, 12:10 PM   #12
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by ta0kira View Post
instead of block matrices I broke the input matrices down by rows/columns
Yes, but that too assumes a significant part (at least a couple of complete rows/columns) of the matrix fits in memory. If you have a number of threads working in parallel, and have memory limits you cannot really avoid, it may be a problem.

Besides, your approach also needs a transposed version of the source matrix, or each of the (NM/P) outer loop steps needs to read-skip through the entire matrix to get its P full columns, right?

The block matrix approach does not have that limitation. You can stream the input data into a number of temporary files in parallel, effectively building all the block matrices in one go. While you can parallelize this too, it is only useful if the workers reside on different computation nodes (machines), since it will always be I/O bandwidth limited. You do need to have a lot of files open, though, or you will be forced to do multiple passes.

It may speed up the computation, but generate more I/O, to also save the block matrices transposed. This is only useful if the blocks are very large, larger than about half the per-CPU-core data cache.

You can fire up the worker threads to work on the block matrices when the first row of block matrices complete. I personally would do some testing to see if there is a simple way to define the work order so that no two threads access the same block matrix at the same time, and that operations targeting the same result block matrix are preferably done consecutively (if possible), to reduce the result I/O. It should not be too difficult, fortunately.

Quote:
Originally Posted by ta0kira View Post
The only alternatives I can think of that would allow multithreading to improve performance would require that multiple NxN matrices be stored in RAM.
This is kind of my point: the block matrix approach does that, while only requiring temporary storage for twice the original matrix -- one block matrix set for the input (mapped read-only) and one for the results -- and only three times as many block matrices as there are worker threads need to be in RAM at any point.

For 1024x1024 block matrices using doubles, that is 8MB per block matrix; about 24MB per thread. At 2048x2048, about 96MB per thread. At 4096x4096, about 384MB per thread. (I would start with the middle one, comparing to the two others. I like round, power of two values, although it should not matter in this particular case.)

Mapping a file does have an intrinsic cost, which means the block matrices should be large enough to lower that cost to insignificant levels. On the other hand, if the block is small enough to take less than a third of the available per-core CPU cache, the multiplication will be faster. On the gripping hand, if you have the transposed version of the right side matrix, the access is to consecutive parts of memory, and you can have much larger block matrices without suffering from the cache penalty.

Finally, a lot of the efficiency can be lost if the block matrix operations are not suitably ordered. You would need a governor to make sure each input and output block matrix is only used by a single thread at once (to avoid cacheline ping-pong), and to preload the next ones early enough so you work at maximum I/O speed. Also, trying to operate on the same result block matrix in consecutive operations will reduce the amount of I/O needed, thus making the entire operation more efficient.

So, mcastellana has several approaches to pick from, which I think is a good thing.

I definitely do not know if this is even in theory better than what ta0kira suggested! If I even implied that accidentally, I owe you an apology.

I personally would do some preliminary testing to see if there is one that is clearly superior or inferior (I suspect not), then pick one that suits my programming skills and approaches best.
 
Old 04-20-2012, 09:05 PM   #13
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by Nominal Animal View Post
Yes, but that too assumes a significant part (at least a couple of complete rows/columns) of the matrix fits in memory.
Even if the output matrix has 1T elements, that makes each vector only 1M elements (for double that's 8MB.) I'm sure OP can fit at least 3 vectors in memory at once, which is the absolute minimum required by my method (although output performance would be about 25x better with 51 vectors.)
Quote:
Originally Posted by Nominal Animal View Post
You do need to have a lot of files open, though, or you will be forced to do multiple passes.

It may speed up the computation, but generate more I/O, to also save the block matrices transposed. This is only useful if the blocks are very large, larger than about half the per-CPU-core data cache.

You can fire up the worker threads to work on the block matrices when the first row of block matrices complete. I personally would do some testing to see if there is a simple way to define the work order so that no two threads access the same block matrix at the same time, and that operations targeting the same result block matrix are preferably done consecutively (if possible), to reduce the result I/O. It should not be too difficult, fortunately.
Storing blocks in separate files does nothing to change the thrashing problem, especially since each of those files must be read multiple times. Having multiple files open also won't change anything because that implies that the data is still read multiple times, and storing blocks in separate files certainly doesn't reduce the amount of fragmentation. Threading the computations of individual output blocks compounds that problem without having copies of the data on different drives.

Transposing the left matrix would certainly take some time, but if it's a one-time computation, it would be fine to just use the left matrix as-is if it isn't already transposed (which only changes the indexing in one line of my example.) Also, I assumed that there is a computational source for the left matrix that might be able to just write it in a transposed format. The point is that the method I suggested reads all input data exactly once and performs all necessary multiplications that involve the loaded data before loading new data, and the efficiency of output is directly controlled by the number of vectors that can be stored in memory at once.
Kevin Barry

Last edited by ta0kira; 04-20-2012 at 09:20 PM.
 
Old 04-21-2012, 12:08 PM   #14
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by ta0kira View Post
the method I suggested reads all input data exactly once
In practice, unless the right matrix (if row-major order, left matrix if column-major order is used) is transposed, don't you still have to skip-read it (NM/P) times?

Quote:
Originally Posted by ta0kira View Post
storing blocks in separate files certainly doesn't reduce the amount of fragmentation. Threading the computations of individual output blocks compounds that problem without having copies of the data on different drives.
The reason for doing it that way is not to avoid data fragmentation on the drives at all. In fact, when done like I described, the data fragmentation does not matter.

The scheme relies on:
  1. each operation on a block matrix takes longer than the average I/O needed to read and write the related blocks
    Note that in most cases a lot of I/O is avoided by ordering the block matrix operations.
    It gets easier the more block matrices there are.
  2. better control of caching of each block matrix
    if the block is not needed in the near future, a caretaker thread can msync(MS_SYNC) it to disk, then madvise(MADV_DONTNEED), then discard the mapping. That evicts it from RAM, including page cache.
  3. preloading new matrices before they are required
    using a caretaker thread, it is easiest to achieve by mapping and then accessing the map every sysconf(_SC_PAGE_SIZE) bytes (a minimal sketch follows right after this list)
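Here is that preloading step in minimal sketch form, assuming a read-only file-backed mapping of one block (the caretaker thread would call this for the block needed next; the function name is just illustrative):
Code:
#include <unistd.h>
#include <sys/mman.h>

/* Fault in every page of a mapped block, so the worker thread that needs it
 * next will not stall on disk I/O.  Touching one double per page is enough
 * to pull the page into the page cache and into the mapping. */
static double preload_mapping(const volatile double *map, size_t bytes)
{
    const size_t doubles_per_page = sysconf(_SC_PAGE_SIZE) / sizeof(double);
    double dummy = 0.0;

    madvise((void *)map, bytes, MADV_WILLNEED);       /* optional extra hint */

    for (size_t i = 0; i < bytes / sizeof(double); i += doubles_per_page)
        dummy += map[i];                              /* touch one double per page */

    return dummy;   /* returned only so the reads are not optimized away */
}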

Let's consider a practical case, to see whether there is a real-world reason this might be useful. (I'm working this out as I write, so I do not know the answer yet.)

Assume the original matrices to multiply have 32768 rows and columns. Using doubles, each matrix is then 8GB (8589934592) in size. If you have 24GB of RAM available, then you can do this in RAM with minimal I/O, using any number of threads. Let us assume you don't have that much RAM available.

A typical computational cluster node uses RAID-0 for work storage. My own workstation achieves on ext4 typically about 220 MB/s read and 180 MB/s write speeds. Let's simplify things, and assume 200 MB/s either way, and 100 MB/s for both when simultaneous. The asymmetry might be important when considering how the block matrices should be ordered, though; it is a 20% difference, after all. Note that faster drives make the situation more favourable for more I/O; I'm just assuming sensible lower limits for I/O efficiency.

File fragmentation only affects the read speeds when the queues are shallow. When you have a lot of I/O on the way, the I/O subsystem typically manages to do very well. This also applies to reading from a lot of smallish files, instead of few large ones. So, we don't really need to concern ourselves with file fragmentation, as long as the block matrices are large enough. In my experience, 2MB of data is large enough; they read just as fast as larger files (if not faster, overall, due to being discarded from page cache earlier/easier).

The two original 8GB matrices can be split into block matrices in about 82 seconds each, 164 seconds total.

Let's assume a block matrix size of 2048 rows and columns (4M elements, 32MB of data), which of course means there will be 16*16 = 256 block matrices per original matrix.

If you block the two original matrices in parallel -- you can avoid file fragmentation by using posix_fallocate() to allocate each block matrix file beforehand -- you can start the computation about ten to twenty seconds into the splitting, depending on whether you msync() the block matrices to disk first or not. You only have 34 file descriptors open at the same time, which means you won't risk hitting any limits, either. You will get access to a new row of block matrices every ten seconds or so afterwards.
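Creating one block matrix file could look roughly like this (a sketch; error reporting trimmed, and the name and calling convention are just an example):
Code:
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

/* Create one block matrix file of the given size, preallocating the space
 * with posix_fallocate() before mapping it for writing.  Returns the
 * mapping (and the descriptor via fd_out), or MAP_FAILED on error. */
static void *create_block_file(const char *path, size_t bytes, int *fd_out)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return MAP_FAILED;

    if (posix_fallocate(fd, 0, bytes) != 0) {    /* reserve the space up front */
        close(fd);
        return MAP_FAILED;
    }

    void *map = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        close(fd);
    else
        *fd_out = fd;
    return map;
}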

The naive O(N³) matrix multiplication algorithm on 2048x2048 matrices will take about 350 seconds on a 3GHz Athlon II X4 640. Fortunately, BLAS does it in about 22 seconds. You might be able to shave a couple of seconds off that by using AMD or Intel libraries optimized for your processor. This is only 96MB of data to read (3 block matrices), and 32MB to write. It takes only about half a second to read, and less than a third of a second to write. The CPU work takes more than 20 times longer than the related I/O!

In other words, the I/O you need to do can be done on the background, without it affecting the computation at all. Sure, there is the initial latency of ten to 160 seconds (depending on how you do it), but after that, you are CPU bound, unless you have very slow disks or a lot of CPU cores. All you need to do is to keep the next (threads) operations prepared beforehand, so the worker threads don't need to wait for their data to arrive before starting the work.

Since there are 16*16*16 = 4096 multiplications, and each takes about 22 seconds (on this CPU), the overall CPU time is 90112 seconds, or a bit over 25 hours. Without any optimization, you do 4096*96MB = 393216MB of reading, which takes only about 1966 seconds; and 4096*32MB = 131072MB of writing, which takes only about 655 seconds. Total I/O time at 200 MB/s read or write is therefore just 2621 seconds, a ratio of about 1:34 !

Therefore, if you have comparable CPUs, 200 MB/s read or write speed, and use 2048x2048 block matrices, you can expect to be able to use 32 threads in parallel (if your CPUs have that many cores), and the operation to finish in about 160s + 2800s + 80s ≃ 3000s, or about fifty minutes. With eight threads, about three hours and twelve minutes. (With eight threads, your drives can be as slow as 50 MB/s read or write, and still be CPU limited, not stalling/waiting for I/O at any time.)

Note that being able to use the non-naive matrix multiplication algorithm is the key factor here, cutting the computation time dramatically. (Actually, dgemm() for a 2048x2048 matrix seems to take only 1/16th of the time the naive matrix multiplication needs, on this CPU.)
I wish I could say I considered that when first describing this method... but I didn't. Not consciously, anyway. I knew it, and had even benchmarked it a few years back (comparing naive, matmul(), BLAS, and ACML matrix multiplications, on gfortran), but I did not think about it in this case. I still don't remember the exact results, just that ACML vectorization blew the others out of the water then -- I think GCC is able to do that too now, but I am not certain.

It is quite possible that using even larger block matrices would make the process more efficient, although it would drastically increase the memory requirements. On my own machine with 6GB of RAM, upping the block matrix size to 4096x4096 brings only a 2% speedup, while obviously quadrupling the memory use. I'd say a 2048x2048 block matrix size looks pretty good to me.

Unless I've made an error in the above, this seems to me to be clearly the best approach. I have tested the matrix multiplication times and I/O speeds carefully (on an Athlon II X4 640 with software RAID-0 on 2×Samsung HD103UJ drives using ext4), so you can rely on those.

Ta0kira, if you could check my logic above, and let me know your opinion, I'd appreciate it very much. I do often make errors, after all, and I might have missed something important above.
 
Old 04-21-2012, 10:43 PM   #15
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
I'm sorry, your post has too much extraneous information for me to fully understand your point. Please isolate your logical arguments or post example code. I do have a few comments, though:
  1. If the difference in performance between having A and A^T in a file (for my method) is actually substantial then you'll have the same problem for blocks. In other words, if fragmentation isn't a problem for blocks, then reading columns from a row-major matrix won't be a problem either.
  2. Each element/block is a sum of unique multiplications; the summation can be done incrementally, but the multiplication cannot be. If you compute each output block in full before moving on to another output block you'll have to keep rereading the input blocks from the files (unless you cache all of them in RAM.) That's just how matrix multiplication works, whether or not you use blocks.
  3. The "faster" matrix multiplication algorithms you reference are not numerically stable.

Here is a working program (C99+POSIX) implementing what I described:
Code:
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>


typedef double m_value;

typedef struct
{
	unsigned int rows, cols, trans;
	size_t map_size;	/* in bytes; size_t so matrices larger than 4GB map correctly */
	m_value *data;
} matrix;


static int map_matrix_file(const char *specs, const char *file, int *descriptor, matrix *new_matrix);
static int create_new_matrix(matrix *A, matrix *B, const char *file, int * descriptor, matrix *new_matrix);

static inline unsigned int rows(matrix *M);
static inline unsigned int cols(matrix *M);
static inline m_value *element(matrix *M, unsigned int row, unsigned int col);

static void multiply_matrices(matrix *A, matrix *B, matrix *C, unsigned int P);


int main(int argc, char *argv[])
{
	if (argc != 7)
	{
	fprintf(stderr, "%s (T)[A rows]x[A cols] [A filename] (T)[B rows]x[B cols] [B filename] [# vectors] [C filename]\n", argv[0]);
	return 1;
	}

	const char *const A_file = argv[2], *const B_file = argv[4], *const C_file = argv[6],
	           *const A_spec = argv[1], *const B_spec = argv[3], *const vector_count = argv[5];
	matrix A, B, C;
	int A_fd, B_fd, C_fd;
	unsigned int vectors;


	if (!map_matrix_file(A_spec, A_file, &A_fd, &A)) return 1;
	if (!map_matrix_file(B_spec, B_file, &B_fd, &B)) return 1;
	if (!create_new_matrix(&A, &B, C_file, &C_fd, &C)) return 1;

	if (cols(&A) != rows(&B))
	{
	fprintf(stderr, "inner matrix dimensions are not equal: %u, %u\n", cols(&A), rows(&B));
	return 1;
	}

	if (sscanf(vector_count, "%u", &vectors) != 1 || vectors < 1)
	{
	fprintf(stderr, "invalid vector count '%s'\n", vector_count);
	return 1;
	}

	multiply_matrices(&A, &B, &C, vectors);

	munmap(A.data, A.map_size);
	munmap(B.data, B.map_size);
	munmap(C.data, C.map_size);
	close(A_fd);
	close(B_fd);
	close(C_fd);

	return 0;
}



static int map_matrix_file(const char *specs, const char *file, int *descriptor, matrix *new_matrix)
{
	if      (sscanf(specs, "T%ux%u", &new_matrix->rows, &new_matrix->cols) == 2) new_matrix->trans = 1;
	else if (sscanf(specs, "%ux%u",  &new_matrix->rows, &new_matrix->cols) == 2) new_matrix->trans = 0;
	else
	{
	fprintf(stderr, "invalid matrix specification '%s' (for '%s')\n", specs, file);
	return 0;
	}


	*descriptor = open(file, O_RDONLY);
	if (*descriptor < 0)
	{
	fprintf(stderr, "could not open matrix file '%s': %s\n", file, strerror(errno));
	return 0;
	}

	new_matrix->map_size = (size_t) new_matrix->rows * new_matrix->cols * sizeof(m_value);
	unsigned int page_size = sysconf(_SC_PAGE_SIZE);
	if (new_matrix->map_size % page_size) new_matrix->map_size = ( (new_matrix->map_size / page_size) + 1 ) * page_size;

	new_matrix->data = (m_value*) mmap(NULL, new_matrix->map_size, PROT_READ, MAP_PRIVATE, *descriptor, 0);
	if (new_matrix->data == MAP_FAILED)
	{
	fprintf(stderr, "could not map matrix file '%s': %s\n", file, strerror(errno));
	close(*descriptor);
	return 0;
	}

	return 1;
}


static int create_new_matrix(matrix *A, matrix *B, const char *file, int * descriptor,
  matrix *new_matrix)
{
	*descriptor = open(file, O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (*descriptor < 0)
	{
	fprintf(stderr, "could not open matrix file '%s': %s\n", file, strerror(errno));
	return 0;
	}

	new_matrix->rows  = rows(A);
	new_matrix->cols  = cols(B);
	new_matrix->trans = 0;

	static const m_value zero = 0;

	lseek(*descriptor, (off_t) ( ( (size_t) new_matrix->rows * new_matrix->cols - 1 ) * sizeof(m_value) ), SEEK_SET);
	write(*descriptor, &zero, sizeof zero);
	lseek(*descriptor, 0, SEEK_SET);

	new_matrix->map_size = (size_t) new_matrix->rows * new_matrix->cols * sizeof(m_value);
	unsigned int page_size = sysconf(_SC_PAGE_SIZE);
	if (new_matrix->map_size % page_size) new_matrix->map_size = ( (new_matrix->map_size / page_size) + 1 ) * page_size;

	new_matrix->data = (m_value*) mmap(NULL, new_matrix->map_size, PROT_WRITE, MAP_SHARED, *descriptor, 0);
	if (new_matrix->data == MAP_FAILED)
	{
	fprintf(stderr, "could not map matrix file '%s': %s\n", file, strerror(errno));
	close(*descriptor);
	return 0;
	}

	return 1;
}


static inline unsigned int rows(matrix *M)
{ return (M->trans? M->cols : M->rows); }

static inline unsigned int cols(matrix *M)
{ return (M->trans? M->rows : M->cols); }

static inline m_value *element(matrix *M, unsigned int row, unsigned int col)
{ return M->data + ( M->trans? ( col * M->cols + row ) : ( col + row * M->cols ) ); }


static void multiply_matrices(matrix *A, matrix *B, matrix *C, unsigned int P)
{
	if (P > rows(B)) P = rows(B);
	if (P == 0)      P = 1;

	unsigned int P_i;

	unsigned int A_cache = rows(A) * P, B_cache = cols(B) * P, C_cache = cols(C);

	m_value *A_temp = (m_value*) malloc(A_cache * sizeof(m_value));
	m_value *B_temp = (m_value*) malloc(B_cache * sizeof(m_value));
	m_value *C_temp = (m_value*) malloc(C_cache * sizeof(m_value));

	//attempt to lock the cached data in memory (ignore failure)

	mlock(A_temp, A_cache * sizeof(m_value));
	mlock(B_temp, B_cache * sizeof(m_value));
	mlock(C_temp, C_cache * sizeof(m_value));


	for (int I = 0; P && I < rows(B); I += P)
	{
	P_i = (P > rows(B) - I)? (rows(B) - I) : P;

	//load the next sections of data

	if (A->trans)
	 {
	for (int J = 0; J < P_i; J++)
	for (int K = 0; K < rows(A); K++)
	A_temp[rows(A) * J + K] = *element(A, K, I + J);
	 }

	else
	 {
	for (int K = 0; K < rows(A); K++)
	for (int J = 0; J < P_i; J++)
	A_temp[rows(A) * J + K] = *element(A, K, I + J);
	 }

	if (B->trans)
	 {
	for (int K = 0; K < cols(B); K++)
	for (int J = 0; J < P_i; J++)
	B_temp[cols(B) * J + K] = *element(B, I + J, K);
	 }

	else
	 {
	for (int J = 0; J < P_i; J++)
	for (int K = 0; K < cols(B); K++)
	B_temp[cols(B) * J + K] = *element(B, I + J, K);
	 }


	//compute partial sums for output

	for (int J = 0; J < rows(C); J++)
	 {

	//compute a partial sum for the current output row

	for (int K = 0; K < P_i; K++)
	if (K == 0)
	  {
	for (int L = 0; L < cols(C); L++)
	C_temp[L] = A_temp[rows(A) * K + J] * B_temp[cols(B) * K + L];
	  }

	else
	  {
	for (int L = 0; L < cols(C); L++)
	C_temp[L] += A_temp[rows(A) * K + J] * B_temp[cols(B) * K + L];
	  }

	//add the partial sum to the current output row

	if (I == 0)
	  {
	for (int L = 0; L < cols(C); L++)
	*element(C, J, L) = C_temp[L];
	  }

	else
	  {
	for (int L = 0; L < cols(C); L++)
	*element(C, J, L) += C_temp[L];
	  }
	 }

	}


	munlock(A_temp, A_cache * sizeof(m_value));
	munlock(B_temp, B_cache * sizeof(m_value));
	munlock(C_temp, C_cache * sizeof(m_value));
	free(A_temp);
	free(B_temp);
	free(C_temp);
}
Sorry for the length. Save it as matrix-times.c and compile with gcc -O2 --std=c99 matrix-times.c -o matrix-times. If you had "A.dat" (1024x1024) and "B.dat" (1024x1024), and you wanted to store 100 vectors from each in memory at once with output to "C.dat", you'd call ./matrix-times 1024x1024 A.dat 1024x1024 B.dat 100 C.dat. If you want to use the transpose of either, prepend the dims with "T", e.g. "T1024x1024" (the dims are those of the stored matrix, not of its transpose.) I've assumed that the matrices are stored as row-major C-arrays of double.

As long as the inner dimensions match, it should work properly. I tested it with smaller matrices that I could visually validate (e.g. 5x10 times 10x7) with various cache sizes and that worked well, so I made the assumption that it would scale properly (if I use this in the future I'll perform better validation, though.) I did a few tests with 2048x2048 using different numbers of cached vectors. Here are a few of those results: 256 vectors: 24s; 128 vectors: 22s; 64 vectors: 22s; 32 vectors: 22s; 4 vectors: 26s. CPU is 2.67GHz Intel i7 (dual, not quad.) I also did some 4096x4096 tests: 1024 vectors: 191s; 512 vectors: 194s; 256 vectors: 198s; 128 vectors: 196s.
Kevin Barry

Last edited by ta0kira; 04-22-2012 at 02:05 AM. Reason: added more test results, made col reading more efficient, fixed row-major error
 
  

