Programming
This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
/* Description: This program implements a proposed scheme for the multiplication of two square matrices as a parallel program. It is intended to serve as a program for studying the performance metrics considered when evaluating parallel programs.

Conventions:
1. Variables beginning with "my_" are thought of as local to a process.
2. Most of the comments are written in the first person, since it is intuitive to think in terms of which process "you are".
3. The order in which the routines are listed and defined reflects the hierarchical or inclusion relation among them, in the sense that the first prototype listed is the routine that does not depend on any other routine, and it will be the first routine to be defined after the definition of the main module. */
#include <stdio.h>  /* Defines the standard input and output instructions. */
#include <stdlib.h> /* Defines the constant for execution end status. */
#include <math.h>   /* Used for mathematical functions and constants. */
#include <string.h> /* Used for memset functions. */
#include <mpi.h>    /* Message Passing Interface implementation. */
/* This routine computes the dot product between two vectors passed as arrays of double precision floating point values xx and yy. The dimension NN is assumed to be the same for both of them. Mathematically, this operation returns a scalar value, here represented as a double precision floating point variable. */
double DotProduct (double *xx, double *yy, int NN);
/* Main module: */
int main (int argc, char **argv) {
double **my_AA;      /* Local factor matrix AA. */
double **my_BB;      /* Local factor matrix BB. */
double **temp;       /* Temporary buffer for matrix BB, used in broadcast. */
double **my_CC;      /* Local result or product matrix CC. */
double **CC;         /* Global result or product matrix CC. */
double start_time;   /* Start time for the execution of the multiplication. */
double end_time;     /* Ending time for the execution of the multiplication. */
double my_dot_prod;  /* Partial result of the dot product. */
int mpi_error;       /* Used for MPI-related calls that may lead to an error. */
int my_rank;         /* Used to store the rank of the process. */
int remainder;       /* Used in the domain decomposition. It has to be zero, or else it will indicate a required adjustment in the amount of rows per process. */
int *rows_per_proc;  /* Keeps information on how many rows the procs have. */
int pp;              /* Used to store the number of processes. */
int NN;              /* Dimension of the matrices. */
unsigned int ii;     /* Iterator. */
unsigned int jj;     /* Iterator. */
unsigned int kk;     /* Iterator. */
/* Initialize and ensure no error: */
mpi_error = MPI_Init(&argc, &argv);
if (mpi_error != MPI_SUCCESS) {
  if (my_rank == MASTER)
    printf("ERROR 00: Initialization failed! Exiting...");
  return EXIT_FAILURE;
}

/* Parameters of the problem: */
NN = 6;

/* Obtain my rank AND the number of processes in the global execution: */
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &pp);

/* Check if parallelization makes sense: */
if (NN < pp) {
  if (my_rank == MASTER) {
    printf("ERROR 01: You actually have more processes than rows.\n");
    printf("Exiting...\n");
  }
  return EXIT_FAILURE;
}
/* Memory allocations: */
/* The first memory allocation considered is the one for the array containing
   the information of the domain decomposition. This allocation will be done
   first, and then the actual domain decomposition will be performed. */
rows_per_proc = (int *) malloc(pp*sizeof(int));
if (rows_per_proc == NULL) {
  if (my_rank == MASTER) {
    printf("ERROR 02: Row information array could not be allocated!\n");
    printf("Exiting...\n");
  }
  return EXIT_FAILURE;
}
if (my_rank == MASTER)
  printf("MASTER: I need to distribute %i rows among %i processes.\n", NN, pp);

/* Domain decomposition: */
/* Determine how many rows each process is going to work with: */
remainder = NN % pp;
/* If the number of rows allows an equal distribution among the processes,
   then each process will work with NN/pp rows; but if this division is not
   even, the remaining rows will be distributed among the processes, starting
   from the first one. */
/* So, every process first computes the even amount of rows: */
for (ii = 0; ii < pp; ii++) {
  rows_per_proc[ii] = (NN - remainder)/pp;
}
/* And second, every process gets its extra row until the remainder is
   covered. Notice that if the remainder is actually 0, that is, an even
   division NN/pp, then this loop is NOT executed. */
for (ii = 0; ii < remainder; ii++) {
  rows_per_proc[ii]++;
}
/* The master shows the global array containing the information of the rows
   per process: */
if (my_rank == MASTER) {
  printf("MASTER: Rows per processor: ");
  for (ii = 0; ii < pp; ii++)
    printf("%i ", rows_per_proc[ii]);
  printf("\n");
}
/* Each process creates its pieces of the matrices. */
/* Memory allocation for the matrices. Notice that every allocation is verified. */

/* 1. Memory allocation for my_AA: */
my_AA = (double **) malloc(rows_per_proc[my_rank]*sizeof(double));
if (my_AA != NULL) {
  for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
    my_AA[ii] = (double *) malloc(NN*sizeof(double));
  }
} else {
  if (my_rank == MASTER) {
    printf("ERROR 03: Matrix my_AA could not be allocated!\n");
    printf("Exiting...\n");
  }
  return EXIT_FAILURE;
}
/* Filling my_AA with values: */
for (ii = 0; ii < rows_per_proc[my_rank]; ii++)
  for (jj = 0; jj < NN; jj++)
    my_AA[ii][jj] = 1.0;
/* Printing my_AA: */
printf("PROC %d: My %d rows of my_AA: \n", my_rank, rows_per_proc[my_rank]);
for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
  for (jj = 0; jj < NN; jj++) {
    printf("%lf ", my_AA[ii][jj]);
  }
  printf("\n");
}

/* 2. Memory allocation for my_BB: */
my_BB = (double **) malloc(rows_per_proc[my_rank]*sizeof(double));
if (my_BB != NULL) {
  for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
    my_BB[ii] = (double *) malloc(NN*sizeof(double));
  }
} else {
  if (my_rank == MASTER) {
    printf("ERROR 04: Matrix my_BB could not be allocated!\n");
    printf("Exiting...\n");
  }
  return EXIT_FAILURE;
}
/* Filling my_BB with values: */
for (ii = 0; ii < rows_per_proc[my_rank]; ii++)
  for (jj = 0; jj < NN; jj++)
    my_BB[ii][jj] = pow(-1.0, jj);
/* Printing my_BB: */
printf("PROC %d: My %d rows of my_BB: \n", my_rank, rows_per_proc[my_rank]);
for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
  for (jj = 0; jj < NN; jj++) {
    printf("%lf ", my_BB[ii][jj]);
  }
  printf("\n");
}
/* 3. Memory allocation for temp: */
temp = (double **) malloc(rows_per_proc[my_rank]*sizeof(double));
if (temp != NULL) {
  for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
    temp[ii] = (double *) malloc(NN*sizeof(double));
  }
} else {
  if (my_rank == MASTER) {
    printf("ERROR 05: Matrix temp could not be allocated!\n");
    printf("Exiting...\n");
  }
  return EXIT_FAILURE;
}

/* 4. Memory allocation for my_CC: */
my_CC = (double **) malloc(rows_per_proc[my_rank]*sizeof(double));
if (my_CC != NULL) {
  for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
    my_CC[ii] = (double *) malloc(NN*sizeof(double));
  }
} else {
  if (my_rank == MASTER) {
    printf("ERROR 06: Matrix my_CC could not be allocated!\n");
    printf("Exiting...\n");
  }
  return EXIT_FAILURE;
}
for (ii = 0; ii < rows_per_proc[my_rank]; ii++)
  for (jj = 0; jj < NN; jj++)
    my_CC[ii][jj] = -1.0;

printf("PROC %d: My %d rows of my_CC: \n", my_rank, rows_per_proc[my_rank]);
for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
  for (jj = 0; jj < NN; jj++) {
    printf("%lf ", my_CC[ii][jj]);
  }
  printf("\n");
}
/* Once the data is set, start measuring time and compute the product: */
if (my_rank == MASTER)
  start_time = MPI_Wtime();
/* Each process broadcasts its piece of BB, that is, my_BB: */
for (kk = 0; kk < pp; kk++) {
  if (my_rank == kk) {
    temp = my_BB;
    mpi_error = MPI_Bcast(temp, NN*rows_per_proc[my_rank], MPI_DOUBLE, kk, MPI_COMM_WORLD);
    if (mpi_error != MPI_SUCCESS) {
      if (my_rank == MASTER)
        printf("ERROR 07: Broadcasting of slices of my_BB failed! Exiting!");
      return EXIT_FAILURE;
    }
    /* Once broadcast, every process computes the dot product for each of its columns: */
    for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
      for (jj = 0; jj < NN; jj++) {
        my_CC[ii][jj] = DotProduct(my_AA[ii], temp[ii], NN);
      }
    }
  }
}
printf("PROC %d: My rows of my_CC TOTAL: \n", my_rank);
for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
  for (jj = 0; jj < NN; jj++) {
    printf("%lf ", my_CC[ii][jj]);
  }
  printf("\n");
}
/* Finally, the MASTER process gathers the partial results into a global matrix
   named CC. Again, we wait until here for the allocation of this global array,
   because there is no point in having the root process do it if, at the point
   of the execution where the broadcast occurs, an error could lead to a
   premature interruption of the code. Clearly, every other allocation
   considered so far has to be done at its respective point, since they are
   needed in the computation of the required product. */
/* Make sure everyone has finished before calling the gathering: */
MPI_Barrier(MPI_COMM_WORLD);
if (my_rank == MASTER) {
/* Memory allocation (and its verification) for CC: */
CC = (double **) malloc(NN*sizeof(double));
if (CC != NULL) {
  for (ii = 0; ii < NN; ii++) {
    CC[ii] = (double *) malloc(NN*sizeof(double));
  }
} else {
  if (my_rank == MASTER) {
    printf("ERROR 08: Matrix CC could not be allocated!\n");
    printf("Exiting...\n");
  }
  return EXIT_FAILURE;
}
for (ii = 0; ii < NN; ii++)
  for (jj = 0; jj < NN; jj++)
    CC[ii][jj] = -1.0;
printf("\n\nMASTER: CC BEFORE GATHER: \n");
for (ii = 0; ii < NN; ii++) {
  for (jj = 0; jj < NN; jj++) {
    printf("%lf ", CC[ii][jj]);
  }
  printf("\n");
}
printf("\n");
}
if (my_rank == MASTER) {
  printf("MASTER: CC: \n");
  for (ii = 0; ii < NN; ii++) {
    for (jj = 0; jj < NN; jj++) {
      printf("%lf ", CC[ii][jj]);
    }
    printf("\n");
  }
  printf("\n\n");
}

/* Finalize time measure: */
if (my_rank == MASTER)
  end_time = MPI_Wtime();

/* Finalize MPI AND finalize execution: */
MPI_Finalize();
return EXIT_SUCCESS;
}
/* This routine computes the dot product between two vectors passed as arrays of double precision floating point values xx and yy. The dimension NN is assumed to be the same for both of them. Mathematically, this operation returns a scalar value, here represented as a double precision floating point variable. */
double DotProduct (double *xx, double *yy, int NN) {
unsigned int ii;
double aux;

aux = 0.0;
for (ii = 0; ii < NN; ii++) {
  aux = aux + xx[ii]*yy[ii];
}
return aux;
}
It's giving me a seg-fault right at the end, at the point where I do the final gather from the MASTER processor.
I just can't find... why is my code seg-faulting?
It's giving me a seg-fault right at the end, at the point where I do the final gather from the MASTER processor.
Since this is an exercise, I don't want to give you an answer; I want to help you find it yourself. I hope you understand and appreciate that.
The code won't crash; this confirms your diagnosis of where your program actually crashes.
So, let's check the parameters one by one. See man MPI_Gather for the details.
my_CC
This is the send buffer.
rows_per_proc[my_rank]*NN
This is the number of elements in the send buffer. Umm.. Are you sure about this value?
MPI_DOUBLE
This is the data type of elements in the send buffer.
CC
This is the receive buffer. It's only used in the receiving process, but you really should set it to NULL on others.
If you use warnings when compiling, e.g. mpicc -Wall, you'll see the note about CC.
rows_per_proc[my_rank]*NN
This is the number of elements per process in the receive buffer. Umm.. Are you sure about this value?
MPI_DOUBLE
This is the data type of elements in the receive buffer. (MPI can do some data type conversions on the fly.)
MASTER
Receiving process. You've defined this as zero, which means the root process. Quite okay.
MPI_COMM_WORLD
Gather is done from the entire group; okay.
Perhaps you should reconsider your MPI_Gather call. Perhaps you should take a look at the MPI_Gather example first? Pay attention to why the example uses MPI_Gatherv instead. Also, remember that the receive buffer will contain the data from each of the pp processes, including the receive process itself.
Although MPI_Gatherv() is pretty much perfect for this case, you can try to get your program working by replacing it with MPI_Send()s on all but the master process, and corresponding MPI_Recv()s on the master process. If you do that, it would probably help you understand which kind of MPI_Gather()/MPI_Gatherv() call you need to do it all at once.
Since this is an exercise, I don't want to give you an answer; I want to help you find it yourself. I hope you understand and appreciate that.
I don't know MPI myself, but I expected to be able to understand an example this easy from the documentation and help the OP. The documentation is less helpful than I would have expected. There are plenty of details documented, but basic concepts are largely missing or buried.
So I think you could have explained a few of the basic concepts the OP seems to have misunderstood or pointed to some of the key spots in the doc that a beginner might easily miss. That still isn't doing the exercise.
Quote:
So, let's check the parameters one by one. See man MPI_Gather for the details.
Those parameter descriptions are a bit short of understandable, especially recvcount. The OP seems to have misunderstood recvbuf as well.
The outcome is as if each of the n processes in the group (including the root process) had executed a call to

MPI_Send(sendbuf, sendcount, sendtype, root, ...)

and the root had executed n calls to

MPI_Recv(recvbuf + i*recvcount*extent(recvtype), recvcount, recvtype, i, ...)
That gives you a direct understanding of how recvcount and recvbuf are used by MPI_Gather.
Eduardo, is the receive buffer you allocated compatible with MPI_Gather using it as the text I just quoted describes?
Quote:
Although MPI_Gatherv() is pretty much perfect for this case, you can try to get your program working by replacing it with MPI_Send()s on all but the master process, and corresponding MPI_Recv()s on the master process.
That's a very good suggestion.
Quote:
If you do that, it would probably help you understand which kind of MPI_Gather()/MPI_Gatherv() call you need to do it all at once.
Only after you find the documentation that tells you what MPI_Gather()/MPI_Gatherv() really do, rather than the documentation that ambiguously documents their parameters.
Also, all the places the original code says
(double **) malloc( something *sizeof(double))
are incorrect. That should be
(double **) malloc( something *sizeof(double*))
That error just wastes memory. It won't cause any crash nor wrong answer. But it is still an error.
BTW, I haven't looked through the details of your actual matrix product logic, nor do I know best technique for distributing a matrix product. But even trivial thought regarding data flow should tell you it is lame to parallelize the problem by dividing into groups of rows. You should parallelize by dividing the solution matrix into nearly square rectangular chunks.
So I think you could have explained a few of the basic concepts the OP seems to have misunderstood or pointed to some of the key spots in the doc that a beginner might easily miss. That still isn't doing the exercise.
nor do I know best technique for distributing a matrix product. But even trivial thought regarding data flow should tell you it is lame to parallelize the problem by dividing into groups of rows. You should parallelize by dividing the solution matrix into nearly square rectangular chunks.
I agree. You can then distribute either rectangular parts of the left and right matrices, or full rows of the left matrix and full columns of the right matrix. It'd be quite interesting to compare the total costs of each method..
For very small matrices, say < 20x20, I'd just broadcast the entire left and right side matrices to each process, have them do some of the computational work, into a result matrix which is initialized to zeros. Then you can reduce them using sum operation, to get the actual result matrix -- and it does not matter which process has done which part of the work, as long as all of the computations have been done.
It is always a good idea to think about the operation at hand:
result[row][column] = sum over i: left[row][i] * right[i][column]
Note that transposing the right matrix makes the computations a lot more efficient, since the data is then consecutive in memory.
I agree. You can then distribute either rectangular parts of the left and right matrices, or full rows of the left matrix and full columns of the right matrix. It'd be quite interesting to compare the total costs of each method.
I was having trouble visualizing an underlying communication system for MPI efficient enough that there could be any savings in elapsed time vs. having one process do the whole job (matrix multiply). The ratio of work to communication goes up some when the matrix is larger. But still not enough to make sense.
I was thinking of a rectangular division of processes into intersecting teams (of course I don't know how hard teams are to set up in MPI). Given the total number of processes, compute the integer square root, then the integer quotient, for the two team dimensions. Example: 23 processes. The integer square root of 23 is 4, and 23/4 is 5. So use a 4x5 set of intersecting teams and waste the extra 3 processes.
Each process would be a member of one of four teams of 5 and one of five teams of 4. Each team of 5 would multiply all of the A rows by one quarter of the B columns. Each team of 4 would multiply one fifth of the A rows by all the B columns. So each process acting simultaneously in two teams would multiply one fifth of the A rows times one quarter of the B columns.
Based on the OP's design, each process would first invent data for its twentieth of the A rows and its twentieth of the B columns.
Then the 5 row teams in parallel would each go through 4 team members in sequence to exchange/keep row data, so each member of a row team then had the rows of the entire team.
Then the 4 column teams in parallel would each go through 5 team members in sequence to exchange column data, and as each chunk of column data was exchanged, each process could multiply it by all the rows of its row team and store those results and discard that column chunk before going on to the next.
Then all the results could be handed to the master, which would get them in a strange sequence (by rectangular chunk, rather than by row). If it just needs to display the results it just needs an accessor function that takes row and column number and grabs the right element from the chunks oriented result. If it needs significant further work, it would need to reorder the result. But that is likely less expensive than using MPI comm enough more to deliver the result in final sequence.
If you add up all the data flow (especially if comm by teams in parallel really is in parallel) you should see enough less total work to more than compensate for idling a few processes.
If the matrix dimension isn't divided evenly by the (adjusted) number of processes, that leads to two level special case setup of dividing the matrix almost evenly by number of teams and dividing the result almost evenly by members per team. That would be twice as tricky as the adjustments the OP got only partially correct in his division by rows in the first post.
Since this is an exercise, I don't want to give you an answer; I want to help you find it yourself. I hope you understand and appreciate that.
Yes, of course I appreciate that.
Your hint about whether I should use MPI_Gather or not was really close. I already solved the entire problem and figured out A LOT of my original mistakes. Most of them came down to two big issues:
- Definition of the data structures used for collective communication calls.
- Memory allocation issues.
Thank you very much for your first post.
I am still reading / understanding the other posts, so I can give a cool answer to all of them.
The documentation is less helpful than I would have expected. There are plenty of details documented, but basic concepts are largely missing or buried.
Original poster. The person who wrote the first post of the current thread (you).
When multiple people answering the same thread start discussing subtopics among themselves, a short hand way to refer back to whoever started the topic is often needed.
I was having trouble visualizing an underlying communication system for MPI efficient enough that there could be any savings in elapsed time vs. having one process do the whole job (matrix multiply). The ratio of work to communication goes up some when the matrix is larger. But still not enough to make sense.
Matrix multiplication is just an example operation here, often used in distributed computing courses because it is easy to understand, but also illustrates well the difficulties in efficient data partitioning (for distributed computing).
There is also the fact that you should always design MPI programs to do computations while communications are in progress, preparing for the next operation (or even better, the operation after that one). That way the cost of the communication is negligible. I have designed certain spatial partitioning schemes to allow this for classical molecular dynamics simulations, including Monte Carlo MD simulations, but there has been very little interest in it. Usually people just throw more hardware -- low-latency InfiniBand, more computing nodes -- at it rather than fix the logic in their programs. I do understand their concern, however: it does increase complexity, and a lot of programmers (especially physicists) are unfamiliar with concepts like asynchronous I/O. (For spatially distributing Monte Carlo MD simulations I've been thus far unable to retain perfect detailed balance, although the error is extremely small. The distribution mechanism does, for example, totally eliminate all spatial boundary errors.)
Quote:
Originally Posted by johnsfine
I was thinking of a rectangular division of processes into intersecting teams (of course I don't know how hard teams are to set up in MPI).
Extremely easy. You simply agree (between all participating processes in a team) to create a communicator (corresponds to a group of communicating processes) using the MPI_Comm_ functions, and use the team communicators instead of MPI_COMM_WORLD.
There are also MPI_Cart_ functions to create and manage Cartesian communications topologies, so creating the communicators according to your example would be very easy.
Quote:
Originally Posted by johnsfine
So use a 4x5 set of intersecting teams and waste the extra 3 processes.
That is one workable strategy, but note that the wall clock duration of the operation is measured by the slowest team (or member).
It is usually more efficient to assign the work in a smorgasbord fashion: divide the work into many more chunks than there are workers, and have each worker grab the next available chunk and work on it. The downside is of course the loss of data locality, at least in theory.
In practice, if the master process knows (or the workers tell the master) which sub-matrices the worker asking for more work already has locally, it can use a simple greedy algorithm to minimize the number of sub-matrices transmitted. This is much more robust, since if a worker is delayed for any reason, the system will automatically accommodate that.
Funnily enough, the code is not at all complex when using this kind of smorgasbord or worker pool mechanism (even tends to be quite compact!), but the logic is often hard to grasp for someone unfamiliar with the idea.
Quote:
Originally Posted by johnsfine
Then all the results could be handed to the master, which would get them in a strange sequence
That's actually quite a common thing, and MPI has facilities to handle that gracefully.
For example, you can tag each data chunk with an arbitrary integer. In the receiving end, you can set up asynchronous receives for a specific tag (from a specific process or any process), and then just check or wait until they've all completed. You can define a sparse sub-matrix data type, and use that type to receive the data directly into the correct locations of the result matrix. (If the matrix is a continuous array, you only need one type per each different sub-matrix size. If all the sub-matrices are of the same size, one custom type is sufficient.)
Now that I have finished the exercise entirely, I realized where my errors were! For example, that error in the dynamic memory allocation was annoying, and I actually fixed it, so thank you.
I read your proposed parallel approach to the domain decomposition of this problem in post number 5 and yes, it is awesome! The reason I did it the way I did is that the instructor wanted us to do it specifically like that, so I decided to go with the flow.
Nominal Animal:
Quote:
Note that transposing the right matrix makes the computations a lot more efficient, since the data is then consecutive in memory.
Yes! I actually implemented storing the transpose of matrix B and wrote a serial version of the code for both cases, transposed and not transposed... the execution time decreases dramatically for matrix sizes up to 4000 x 4000.
I would like to thank both of you guys, since your suggestions really taught me a lot!
For me, this assignment was really illustrative, since I also learned how to use gdb for debugging in parallel environments, which is very useful when the problems get really complicated.