Programming
This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
/* Description: This program implements a proposed scheme for the multiplication of two square matrices as a parallel program. It is intended to serve as a program for studying the performance metrics considered when evaluating parallel programs.

Conventions:
1. Variables beginning with "my_" are thought of as local to a process.
2. Most of the comments are written in the first person, since it is intuitive to think in terms of which process "you are".
3. The order in which the routines are listed and defined reflects the hierarchical or inclusion relation among them, in the sense that the first prototype listed is the routine that does not depend on any other routine, and it will be the first routine to be defined after the definition of the main module. */
#include <stdio.h>  /* Defines the standard input and output instructions. */
#include <stdlib.h> /* Defines the constant for execution end status. */
#include <math.h>   /* Used for mathematical functions and constants. */
#include <string.h> /* Used for memset functions. */
#include <mpi.h>    /* Message Passing Interface implementation. */
/* This routine computes the dot product between two vectors passed as arrays of double precision floating point values xx and yy. The dimension NN is assumed to be the same for both of them. Mathematically, this operation returns a scalar value, here represented as a double precision floating point variable. */
double DotProduct (double *xx, double *yy, int NN);
/* Main module: */
int main (int argc, char **argv) {
double **my_AA;      /* Local factor matrix AA. */
double **my_BB;      /* Local factor matrix BB. */
double **temp;       /* Temporary buffer for matrix BB, used in broadcast. */
double **my_CC;      /* Local result or product matrix CC. */
double **CC;         /* Global result or product matrix CC. */
double start_time;   /* Start time for the execution of the multiplication. */
double end_time;     /* Ending time for the execution of the multiplication. */
double my_dot_prod;  /* Partial result of the dot product. */
int mpi_error;       /* Used for MPI-related calls that may lead to an error. */
int my_rank;         /* Used to store the rank of the process. */
int remainder;       /* Used in the domain decomposition. It has to be zero, or else it will indicate a required adjustment in the amount of rows per process. */
int *rows_per_proc;  /* Keeps information on how many rows the procs have. */
int pp;              /* Used to store the number of processes. */
int NN;              /* Dimension of the matrices. */
unsigned int ii;     /* Iterator. */
unsigned int jj;     /* Iterator. */
unsigned int kk;     /* Iterator. */
/* Initialize and ensure no error: */
mpi_error = MPI_Init(&argc, &argv);
if (mpi_error != MPI_SUCCESS) {
  if (my_rank == MASTER)
    printf("ERROR 00: Initialization failed! Exiting...");
  return EXIT_FAILURE;
}

/* Parameters of the problem: */
NN = 6;

/* Obtain my rank AND the number of processes in the global execution: */
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &pp);

/* Check if parallelization makes sense: */
if (NN < pp) {
  if (my_rank == MASTER) {
    printf("ERROR 01: You actually have more processes than rows.\n");
    printf("Exiting...\n");
  }
  return EXIT_FAILURE;
}
/* Memory allocations: */
/* The first memory allocation considered is the one for the array containing
   the information of the domain decomposition. This allocation will be done
   first, and then the actual domain decomposition will be performed. */
rows_per_proc = (int *) malloc(pp*sizeof(int));
if (rows_per_proc == NULL) {
  if (my_rank == MASTER) {
    printf("ERROR 02: Row information array could not be allocated!\n");
    printf("Exiting...\n");
  }
  return EXIT_FAILURE;
}
if (my_rank == MASTER)
  printf("MASTER: I need to distribute %i rows among %i processes.\n", NN, pp);

/* Domain decomposition: */
/* Determine how many rows each process is going to work with: */
remainder = NN % pp;
/* If the number of rows allows an equal distribution among the processes,
   then each process will work with NN/pp rows; but if this division is not
   even, the remaining rows will be distributed among the processes, starting
   from the first one. */
/* So, every process first computes the even amount of rows: */
for (ii = 0; ii < pp; ii++) {
  rows_per_proc[ii] = (NN - remainder)/pp;
}
/* And second, every process gets its extra row until the remainder is
   covered. Notice that if the remainder is actually 0, that is, an even
   division NN/pp, then this loop is NOT executed. */
for (ii = 0; ii < remainder; ii++) {
  rows_per_proc[ii]++;
}
/* The master shows the global array containing the information of the rows
   per process: */
if (my_rank == MASTER) {
  printf("MASTER: Rows per processor: ");
  for (ii = 0; ii < pp; ii++)
    printf("%i ", rows_per_proc[ii]);
  printf("\n");
}
/* Each process creates its pieces of the matrices. */
/* Memory allocation for the matrices. Notice that every allocation is verified. */

/* 1. Memory allocation for my_AA: */
my_AA = (double **) malloc(rows_per_proc[my_rank]*sizeof(double));
if (my_AA != NULL) {
  for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
    my_AA[ii] = (double *) malloc(NN*sizeof(double));
  }
} else {
  if (my_rank == MASTER) {
    printf("ERROR 03: Matrix my_AA could not be allocated!\n");
    printf("Exiting...\n");
  }
  return EXIT_FAILURE;
}
/* Filling my_AA with values: */
for (ii = 0; ii < rows_per_proc[my_rank]; ii++)
  for (jj = 0; jj < NN; jj++)
    my_AA[ii][jj] = 1.0;
/* Printing my_AA: */
printf("PROC %d: My %d rows of my_AA: \n", my_rank, rows_per_proc[my_rank]);
for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
  for (jj = 0; jj < NN; jj++) {
    printf("%lf ", my_AA[ii][jj]);
  }
  printf("\n");
}

/* 2. Memory allocation for my_BB: */
my_BB = (double **) malloc(rows_per_proc[my_rank]*sizeof(double));
if (my_BB != NULL) {
  for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
    my_BB[ii] = (double *) malloc(NN*sizeof(double));
  }
} else {
  if (my_rank == MASTER) {
    printf("ERROR 04: Matrix my_BB could not be allocated!\n");
    printf("Exiting...\n");
  }
  return EXIT_FAILURE;
}
/* Filling my_BB with values: */
for (ii = 0; ii < rows_per_proc[my_rank]; ii++)
  for (jj = 0; jj < NN; jj++)
    my_BB[ii][jj] = pow(-1.0, jj);
/* Printing my_BB: */
printf("PROC %d: My %d rows of my_BB: \n", my_rank, rows_per_proc[my_rank]);
for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
  for (jj = 0; jj < NN; jj++) {
    printf("%lf ", my_BB[ii][jj]);
  }
  printf("\n");
}
/* 3. Memory allocation for temp: */
temp = (double **) malloc(rows_per_proc[my_rank]*sizeof(double));
if (temp != NULL) {
  for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
    temp[ii] = (double *) malloc(NN*sizeof(double));
  }
} else {
  if (my_rank == MASTER) {
    printf("ERROR 05: Matrix temp could not be allocated!\n");
    printf("Exiting...\n");
  }
  return EXIT_FAILURE;
}

/* 4. Memory allocation for my_CC: */
my_CC = (double **) malloc(rows_per_proc[my_rank]*sizeof(double));
if (my_CC != NULL) {
  for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
    my_CC[ii] = (double *) malloc(NN*sizeof(double));
  }
} else {
  if (my_rank == MASTER) {
    printf("ERROR 06: Matrix my_CC could not be allocated!\n");
    printf("Exiting...\n");
  }
  return EXIT_FAILURE;
}
for (ii = 0; ii < rows_per_proc[my_rank]; ii++)
  for (jj = 0; jj < NN; jj++)
    my_CC[ii][jj] = -1.0;

printf("PROC %d: My %d rows of my_CC: \n", my_rank, rows_per_proc[my_rank]);
for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
  for (jj = 0; jj < NN; jj++) {
    printf("%lf ", my_CC[ii][jj]);
  }
  printf("\n");
}
/* Once the data is set, start measuring time and compute the product: */
if (my_rank == MASTER)
  start_time = MPI_Wtime();
/* Each process broadcasts its piece of BB, that is, my_BB: */
for (kk = 0; kk < pp; kk++) {
  if (my_rank == kk) {
    temp = my_BB;
    mpi_error = MPI_Bcast(temp, NN*rows_per_proc[my_rank], MPI_DOUBLE, kk, MPI_COMM_WORLD);
    if (mpi_error != MPI_SUCCESS) {
      if (my_rank == MASTER)
        printf("ERROR 07: Broadcasting of slices of my_BB failed! Exiting!");
      return EXIT_FAILURE;
    }
    /* Once broadcast, every process computes the dot product for each of its columns: */
    for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
      for (jj = 0; jj < NN; jj++) {
        my_CC[ii][jj] = DotProduct(my_AA[ii], temp[ii], NN);
      }
    }
  }
}
printf("PROC %d: My rows of my_CC TOTAL: \n", my_rank);
for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
  for (jj = 0; jj < NN; jj++) {
    printf("%lf ", my_CC[ii][jj]);
  }
  printf("\n");
}
/* Finally, the MASTER process gathers the partial results into a global matrix
   named CC. Again, we wait until here for the allocation of this global array,
   because there is no point in having the root process do it if, at the point
   of the execution where the broadcast occurs, an error could lead to a
   premature interruption of the code. Clearly, every other allocation
   considered so far has to be done at its respective point, since they are
   needed in the computation of the required product. */
/* Make sure everyone has finished before calling the gathering: */
MPI_Barrier(MPI_COMM_WORLD);
if (my_rank == MASTER) {
/* Memory allocation (and its verification) for CC: */
CC = (double **) malloc(NN*sizeof(double));
if (CC != NULL) {
  for (ii = 0; ii < NN; ii++) {
    CC[ii] = (double *) malloc(NN*sizeof(double));
  }
} else {
  if (my_rank == MASTER) {
    printf("ERROR 08: Matrix CC could not be allocated!\n");
    printf("Exiting...\n");
  }
  return EXIT_FAILURE;
}
for (ii = 0; ii < NN; ii++)
  for (jj = 0; jj < NN; jj++)
    CC[ii][jj] = -1.0;
printf("\n\nMASTER: CC BEFORE GATHER: \n");
for (ii = 0; ii < NN; ii++) {
  for (jj = 0; jj < NN; jj++) {
    printf("%lf ", CC[ii][jj]);
  }
  printf("\n");
}
printf("\n");
}
if (my_rank == MASTER) {
  printf("MASTER: CC: \n");
  for (ii = 0; ii < NN; ii++) {
    for (jj = 0; jj < NN; jj++) {
      printf("%lf ", CC[ii][jj]);
    }
    printf("\n");
  }
  printf("\n\n");
}

/* Finalize time measure: */
if (my_rank == MASTER)
  end_time = MPI_Wtime();

/* Finalize MPI AND finalize execution: */
MPI_Finalize();
return EXIT_SUCCESS;
}
/* This routine computes the dot product between two vectors passed as arrays of double precision floating point values xx and yy. The dimension NN is assumed to be the same for both of them. Mathematically, this operation returns a scalar value, here represented as a double precision floating point variable. */
double DotProduct (double *xx, double *yy, int NN) {
unsigned int ii;
double aux;

aux = 0.0;
for (ii = 0; ii < NN; ii++) {
  aux = aux + xx[ii]*yy[ii];
}
return aux;
}
It's giving me a seg-fault right at the end, at the point where I do the final gather from the MASTER processor.
I just can't find... why is my code seg-faulting?
It's giving me a seg-fault right at the end, at the point where I do the final gather from the MASTER processor.
Since this is an exercise, I don't want to give you an answer; I want to help you find it yourself. I hope you understand and appreciate that.
The code won't crash; this confirms your diagnosis of where your program actually crashes.
So, let's check the parameters one by one. See man MPI_Gather for the details.
my_CC
This is the send buffer.
rows_per_proc[my_rank]*NN
This is the number of elements in the send buffer. Umm.. Are you sure about this value?
MPI_DOUBLE
This is the data type of elements in the send buffer.
CC
This is the receive buffer. It's only used in the receiving process, but you really should set it to NULL on others.
If you use warnings when compiling, e.g. mpicc -Wall, you'll see the note about CC.
rows_per_proc[my_rank]*NN
This is the number of elements per process in the receive buffer. Umm.. Are you sure about this value?
MPI_DOUBLE
This is the data type of elements in the receive buffer. (MPI can do some data type conversions on the fly.)
MASTER
Receiving process. You've defined this as zero, which means the root process. Quite okay.
MPI_COMM_WORLD
Gather is done from the entire group; okay.
Perhaps you should reconsider your MPI_Gather call. Perhaps you should take a look at the MPI_Gather example first? Pay attention to why the example uses MPI_Gatherv instead. Also, remember that the receive buffer will contain the data from each of the pp processes, including the receive process itself.
Although MPI_Gatherv() is pretty much perfect for this case, you can try to get your program working by replacing it with MPI_Send()s on all but the master process, and corresponding MPI_Recv()s on the master process. If you do that, it would probably help you understand which kind of MPI_Gather()/MPI_Gatherv() call you need to do it all at once.
Since this is an exercise, I don't want to give you an answer; I want to help you find it yourself. I hope you understand and appreciate that.
I don't know MPI myself, but I expected to be able to understand an example this easy from the documentation and help the OP. The documentation is less helpful than I would have expected. There are plenty of details documented, but basic concepts are largely missing or buried.
So I think you could have explained a few of the basic concepts the OP seems to have misunderstood or pointed to some of the key spots in the doc that a beginner might easily miss. That still isn't doing the exercise.
Quote:
So, let's check the parameters one by one. See man MPI_Gather for the details.
Those parameter descriptions are a bit short of understandable, especially recvcount. The OP seems to have misunderstood recvbuf as well.
The outcome is as if each of the n processes in the group (including the root process) had executed a call to

MPI_Send(sendbuf, sendcount, sendtype, root, ...)

and the root had executed n calls to

MPI_Recv(recvbuf + i*recvcount*extent(recvtype), recvcount, recvtype, i, ...)
That gives you a direct understanding of how recvcount and recvbuf are used by MPI_Gather.
Eduardo, is the receive buffer you allocated compatible with MPI_Gather using it as the text I just quoted describes?
Quote:
Although MPI_Gatherv() is pretty much perfect for this case, you can try to get your program working by replacing it with MPI_Send()s on all but the master process, and corresponding MPI_Recv()s on the master process.
That's a very good suggestion.
Quote:
If you do that, it would probably help you understand which kind of MPI_Gather()/MPI_Gatherv() call you need to do it all at once.
Only after you find the documentation that tells you what MPI_Gather()/MPI_Gatherv() really do, rather than the documentation that ambiguously documents their parameters.
Also, all the places the original code says
(double **) malloc( something *sizeof(double))
are incorrect. That should be
(double **) malloc( something *sizeof(double*))
That error just wastes memory. It won't cause any crash nor wrong answer. But it is still an error.
BTW, I haven't looked through the details of your actual matrix product logic, nor do I know best technique for distributing a matrix product. But even trivial thought regarding data flow should tell you it is lame to parallelize the problem by dividing into groups of rows. You should parallelize by dividing the solution matrix into nearly square rectangular chunks.
So I think you could have explained a few of the basic concepts the OP seems to have misunderstood or pointed to some of the key spots in the doc that a beginner might easily miss. That still isn't doing the exercise.
nor do I know best technique for distributing a matrix product. But even trivial thought regarding data flow should tell you it is lame to parallelize the problem by dividing into groups of rows. You should parallelize by dividing the solution matrix into nearly square rectangular chunks.
I agree. You can then distribute either rectangular parts of the left and right matrices, or full rows of the left matrix and full columns of the right matrix. It'd be quite interesting to compare the total costs of each method..
For very small matrices, say < 20x20, I'd just broadcast the entire left and right side matrices to each process, have them do some of the computational work, into a result matrix which is initialized to zeros. Then you can reduce them using sum operation, to get the actual result matrix -- and it does not matter which process has done which part of the work, as long as all of the computations have been done.
It is always a good idea to think about the operation at hand:
result[row][column] = sum over i: left[row][i] * right[i][column]
Note that transposing the right matrix makes the computations a lot more efficient, since the data is then consecutive in memory.
I agree. You can then distribute either rectangular parts of the left and right matrices, or full rows of the left matrix and full columns of the right matrix. It'd be quite interesting to compare the total costs of each method.
I was having trouble visualizing an underlying communication system for MPI efficient enough that there could be any savings in elapsed time vs. having one process do the whole job (matrix multiply). The ratio of work to communication goes up some when the matrix is larger. But still not enough to make sense.
I was thinking of a rectangular division of processes into intersecting teams (of course I don't know how hard teams are to set up in MPI). Given the total number of processes, compute the integer square root, then the integer quotient, for the two team dimensions. Example: 23 processes. The integer square root of 23 is 4, and 23/4 is 5. So use a 4x5 set of intersecting teams and waste the extra 3 processes.
Each process would be a member of one of four teams of 5 and one of five teams of 4. Each team of 5 would multiply all of the A rows by one quarter of the B columns. Each team of 4 would multiply one fifth of the A rows by all the B columns. So each process acting simultaneously in two teams would multiply one fifth of the A rows times one quarter of the B columns.
Based on the OP's design, each process would first invent data for its twentieth of the A rows and its twentieth of the B columns.
Then the 5 row teams in parallel would each go through 4 team members in sequence to exchange/keep row data, so each member of a row team then had the rows of the entire team.
Then the 4 column teams in parallel would each go through 5 team members in sequence to exchange column data, and as each chunk of column data was exchanged, each process could multiply it by all the rows of its row team and store those results and discard that column chunk before going on to the next.
Then all the results could be handed to the master, which would get them in a strange sequence (by rectangular chunk, rather than by row). If it just needs to display the results it just needs an accessor function that takes row and column number and grabs the right element from the chunks oriented result. If it needs significant further work, it would need to reorder the result. But that is likely less expensive than using MPI comm enough more to deliver the result in final sequence.
If you add up all the data flow (especially if comm by teams in parallel really is in parallel) you should see enough less total work to more than compensate for idling a few processes.
If the matrix dimension isn't divided evenly by the (adjusted) number of processes, that leads to two level special case setup of dividing the matrix almost evenly by number of teams and dividing the result almost evenly by members per team. That would be twice as tricky as the adjustments the OP got only partially correct in his division by rows in the first post.
Since this is an exercise, I don't want to give you an answer; I want to help you find it yourself. I hope you understand and appreciate that.
Yes, of course I appreciate that.
Your hint about whether I should use MPI_Gather or not was really close. I already solved the entire problem and figured out A LOT of my original mistakes. Most of them came down to two big issues:
- Definition of the data structures used for collective communication calls.
- Memory allocation issues.
Thank you very much for your first post.
I am still reading / understanding the other posts, so I can give a cool answer to all of them.
The documentation is less helpful than I would have expected. There are plenty of details documented, but basic concepts are largely missing or buried.
Original poster. The person who wrote the first post of the current thread (you).
When multiple people answering the same thread start discussing subtopics among themselves, a short hand way to refer back to whoever started the topic is often needed.
I was having trouble visualizing an underlying communication system for MPI efficient enough that there could be any savings in elapsed time vs. having one process do the whole job (matrix multiply). The ratio of work to communication goes up some when the matrix is larger. But still not enough to make sense.
Matrix multiplication is just an example operation here, often used in distributed computing courses because it is easy to understand, but also illustrates well the difficulties in efficient data partitioning (for distributed computing).
There is also the fact that you should always design MPI programs to do computations while communications are in progress, preparing for the next operation (or even better, the operation after that one). That way the cost of the communication is negligible. I have designed certain spatial partitioning schemes to allow this for classical molecular dynamics simulations, including Monte Carlo MD simulations, but there has been very little interest in it. Usually people just throw more hardware -- low-latency InfiniBand, more computing nodes -- at it rather than fix the logic in their programs. I do understand their concern, however: it does increase complexity, and a lot of programmers (especially physicists) are unfamiliar with concepts like asynchronous I/O. (For spatially distributing Monte Carlo MD simulations I've been thus far unable to retain perfect detailed balance, although the error is extremely small. The distribution mechanism does, for example, totally eliminate all spatial boundary errors.)
Quote:
Originally Posted by johnsfine
I was thinking of a rectangular division of processes into intersecting teams (of course I don't know how hard teams are to set up in MPI).
Extremely easy. You simply agree (between all participating processes in a team) to create a communicator (corresponds to a group of communicating processes) using the MPI_Comm_ functions, and use the team communicators instead of MPI_COMM_WORLD.
There are also MPI_Cart_ functions to create and manage Cartesian communications topologies, so creating the communicators according to your example would be very easy.
Quote:
Originally Posted by johnsfine
So use a 4x5 set of intersecting teams and waste the extra 3 processes.
That is one workable strategy, but note that the wall clock duration of the operation is measured by the slowest team (or member).
It is usually more efficient to assign the work in a smorgasbord fashion: divide the work into many more chunks than there are workers, and have each worker grab the next available chunk and work on it. The downside is of course the loss of data locality, at least in theory.
In practice, if the master process knows (or the workers tell the master) which sub-matrices the worker asking for more work already has locally, it can use a simple greedy algorithm to minimize the number of sub-matrices transmitted. This is much more robust, since if a worker is delayed for any reason, the system will automatically accommodate that.
Funnily enough, the code is not at all complex when using this kind of smorgasbord or worker pool mechanism (even tends to be quite compact!), but the logic is often hard to grasp for someone unfamiliar with the idea.
Quote:
Originally Posted by johnsfine
Then all the results could be handed to the master, which would get them in a strange sequence
That's actually quite a common thing, and MPI has facilities to handle that gracefully.
For example, you can tag each data chunk with an arbitrary integer. In the receiving end, you can set up asynchronous receives for a specific tag (from a specific process or any process), and then just check or wait until they've all completed. You can define a sparse sub-matrix data type, and use that type to receive the data directly into the correct locations of the result matrix. (If the matrix is a continuous array, you only need one type per each different sub-matrix size. If all the sub-matrices are of the same size, one custom type is sufficient.)
Now that I have finished the exercise entirely, I realized where my errors were! For example, that error in the dynamic memory allocation was annoying, and I actually fixed it, so thank you.
I read your proposed parallel approach to the domain decomposition of this problem in post number 5 and yes, it is awesome! The reason I did it the way I did is that the instructor wanted us to do it specifically like that, so I decided to go with the flow.
Nominal Animal:
Quote:
Note that transposing the right matrix makes the computations a lot more efficient, since the data is then consecutive in memory.
Yes! I actually implemented storing the transpose of matrix B and wrote a serial version of the code for both cases, transposed and not transposed... the execution time decreases dramatically for matrix sizes up to 4000 x 4000.
I would like to thank both of you guys, since your suggestions really taught me a lot!
For me, this assignment was really illustrative, since I also learned how to use gdb for debugging in parallel environments, which is very useful when the problems get really complicated.