Old 04-13-2011, 11:59 AM   #1
ejspeiro
Member
 
Registered: Feb 2011
Distribution: Ubuntu 14.04 LTS (Trusty Tahr)
Posts: 203

Rep: Reputation: 26
Parallel matrix - matrix multiplication seg-faults


I just can't find... why is my code seg-faulting?

Code:
/*
  Code name: Assignment #7.
  File name: slice.c
  Program name: Matrix multiplication (parallel version).
  Version: 1.
  Author: Eduardo Sanchez: esanchez@sciences.sdsu.edu.

  Description: This program implements a proposed scheme to implement an
  algorithm for the multiplication of two square matrices as a parallel
  program. It is intended to serve as a program to study the performance
  metrics considered when evaluating parallel programs.

  Conventions:
    1. Variables beginning with "my_" are thought of as local to a process.
    2. Most of the comments are written in the first person, since it is
       intuitive to think in terms of which process "you are".
    3. The order in which the routines are listed and defined reflects the
       hierarchic or inclusion relation among them, in the sense that the
       first prototype listed is the routine that does not depend on any
       other routine and that will be the first routine to be defined after
       the definition of the main module.
*/

/*______________________________________________________________________

  Required header files:
  ______________________________________________________________________*/

#include <stdio.h>  /* Defines the standard input and output instructions. */
#include <stdlib.h> /* Defines the constant for execution end status. */
#include <math.h>   /* Used for mathematical functions and constants. */
#include <string.h> /* Used for the memset function. */
#include <mpi.h>    /* Message Passing Interface implementation. */

/*______________________________________________________________________

  Defined constants:
  ______________________________________________________________________*/

#define MASTER  0 /* Master process' ID. */

/*______________________________________________________________________

  Subroutines:
  ______________________________________________________________________*/

/* This routine computes the dot product between two vectors passed as arrays
   of double precision floating point values xx and yy. The dimension NN is
   assumed to be the same for both of them. Mathematically, this operation
   returns a scalar value, here represented as a double precision floating
   point variable. */

double DotProduct (double *xx, double *yy, int NN);

/* Main module: */

int main (int argc, char **argv) {

  double **my_AA;     /* Local factor matrix AA. */
  double **my_BB;     /* Local factor matrix BB. */
  double **temp;      /* Temporary buffer for matrix BB, used in broadcast. */
  double **my_CC;     /* Local result or product matrix CC. */
  double **CC;        /* Global result or product matrix CC. */
  double start_time;  /* Start time for the execution of the multiplication. */
  double end_time;    /* Ending time for the execution of the multiplication. */
  double my_dot_prod; /* Partial result of the dot product. */
  int mpi_error;      /* Used for MPI-related calls that may lead to an error. */
  int my_rank;        /* Used to store the rank of the process. */
  int remainder;      /* Used in the domain decomposition. It has to be zero
                         or else it will indicate a required adjustment in the
                         amount of rows per process. */
  int *rows_per_proc; /* Keeps information on how many rows the procs have. */
  int pp;             /* Used to store the number of processes. */
  int NN;             /* Dimension of the matrices. */
  unsigned int ii;    /* Iterator. */
  unsigned int jj;    /* Iterator. */
  unsigned int kk;    /* Iterator. */

  /* Initialize and ensure no error: */
  mpi_error = MPI_Init(&argc, &argv);
  if (mpi_error != MPI_SUCCESS) {
    if (my_rank == MASTER)
      printf("ERROR 00: Initialization failed! Exiting...");
    return EXIT_FAILURE;
  }

  /* Parameters of the problem: */
  NN = 6;

  /* Obtain my rank AND number of processes in the global execution: */
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &pp);

  /* Check if parallelization makes sense: */
  if (NN < pp) {
    if (my_rank == MASTER) {
      printf("ERROR 01: You have actually more processes than rows.\n");
      printf("Exiting...\n");
    }
    return EXIT_FAILURE;
  }

  /* Memory allocations: */

  /* The first considered memory allocation is the one for the array containing
     the information of the domain decomposition. This allocation will be done
     and then, the actual domain decomposition will be performed. */
  rows_per_proc = (int*) malloc(pp*sizeof(int));
  if (rows_per_proc == NULL) {
    if (my_rank == MASTER) {
      printf("ERROR 02: Row information array could not be allocated!\n");
      printf("Exiting...\n");
    }
    return EXIT_FAILURE;
  }
  if (my_rank == MASTER)
    printf("MASTER: I need to repart %i rows among %i processes.\n", NN, pp);

  /* Domain decomposition: */
  /* Determine how many rows each process is going to work with: */
  remainder = NN % pp;

  /* If the number of rows allows an equal distribution of these with respect
     to the processes, then each process will work with NN/pp rows; but if this
     division is not even, the remaining rows will be equally distributed among
     the processes, starting from the first one. */
  /* So, every process will first compute the even amount of rows: */
  for (ii = 0; ii < pp; ii++) {
    rows_per_proc[ii] = (NN - remainder)/pp;
  }

  /* And 2nd, every process gets its extra row until the remainder is covered.
     Notice that if the remainder is actually 0, that is, an even division of
     NN / pp, then this loop is NOT executed. */
  for (ii = 0; ii < remainder; ii++) {
    rows_per_proc[ii]++;
  }

  /* The master shows the global array containing the information of the rows
     per process: */
  if (my_rank == MASTER) {
    printf("MASTER: Rows per processor: ");
    for (ii = 0; ii < pp; ii++)
      printf("%i ", rows_per_proc[ii]);
    printf("\n");
  }

  /* Each process creates its pieces of the matrices: */
  /* Memory allocation for the matrices. Notice that every allocation is
     verified. */
  /* 1. Memory allocation for my_AA: */
  my_AA = (double **) malloc(rows_per_proc[my_rank]*sizeof(double));
  if (my_AA != NULL) {
    for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
      my_AA[ii] = (double *) malloc(NN*sizeof(double));
    }
  } else {
    if (my_rank == MASTER) {
      printf("ERROR 03: Matrix my_AA could not be allocated!\n");
      printf("Exiting...\n");
    }
    return EXIT_FAILURE;
  }

  /* Filling my_AA with values: */
  for (ii = 0; ii < rows_per_proc[my_rank]; ii++)
    for (jj = 0; jj < NN; jj++)
      my_AA[ii][jj] = 1.0;

  /* Printing my_AA: */
  printf("PROC %d: My %d rows of my_AA: \n", my_rank, rows_per_proc[my_rank]);
  for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
    for (jj = 0; jj < NN; jj++) {
      printf("%lf ", my_AA[ii][jj]);
    }
    printf("\n");
  }

  /* 2. Memory allocation for my_BB: */
  my_BB = (double **) malloc(rows_per_proc[my_rank]*sizeof(double));
  if (my_BB != NULL)
    for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
      my_BB[ii] = (double *) malloc(NN*sizeof(double));
    }
  else {
    if (my_rank == MASTER) {
      printf("ERROR 04: Matrix my_BB could not be allocated!\n");
      printf("Exiting...\n");
    }
    return EXIT_FAILURE;
  }

  /* Filling my_BB with values: */
  for (ii = 0; ii < rows_per_proc[my_rank]; ii++)
    for (jj = 0; jj < NN; jj++)
      my_BB[ii][jj] = pow(-1.0, jj);

  /* Printing my_BB: */
  printf("PROC %d: My %d rows of my_BB: \n", my_rank, rows_per_proc[my_rank]);
  for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
    for (jj = 0; jj < NN; jj++) {
      printf("%lf ", my_BB[ii][jj]);
    }
    printf("\n");
  }

  /* 3. Memory allocation for temp: */
  temp = (double **) malloc(rows_per_proc[my_rank]*sizeof(double));
  if (temp != NULL)
    for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
      temp[ii] = (double *) malloc(NN*sizeof(double));
    }
  else {
    if (my_rank == MASTER) {
      printf("ERROR 05: Matrix temp could not be allocated!\n");
      printf("Exiting...\n");
    }
    return EXIT_FAILURE;
  }

  /* 4. Memory allocation for my_CC: */
  my_CC = (double **) malloc(rows_per_proc[my_rank]*sizeof(double));
  if (my_CC != NULL)
    for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
      my_CC[ii] = (double *) malloc(NN*sizeof(double));
    }
  else {
    if (my_rank == MASTER) {
      printf("ERROR 06: Matrix my_CC could not be allocated!\n");
      printf("Exiting...\n");
    }
    return EXIT_FAILURE;
  }

  for (ii = 0; ii < rows_per_proc[my_rank]; ii++)
    for (jj = 0; jj < NN; jj++)
      my_CC[ii][jj] = -1.0;

  printf("PROC %d: My %d rows of my_CC: \n", my_rank, rows_per_proc[my_rank]);

  for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
    for (jj = 0; jj < NN; jj++) {
      printf("%lf ", my_CC[ii][jj]);
    }
    printf("\n");
  }

  /* Once the data is set, start measuring time and compute the product: */
  if (my_rank == MASTER)
    start_time = MPI_Wtime();

  /* Each process broadcasts its piece of BB, that is my_BB: */
  for (kk = 0; kk < pp; kk++) {
    if (my_rank == kk) {
      temp = my_BB;
      mpi_error = MPI_Bcast(temp,
                            NN*rows_per_proc[my_rank],
                            MPI_DOUBLE,
                            kk,
                            MPI_COMM_WORLD);
      if (mpi_error != MPI_SUCCESS) {
        if (my_rank == MASTER)
          printf("ERROR 07: Broadcasting of slices of my_BB failed! Exiting!");
        return EXIT_FAILURE;
      }
      /* Once broadcast, every process computes the dot product for each of
         their columns: */
      for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
        for (jj = 0; jj < NN; jj++) {
          my_CC[ii][jj] = DotProduct(my_AA[ii], temp[ii], NN);
        }
      }
    }
  }

  printf("PROC %d: My rows of my_CC TOTAL: \n", my_rank);
  for (ii = 0; ii < rows_per_proc[my_rank]; ii++) {
    for (jj = 0; jj < NN; jj++) {
      printf("%lf ", my_CC[ii][jj]);
    }
    printf("\n");
  }

  /* Finally, the MASTER process gathers the partial results into a global
     matrix named CC. Again, WE WILL WAIT FOR THE ALLOCATION OF THIS GLOBAL
     ARRAY, BECAUSE THERE IS NO POINT IN HAVING THE ROOT PROCESS DOING IT IF,
     AT THE SPECIFIC POINT OF THE EXECUTION OF THE CODE WHERE THE BROADCAST
     OCCURS, AN ERROR COULD LEAD TO A PREMATURE INTERRUPTION OF THE CODE.
     Clearly, every other allocation considered so far has to be done at its
     respective point, since they are needed in the computation of the
     required product. */

  /* Make sure everyone has finished before calling the gathering: */
  MPI_Barrier(MPI_COMM_WORLD);

  if (my_rank == MASTER) {

    /* Memory allocation (and its verification) for CC: */
    CC = (double **) malloc(NN*sizeof(double));
    if (CC != NULL) {
      for (ii = 0; ii < NN; ii++) {
        CC[ii] = (double *) malloc(NN*sizeof(double));
      }
    } else {
      if (my_rank == MASTER) {
        printf("ERROR 08: Matrix CC could not be allocated!\n");
        printf("Exiting...\n");
      }
      return EXIT_FAILURE;
    }

    for (ii = 0; ii < NN; ii++)
      for (jj = 0; jj < NN; jj++)
        CC[ii][jj] = -1.0;

    printf("\n\nMASTER: CC BEFORE GATHER: \n");
    for (ii = 0; ii < NN; ii++) {
      for (jj = 0; jj < NN; jj++) {
        printf("%lf ", CC[ii][jj]);
      }
      printf("\n");
    }
    printf("\n");
  }

  MPI_Gather( my_CC,
              rows_per_proc[my_rank]*NN,
              MPI_DOUBLE,
              CC,
              rows_per_proc[my_rank]*NN,
              MPI_DOUBLE,
              MASTER,
              MPI_COMM_WORLD);

  if (my_rank == MASTER) {
    printf("MASTER: CC: \n");
    for (ii = 0; ii < NN; ii++) {
      for (jj = 0; jj < NN; jj++) {
        printf("%lf ", CC[ii][jj]);
      }
      printf("\n");
    }
    printf("\n\n");
  }

  /* Finalize time measure: */
  if (my_rank == MASTER)
    end_time = MPI_Wtime();

  /* Finalize MPI AND finalize execution: */
  MPI_Finalize();
  return EXIT_SUCCESS;
}

/* This routine computes the dot product between two vectors passed as arrays
   of double precision floating point values xx and yy. The dimension NN is
   assumed to be the same for both of them. Mathematically, this operation
   returns a scalar value, here represented as a double precision floating
   point variable. */

double DotProduct (double *xx, double *yy, int NN) {

  unsigned int ii;
  double aux;

  aux = 0.0;
  for (ii = 0; ii < NN; ii++) {
    aux = aux + xx[ii]*yy[ii];
  }
  return aux;
}

It is giving me a seg-fault right at the end, where I do the final gather on the MASTER process.

Any help will be welcome!
 
Old 04-13-2011, 03:37 PM   #2
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by ejspeiro View Post
I just can't find... why is my code seg-faulting?
It is giving me a seg-fault right at the end, where I do the final gather on the MASTER process.
Since this is an exercise, I don't want to give you an answer; I want to help you find it yourself. I hope you understand and appreciate that.

First, if you comment out the
Code:
  MPI_Gather( my_CC,
              rows_per_proc[my_rank]*NN,
              MPI_DOUBLE,
              CC,
              rows_per_proc[my_rank]*NN,
              MPI_DOUBLE,
              MASTER,
              MPI_COMM_WORLD);
the code won't crash; this confirms your diagnosis on where your program actually crashes.

So, let's check the parameters one by one. See man MPI_Gather for the details.
  • my_CC
    This is the send buffer.
  • rows_per_proc[my_rank]*NN
    This is the number of elements in the send buffer. Umm.. Are you sure about this value?
  • MPI_DOUBLE
    This is the data type of elements in the send buffer.
  • CC
    This is the receive buffer. It's only used in the receiving process, but you really should set it to NULL on others.
    If you compile with warnings enabled, e.g. mpicc -Wall, you'll see the note about CC.
  • rows_per_proc[my_rank]*NN
    This is the number of elements per process in the receive buffer. Umm.. Are you sure about this value?
  • MPI_DOUBLE
    This is the data type of elements in the receive buffer. (MPI can do some data type conversions on the fly.)
  • MASTER
    Receiving process. You've defined this as zero, which means the root process. Quite okay.
  • MPI_COMM_WORLD
    Gather is done from the entire group; okay.

Perhaps you should reconsider your MPI_Gather call. Perhaps you should take a look at the MPI_Gather example first? Pay attention to why the example uses MPI_Gatherv instead. Also, remember that the receive buffer will contain the data from each of the pp processes, including the receiving process itself.

Although MPI_Gatherv() is pretty much perfect for this case, you can try to get your program working by replacing it with MPI_Send()s on all but the master process, and corresponding MPI_Recv()s on the master process. If you do that, it would probably help you understand which kind of MPI_Gather()/MPI_Gatherv() call you need to do it all at once.
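Just to illustrate the shape such a replacement could take, here is a sketch only, and it assumes each process keeps its slice of the result in one contiguous buffer (which is not how your my_CC is laid out right now); the *_flat names are made up:
Code:
/* Sketch: my_CC_flat is assumed to be a contiguous buffer of
   rows_per_proc[my_rank]*NN doubles on every process, and CC_flat a
   contiguous buffer of NN*NN doubles on the master. */
if (my_rank != MASTER) {
  /* Workers send their block of rows to the master. */
  MPI_Send(my_CC_flat, rows_per_proc[my_rank]*NN, MPI_DOUBLE,
           MASTER, 0, MPI_COMM_WORLD);
} else {
  int src;
  int offset = rows_per_proc[MASTER]*NN;
  /* The master copies its own rows first... */
  memcpy(CC_flat, my_CC_flat, rows_per_proc[MASTER]*NN*sizeof(double));
  /* ...then receives each worker's rows into the right position. */
  for (src = 1; src < pp; src++) {
    MPI_Recv(CC_flat + offset, rows_per_proc[src]*NN, MPI_DOUBLE,
             src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    offset += rows_per_proc[src]*NN;
  }
}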

Hope this helps.
 
Old 04-13-2011, 04:16 PM   #3
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1197
Quote:
Originally Posted by Nominal Animal View Post
Since this is an exercise, I don't want to give you an answer; I want to help you find it yourself. I hope you understand and appreciate that.
I don't know MPI myself, but I expected to be able to understand an example this easy from the documentation and help the OP. The documentation is less helpful than I would have expected. There are plenty of details documented, but basic concepts are largely missing or buried.

So I think you could have explained a few of the basic concepts the OP seems to have misunderstood or pointed to some of the key spots in the doc that a beginner might easily miss. That still isn't doing the exercise.

Quote:
So, let's check the parameters one by one. See man MPI_Gather for the details.
Those parameter descriptions are a bit short of understandable, especially recvcount. The OP seems to have misunderstood recvbuf as well.

Near that page there is a much more helpful section:
http://www.mpi-forum.org/docs/mpi-11...69.html#Node69

Notice where it says
Code:
The outcome is as if each of the n processes in
the group (including the root process) had executed a call to
MPI_Send(sendbuf, sendcount, sendtype, root, ...)
and the root had executed n calls to
MPI_Recv(recvbuf + i*recvcount*extent(recvtype), recvcount, recvtype, i, ...)
That gives you a direct understanding of how recvcount and recvbuf are used by MPI_Gather.

Eduardo, is the receive buffer you allocated compatible with MPI_Gather using it as the text I just quoted describes?

Quote:
Although MPI_Gatherv() is pretty much perfect for this case, you can try to get your program working by replacing it with MPI_Send()s on all but the master process, and corresponding MPI_Recv()s on the master process.
That's a very good suggestion.

Quote:
If you do that, it would probably help you understand which kind of MPI_Gather()/MPI_Gatherv() call you need to do it all at once.
Only after you find the documentation that tells you what MPI_Gather()/MPI_Gatherv() really do, rather than the documentation that ambiguously documents their parameters.

Also, all the places the original code says
(double **) malloc( something *sizeof(double))
are incorrect. That should be
(double **) malloc( something *sizeof(double*))
That error just wastes memory. It won't cause any crash nor wrong answer. But it is still an error.
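To make that concrete, here is a minimal sketch of the corrected row-pointer allocation (the helper name and the error handling are mine, just for illustration):
Code:
#include <stdlib.h>

/* Sketch: allocate an nrows-by-ncols matrix as an array of row pointers.
   Note the sizeof(double *) for the pointer array itself. */
static double **AllocMatrix(int nrows, int ncols)
{
  int ii;
  double **mat = (double **) malloc(nrows * sizeof(double *));
  if (mat == NULL)
    return NULL;
  for (ii = 0; ii < nrows; ii++) {
    mat[ii] = (double *) malloc(ncols * sizeof(double));
    if (mat[ii] == NULL)
      return NULL; /* a real version would free what was already allocated */
  }
  return mat;
}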

BTW, I haven't looked through the details of your actual matrix product logic, nor do I know best technique for distributing a matrix product. But even trivial thought regarding data flow should tell you it is lame to parallelize the problem by dividing into groups of rows. You should parallelize by dividing the solution matrix into nearly square rectangular chunks.

Last edited by johnsfine; 04-13-2011 at 04:40 PM.
 
Old 04-13-2011, 08:21 PM   #4
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by johnsfine View Post
So I think you could have explained a few of the basic concepts the OP seems to have misunderstood or pointed to some of the key spots in the doc that a beginner might easily miss. That still isn't doing the exercise.
You're right. The OpenMPI manual pages have much better descriptions and explanations of each function, and have examples for most of the functions. There is a very good MPI tutorial at Lawrence Livermore.

Quote:
Originally Posted by johnsfine View Post
nor do I know best technique for distributing a matrix product. But even trivial thought regarding data flow should tell you it is lame to parallelize the problem by dividing into groups of rows. You should parallelize by dividing the solution matrix into nearly square rectangular chunks.
I agree. You can then distribute either rectangular parts of the left and right matrices, or full rows of the left matrix and full columns of the right matrix. It'd be quite interesting to compare the total costs of each method..

For very small matrices, say < 20x20, I'd just broadcast the entire left and right side matrices to each process, have them do some of the computational work, into a result matrix which is initialized to zeros. Then you can reduce them using sum operation, to get the actual result matrix -- and it does not matter which process has done which part of the work, as long as all of the computations have been done.
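A minimal sketch of that broadcast-and-reduce idea, assuming all three matrices are stored as contiguous NN*NN arrays of doubles and that first_row/last_row mark this process' share of the rows (those names are mine, not from the code above):
Code:
/* Sketch: AA and BB are broadcast whole; each process fills only its rows
   of a zero-initialized local result, and MPI_Reduce sums the local results
   element-wise into CC on the master. */
MPI_Bcast(AA, NN*NN, MPI_DOUBLE, MASTER, MPI_COMM_WORLD);
MPI_Bcast(BB, NN*NN, MPI_DOUBLE, MASTER, MPI_COMM_WORLD);

memset(my_CC, 0, NN*NN*sizeof(double));
for (ii = first_row; ii < last_row; ii++)       /* this process' rows */
  for (jj = 0; jj < NN; jj++)
    for (kk = 0; kk < NN; kk++)
      my_CC[ii*NN + jj] += AA[ii*NN + kk] * BB[kk*NN + jj];

MPI_Reduce(my_CC, CC, NN*NN, MPI_DOUBLE, MPI_SUM, MASTER, MPI_COMM_WORLD);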

It is always a good idea to think about the operation at hand:
result[row][column] = sum over i: left[row][i] * right[i][column]
Note that transposing the right matrix makes the computations a lot more efficient, since the data is then consecutive in memory.
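As a small serial sketch of that, assuming BBT holds the transpose of the right matrix:
Code:
/* Sketch: CC = AA * BB, with BB stored transposed as BBT, so the inner loop
   reads both operands consecutively in memory. All matrices are NN x NN. */
void MultTransposed(double **AA, double **BBT, double **CC, int NN)
{
  int row, col, i;
  for (row = 0; row < NN; row++) {
    for (col = 0; col < NN; col++) {
      double sum = 0.0;
      for (i = 0; i < NN; i++)
        sum += AA[row][i] * BBT[col][i];
      CC[row][col] = sum;
    }
  }
}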
 
1 members found this post helpful.
Old 04-14-2011, 07:40 AM   #5
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1197
Quote:
Originally Posted by Nominal Animal View Post
I agree. You can then distribute either rectangular parts of the left and right matrices, or full rows of the left matrix and full columns of the right matrix. It'd be quite interesting to compare the total costs of each method.
I was having trouble visualizing an underlying communication system for MPI efficient enough that there could be any savings in elapsed time vs. having one process do the whole job (matrix multiply). The ratio of work to communication goes up some when the matrix is larger. But still not enough to make sense.

I was thinking of a rectangular division of processes into intersecting teams (of course I don't know how hard teams are to set up in MPI). Given the total number of processes, compute the integer square root, then the integer quotient for the two team dimensions. Example: 23 processes. The integer square root of 23 is 4, and 23/4 is 5. So use a 4x5 set of intersecting teams and waste the extra 3 processes.
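In code form, that sizing step could look roughly like this (the names are illustrative only; with 23 processes it gives a 4x5 grid and 3 idle processes):
Code:
/* Sketch: choose the two team dimensions from the number of processes pp.
   For pp = 23: qq = 4, rr = 5, so 20 processes are used and 3 sit idle. */
int qq   = (int) sqrt((double) pp); /* integer square root (floored) */
int rr   = pp / qq;                 /* integer quotient               */
int idle = pp - qq*rr;              /* processes left out of the grid */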

Each process would be a member of one of four teams of 5 and one of five teams of 4. Each team of 5 would multiply all of the A rows by one quarter of the B columns. Each team of 4 would multiply one fifth of the A rows by all the B columns. So each process acting simultaneously in two teams would multiply one fifth of the A rows times one quarter of the B columns.

Based on the OPs design, each process would first invent data for its twentieth of A rows and its twentieth of B columns.

Then the 5 row teams in parallel would each go through 4 team members in sequence to exchange/keep row data, so each member of a row team then had the rows of the entire team.

Then the 4 column teams in parallel would each go through 5 team members in sequence to exchange column data, and as each chunk of column data was exchanged, each process could multiply it by all the rows of its row team and store those results and discard that column chunk before going on to the next.

Then all the results could be handed to the master, which would get them in a strange sequence (by rectangular chunk, rather than by row). If it just needs to display the results, an accessor function that takes row and column numbers and grabs the right element from the chunk-oriented result is enough. If it needs to do significant further work, it would need to reorder the result. But that is likely less expensive than using enough extra MPI communication to deliver the result in its final sequence.

If you add up all the data flow (especially if comm by teams in parallel really is in parallel) you should see enough less total work to more than compensate for idling a few processes.

If the matrix dimension isn't divided evenly by the (adjusted) number of processes, that leads to a two-level special-case setup: dividing the matrix almost evenly by the number of teams, and dividing the result almost evenly by the members per team. That would be twice as tricky as the adjustments the OP got only partially correct in his division by rows in the first post.

Last edited by johnsfine; 04-14-2011 at 11:41 AM.
 
Old 04-14-2011, 11:26 AM   #6
ejspeiro
Member
 
Registered: Feb 2011
Distribution: Ubuntu 14.04 LTS (Trusty Tahr)
Posts: 203

Original Poster
Rep: Reputation: 26
Nominal Animal:

Quote:
Since this is an exercise, I don't want to give you an answer; I want to help you find it yourself. I hope you understand and appreciate that.
Yes, of course I appreciate that

Your hint about whether I should use MPI_Gather or not was really close. I have already solved the entire problem and figured out A LOT of my original mistakes. Most of them came down to two big issues:

- Definition of the data structures used for collective communication calls.
- Memory allocation issues.

Thank you very much for your first post

I am still reading / understanding the other posts so I can give a cool answer to all of them
 
Old 04-14-2011, 01:20 PM   #7
ejspeiro
Member
 
Registered: Feb 2011
Distribution: Ubuntu 14.04 LTS (Trusty Tahr)
Posts: 203

Original Poster
Rep: Reputation: 26
johnsfine:

johnsfine:

Quote:
The documentation is less helpful than I would have expected. There are plenty of details documented, but basic concepts are largely missing or buried.
I agree u.u that is why I have a book for this, and I also used the tutorial from LLNL that Nominal Animal recommended in his second post (awesome recommendation, Nominal Animal, thank you for that).

Quick question to johnsfine:

Quote:
(...) and help the OP.
What is the OP?
 
Old 04-14-2011, 01:58 PM   #8
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1197
Quote:
Originally Posted by ejspeiro View Post
What is the OP?
Original poster. The person who wrote the first post of the current thread (you).

When multiple people answering the same thread start discussing subtopics among themselves, a shorthand way to refer back to whoever started the topic is often needed.
 
1 members found this post helpful.
Old 04-14-2011, 04:45 PM   #9
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by johnsfine View Post
I was having trouble visualizing an underlying communication system for MPI efficient enough that there could be any savings in elapsed time vs. having one process do the whole job (matrix multiply). The ratio of work to communication goes up some when the matrix is larger. But still not enough to make sense.
Matrix multiplication is just an example operation here, often used in distributed computing courses because it is easy to understand, but also illustrates well the difficulties in efficient data partitioning (for distributed computing).

There is also the fact that you should always design MPI programs to do computations while communications are in progress, preparing for the next operation (or even better, the operation after that one). That way the cost of the communication is negligible. I have designed certain spatial partitioning schemes to allow this for classical molecular dynamics simulations, including Monte Carlo MD simulations, but there has been very little interest in it. Usually people just throw more hardware -- low-latency InfiniBand, more computing nodes -- at it rather than fix the logic in their programs. I do understand their concern, however: it does increase complexity, and a lot of programmers (especially physicists) are unfamiliar with concepts like asynchronous I/O. (For spatially distributing Monte Carlo MD simulations I've been thus far unable to retain perfect detailed balance, although the error is extremely small. The distribution mechanism does, for example, totally eliminate all spatial boundary errors.)

Quote:
Originally Posted by johnsfine View Post
I was thinking of a rectangular division of processes into intersecting teams (of course I don't know how hard teams are to set up in MPI).
Extremely easy. You simply agree (between all participating processes in a team) to create a communicator (corresponds to a group of communicating processes) using the MPI_Comm_ functions, and use the team communicators instead of MPI_COMM_WORLD.
There are also MPI_Cart_ functions to create and manage Cartesian communications topologies, so creating the communicators according to your example would be very easy.
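For example, a rough sketch of setting up the 4x5 grid from your example, with a row communicator and a column communicator per process (the variable names are mine):
Code:
/* Sketch: build a 4x5 process grid and derive, for each process, a
   communicator for its grid row (5 members) and its grid column (4 members).
   Processes beyond the 20 grid slots get MPI_COMM_NULL back. */
MPI_Comm grid_comm, row_comm, col_comm;
int dims[2]    = {4, 5};
int periods[2] = {0, 0};   /* no wrap-around */
int remain[2];

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid_comm);

if (grid_comm != MPI_COMM_NULL) {
  remain[0] = 0; remain[1] = 1;  /* keep dimension 1: the 5 processes in my grid row */
  MPI_Cart_sub(grid_comm, remain, &row_comm);

  remain[0] = 1; remain[1] = 0;  /* keep dimension 0: the 4 processes in my grid column */
  MPI_Cart_sub(grid_comm, remain, &col_comm);
}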

Quote:
Originally Posted by johnsfine View Post
So use a 4x5 set of intersecting teams and waste the extra 3 processes.
That is one workable strategy, but note that the wall clock duration of the operation is measured by the slowest team (or member).

It is usually more efficient to assign the work in a smorgasbord fashion: divide the work into many more chunks than there are workers, and have each worker grab the next available chunk and work on it. The downside is of course the loss of data locality, at least in theory.

In practice, if the master process knows (or the workers tell the master) which sub-matrices the worker asking for more work already has locally, it can use a simple greedy algorithm to minimize the number of sub-matrices transmitted. This is much more robust, since if a worker is delayed for any reason, the system will automatically accommodate that.

Funnily enough, the code is not at all complex when using this kind of smorgasbord or worker pool mechanism (even tends to be quite compact!), but the logic is often hard to grasp for someone unfamiliar with the idea.
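A very rough sketch of the shape such a worker pool can take, passing only chunk indices (the sub-matrix transfers and the greedy bookkeeping are left out, and all the names here are made up; in this sketch the master only coordinates):
Code:
/* Sketch: the master hands out chunk indices until none remain, then sends
   a "stop" index (-1). Workers request work, compute, and report back. */
if (my_rank == MASTER) {
  int next = 0, stopped = 0, done;
  MPI_Status status;
  while (stopped < pp - 1) {
    /* A request (or a completed chunk report) from any worker. */
    MPI_Recv(&done, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
    if (next < num_chunks) {
      MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
      next++;
    } else {
      int stop = -1;
      MPI_Send(&stop, 1, MPI_INT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
      stopped++;
    }
  }
} else {
  int chunk, none = -1;
  MPI_Send(&none, 1, MPI_INT, MASTER, 0, MPI_COMM_WORLD); /* ask for work */
  for (;;) {
    MPI_Recv(&chunk, 1, MPI_INT, MASTER, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (chunk < 0)
      break;
    /* ... compute the chunk, then report it and ask for the next one ... */
    MPI_Send(&chunk, 1, MPI_INT, MASTER, 0, MPI_COMM_WORLD);
  }
}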

Quote:
Originally Posted by johnsfine View Post
Then all the results could be handed to the master, which would get them in a strange sequence
That's actually quite a common thing, and MPI has facilities to handle that gracefully.

For example, you can tag each data chunk with an arbitrary integer. On the receiving end, you can set up asynchronous receives for a specific tag (from a specific process or any process), and then just check or wait until they've all completed. You can define a sparse sub-matrix data type, and use that type to receive the data directly into the correct locations of the result matrix. (If the matrix is a contiguous array, you only need one type per different sub-matrix size. If all the sub-matrices are of the same size, one custom type is sufficient.)
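As an illustration of that last point, here is a sketch of receiving one rows x cols sub-matrix straight into its place inside a contiguous NN x NN result array (the sizes, the tag value, and the buffer names are made up):
Code:
/* Sketch: CC here is a contiguous NN*NN array of doubles on the root.
   The "submatrix" type describes `rows` rows of `cols` doubles, with
   consecutive rows NN doubles apart in the full matrix. */
MPI_Datatype submatrix;
MPI_Request  request;
int rows = 3, cols = 2;   /* example chunk size                 */
int row0 = 0, col0 = 2;   /* where the chunk lands inside CC    */
int chunk_tag = 42;       /* arbitrary tag identifying the chunk */

MPI_Type_vector(rows, cols, NN, MPI_DOUBLE, &submatrix);
MPI_Type_commit(&submatrix);

/* Post the receive for this chunk from any worker, then do other work. */
MPI_Irecv(&CC[row0*NN + col0], 1, submatrix, MPI_ANY_SOURCE, chunk_tag,
          MPI_COMM_WORLD, &request);

/* ... later: */
MPI_Wait(&request, MPI_STATUS_IGNORE);
MPI_Type_free(&submatrix);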
 
Old 04-18-2011, 09:41 PM   #10
ejspeiro
Member
 
Registered: Feb 2011
Distribution: Ubuntu 14.04 LTS (Trusty Tahr)
Posts: 203

Original Poster
Rep: Reputation: 26
johnsfine and Nominal Animal:

johnsfine:

Now that I have finished the exercise entirely, I realized where my errors were! For example, that error in the dynamic memory allocation was annoying, and I actually fixed it, so thank you.

I read your proposed parallel approach to the domain decomposition of this problem in post number 5 and yes, it is awesome! The reason I did it the way I did is that the instructor wanted us to do it specifically like that, so I decided to go with the flow

Nominal Animal:

Quote:
Note that transposing the right matrix makes the computations a lot more efficient, since the data is then consecutive in memory.
Yes! I actually implemented storing the transpose of matrix B and wrote a serial version of the code for both cases, transposed and not transposed... the execution time decreases dramatically for matrix sizes up to 4000 x 4000.

I would like to thank both of you guys, since your suggestions really taught me a lot!

For me, this assignment was really illustrative, since I also learned how to use gdb for debugging in parallel environments, which is very useful when the problems get really complicated.

Regards!
 
  

