LinuxQuestions.org - matrix multipication using pthread

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - matrix multipication using pthread (https://www.linuxquestions.org/questions/programming-9/matrix-multipication-using-pthread-829025/)

matrix multipication using pthread

Dear All,

I am trying to do a matrix multiplication using pthread.I have attached the code i have written for it

#include<stdio.h>
#include<pthread.h>
#include<stdlib.h>

void *cal(void *);
pthread_mutex_t mutex;
int a[2][2],b[2][2],c[2][2];

int gol_row;
int gol_col;

main()
{
int i,j;
pthread_t p1[2][2];
//pthread_mutex_init(&mutex,NULL);
fill_mat();
for(i=0;i<2;i++)
for(j=0;j<2;j++)
{
//pthread_mutex_lock(&mutex);
gol_row=i;
gol_col=j;
pthread_create(&p1[i][j],NULL,cal,NULL);
//pthread_mutex_unlock(&mutex);
}
for(i=0;i<2;i++)
for(j=0;j<2;j++)
{
pthread_join(p1[i][j],NULL);
}
display();
}

void *cal(void *t)
{
int row,col,tot,t1,t2,i,j;
printf("i am in cal and my id is %d\n",pthread_self());
//pthread_mutex_lock(&mutex);
printf("gol_row =%d,gol_col=%d\n",gol_row,gol_col);
for(j=0;j<2;j++)
{
t1=+a[gol_row][j]*b[j][gol_row];
}
c[gol_row][gol_col]=t1;
//pthread_mutex_unlock(&mutex);
return (NULL);
}

fill_mat()
{
int i,j,temp;
for(i=0;i<2;i++)
for(j=0;j<2;j++)
{
printf("enter the element for a[%d][%d] ",i,j);
scanf("%d",&temp);
a[i][j]=temp;
}
for(i=0;i<2;i++)
for(j=0;j<2;j++)
{
printf("enter the element for b[%d][%d] ",i,j);
scanf("%d",&temp);
b[i][j]=temp;
}
}

display()
{
int i,j;
printf("A mat is\n");
for(i=0;i<2;i++)
{
for(j=0;j<2;j++)
{
printf("%d\t",a[i][j]);
}
printf("\n");
}
printf("B mat is\n");
for(i=0;i<2;i++)
{
for(j=0;j<2;j++)
{
printf("%d\t",b[i][j]);
}
printf("\n");
}
printf("C mat is\n");
for(i=0;i<2;i++)
{
for(j=0;j<2;j++)
{
printf("%d\t",c[i][j]);
}
printf("\n");
}
}

And here is the output i am getting.

[root@localhost h]# ./mat
enter the element for a[0][0] 1
enter the element for a[0][1] 2
enter the element for a[1][0] 3
enter the element for a[1][1] 4
enter the element for b[0][0] 1
enter the element for b[0][1] 2
enter the element for b[1][0] 3
enter the element for b[1][1] 4
i am in cal and my id is -1208259696
gol_row =1,gol_col=1
i am in cal and my id is -1218749552
gol_row =1,gol_col=1
i am in cal and my id is -1229239408
gol_row =1,gol_col=1
i am in cal and my id is -1239729264
gol_row =1,gol_col=1
A mat is
1 2
3 4
B mat is
1 2
3 4
C mat is
0 0
0 16

I am new to thread programming.Please can anyone help me to debug the program or could guide me where i am going wrong

With regards,
Keerti

Hi -

You can't do this:

Code:

// Globals

int gol_row;

int gol_col;

...



main()

{

  int i,j;

  pthread_t p1[2][2];

  for(i=0;i<2;i++)

    for(j=0;j<2;j++)

    {

      gol_row=i;

      gol_col=j;

      pthread_create(&p1[i][j],NULL,cal,NULL);

      ...



void *cal(void *t)

{

  int row,col,tot,t1,t2,i,j;

  ...

  printf("gol_row =%d,gol_col=%d\n",gol_row,gol_col);

  for(j=0;j<2;j++)

  {

    ...

The reason is that you have no control over what globals "gol_row" and "gol_col" happen to be at the moment your thread is actually started.

SUGGESTION:
Declare a struct, and pass that struct into the thread.

EXAMPLE:

Code:

// Globals

struct thread_rec {

  int gol_row;

  int gol_col;

};

...



main()

{

  int i,j;

  struct thread_rec tr;

  pthread_t p1[2][2];

  for(i=0;i<2;i++)

    for(j=0;j<2;j++)

    {

      tr.gol_row=i;

      tr.gol_col=j;

      pthread_create(&p1[i][j],NULL,cal,&tr);

      ...



void *cal(void *t)

{

  int row,col,tot,t1,t2,i,j;

  struct thread_rec *tr = (struct thread_rec *)t;

  ...

  printf("gol_row =%d,gol_col=%d\n",tr->gol_row,tr->gol_col);

  for(j=0;j<2;j++)

  {

    ...

'Hope that helps

Moved: This thread is more suitable in <PROGRAMMING> and has been moved accordingly to help your thread/question get the exposure it deserves.

If you need gol_row and gol_col to be system globals, then you need to enforce locking and use signaling to enforce thread synchronization in order to get the correct values out of them.

I infer from your code that you expect your thread to run immediately after you start it, before your main code continues. As you can see, this is not happening. I also see that you have provided and commented out some mutex statements - presumably because they were not working the way you intended.

You need to set up a mutex condition variable as a global:

Code:

pthread_cond_t thread_condition_ready = PTHREAD_COND_INITIALIZER;

Then, since you already have a mutex, I'll just use that. At an appropriate place in your main thread (after starting the new thread) you have to wait on the mutex:

Code:

pthread_cond_wait(&thread_condition_ready,&mutex);

This locks the main thread until it is unlocked.
In your child thread, at the end after processing what needs to be processed, you want to release the main thread so it can run:

Code:

pthread_mutex_lock(&mutex);

pthread_cond_signal(&thread_condition_ready);

pthread_mutex_unlock(&mutex);

Now, your events will occur in the order you want. Of course, this renders your multi-threading trivial and moot. But then, clearly your example is not optimum for multi-threading. Presumably you have something more significant in mind, once you get the threading figured out.

Edit:

Oh...also...

Print your process ID as %x rather than as %d.

suggestion about matrix product

jkeertir,

You are not indexing through the rows and columns of your product matrix. You simply initialized those to 1 and 1 and left them there.

You may wish to consult reference material for linear algebra, vectors, and matrices. The product of two matrices, if the two matrices are compatible for multiplication, is not done by simply multiplying the two elements in corresponding positions and assigning the product to the corresponding element in the product matrix. Two matrices are compatible for multiplication when the number of rows in the second matrix equals the number of columns in the first matrix. If this condition is satisfied, then the product matrix will be a matrix with number of rows equal to the number of rows in the first matrix and number of columns equal to the number of columns in the second matrix. The element of the product matrix M_Product at some row Product_Row and some column Product_Column is the summation of all the products of the corresponding elements of the row of the first matrix and the column of the second matrix.
Matrix multiplication is not commutative except by coincidence. That is, A X B is not necessarily equal to B X A.

In your example, with

Code:

matrix A = B = 

1 2

3 4

the product

Code:

unless I made an error in my evaluation arithmetic.

Code:

// for every row of product

for( Product_Row = 0, Product_Row < nRows_A, Product_Row++ )

{ 

    // for every column in the row of the product

    for( Product_Column = 0, 

        Product_Column < nColumnsB, 

        Product_Column++ )

    {

        // initialize product element to 0

        M_Product[Product_Row][Product_Column] = 0;

        // for every element of row,column vectors

        for( nElement = 0; nElement < nColumns_A, nElement++ )

            // add product of element of first matrix row  

            //    times second matrix column

            M_Product[Product_Row][Product_Column] += 

                    MatrixA[Product_Row][nElement] 

                *  Matrix_B[nElement][nProduct_Column];

    }    // for every column in the row of the product

}  // for every row of product

If you are writing a massively multi-threaded program for matrix multiplication in order to take advantage of massively parallel processor architecture, the obvious point to implement multi-threading would be the code that implements the inner multiply-accumulate code that calculates each element in the product.

There isn't that much of good "MIMD"* concurrent multiprocessing hardware to find in the stores. An X86 architecture is not optimal for massively concurrent multiprocessing, no matter how many cores one has to play with.
To make a multithreaded application really run fast and utilize the hardware efficiently, one has to do careful testing/debugging and analysis of the code produced.

*(MIMD - Multiple instruction flow, multiple data flow architecture).

I have tinkered some with Transputer arrays - They really like massively concurrent multiprocessing. Heck, they where built for it. But still, in 99% of the applications it is best practice to divide the processing into "chunks" big enough for a core to bite on for a while. Think of it like this - If each element of processing is smaller than the available rescheduling interval in the system, then it will spend most of its time and resources in rescheduling the processes, not doing any actual work.

Try to make a few threads that at least takes some intervals of rescheduling to complete each subtask. Descheduling more than some 1000 times per second on an X86 architecture is wasteful.

Start one work thread per core available. Or possibly two or three threads per core, depending on if the threads will be waiting for filling of its I/O buffers or waiting for memory to be pages in every once some often. This minimize the overhead work in scheduling, and leaves more cpu time to the actual processing.

I stumbled into such a problem yesterday, in an application handling large amounts of random access disk data. I suddenly got a throughput of about 15 megabytes per second. Not what i had expected. Found out that it was because every time a thread allocated a transfer buffer to store results it was forced to deschedule. Fixed the problem by using overallocation of memory, and let the paging mechanism fill in the pages best effort instead of asking for a buffer all the time. Now the data throughput in the application is around 2 gigabytes per second, which is quite satisfying.

So, in multiprocessing, the old saying still applies: "Keep it simple, stupid".
(meaning no offence)

I concur with the KISS recommendation.

The calculations of the elements of the product matrix would actually work on a SIMD (Single Instruction Multiple Data) multiprocessor machine. All the operations in any particular matrix product are the vector dot product operation on different pairs of vectors all of the same length.

If there is any way to make use of the special multimedia extension instruction set, that might accelerate the operation.

GPUs actually execute such operations, at least multiplying 4-vectors by 4X4 matrices very rapidly and likely 4X4 matrices by other 4X4 matrices to combine minor frame transformation matrices into major transform matrices for 3D rendering.

I have never actually implemented any such parallelized operations. I can imagine that either one would use specialized parallelization libraries, a compiler which performs parallelization optimizations, or perhaps some specialized language with facilities to implement parallelized MAC operations.

That is just my two cents' worth.