[SOLVED] omp[ fails after dist upgrade

piobair · 02-02-2024, 02:00 PM

Linux geeves 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
package: libomp-dev 1:14.0-55.7-deb12u1

Code:

// oomptest.c
// compile as gcc oomptest.c -o oomp -lm -fopenmp

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

int omp_get_max_threads(void);
int omp_get_thread_num(void);

void subroutine(float *subject, float *object, int index){
  object[index] -= subject[0];
}
	

int main(int argc, char* argv[]){
	int i, id;
	float v0[2] = {13579, 24680}, v1[5]={1, 3, 5, 7, 9};

	#pragma omp parallel num_threads(5) private (id)
	{
		id = omp_get_thread_num(); printf("line 22 id=%d\n", id);
		subroutine(&v0[1], &v1[id], id);
	}
	for(i=0; i<5; i++)printf("v1[%d] = %f\n", i, v1[i]);
	
}

output:

Quote:

line 22 id=0
line 22 id=3
line 22 id=1
line 22 id=2
line 22 id=4
v1[0] = -24679.000000
v1[1] = 3.000000
v1[2] = -24675.000000
v1[3] = 7.000000
v1[4] = -24671.000000

expected output:

Quote:

line 22 id=0
line 22 id=3
line 22 id=1
line 22 id=2
line 22 id=4
v1[0] = -24679.000000
v1[1] = -24677.000000
v1[2] = -24675.000000
v1[3] = -24673.000000
v1[4] = -24671.000000

Commentary:
The original program performed a gaussian elimination on a very large, very sparse linear algebra matrix. The matrix consists of two flat files: one file to the left of the equation, and one to the right of the
equation. Each element in those files is a structure with six degrees of freedom. Thus, what is actually being passed to the omp pragma are the leftmost structure in each equation, and also the rightmost structure in each equation.

As compiled, the program runs without fault under valgrind, will run successfully if "number of threads" is reduced to one, but fails for more than one thread.

EdGr · 02-03-2024, 12:33 PM

The test program fails on my machine with gcc 13.2.0. The behavior is not deterministic. Some runs end in segfaults. There is some race problem.

I use pthreads and can't help much with OpenMP.
Ed

ntubski · 02-03-2024, 09:34 PM

Code:

void subroutine(float *subject, float *object, int index){
  object[index] -= subject[0];
}

subroutine(&v0[1], &v1[id], id);

This should be

Code:

subroutine(&v0[1], &v1[0], id);
// or just
subroutine(&v0[1], v1, id);

Alternatively,

Code:

void subroutine(float *subject, float *object, int index){
  (void) index; // index not used
  object[0] -= subject[0];
}

subroutine(&v0[1], &v1[id], id);

Otherwise you are accessing v1[2*id] rather than the v1[id] you intended. (With one thread it's fine, since 2*0 == 0.)

piobair · 02-08-2024, 02:09 PM

So your solution to a buggy omp library is to avoid omp entirely?
I submit that a better solution is to change the number of threads passed to the omp pragma from 5 to one. That will work since the omp library apparently does work for single threads. For my application, that means solving the problem with one thread chunks instead multiple thread chunks. When the library eventually does get fixed, I can simply change the '1' back to "nthreads".

Quote:

Originally Posted by ntubski

Code:

void subroutine(float *subject, float *object, int index){
  object[index] -= subject[0];
}

subroutine(&v0[1], &v1[id], id);

This should be

Code:

subroutine(&v0[1], &v1[0], id);
// or just
subroutine(&v0[1], v1, id);

Alternatively,

Code:

void subroutine(float *subject, float *object, int index){
  (void) index; // index not used
  object[0] -= subject[0];
}

subroutine(&v0[1], &v1[id], id);

Otherwise you are accessing v1[2*id] rather than the v1[id] you intended. (With one thread it's fine, since 2*0 == 0.)

ntubski · 02-08-2024, 06:01 PM

Quote:

Originally Posted by piobair

So your solution to a buggy omp library is to avoid omp entirely?

No, that is not my solution. Sorry for being unclear, I included just the parts of your program that were relevant, to keep things short. That is, I left out the omp parts because you don't need to change them.

You have a bug in your program, the omp library is fine. It just happens that the bug is hidden when you are only using one thread. It is not even a threading bug, it's just about the arithmetic you are doing on the thread ID number.

My solution is to put 0 instead of id or index in one of the highlighted places (choose one of the places, not both).

piobair · 02-13-2024, 03:31 PM

That would work if v0 and v1 were not both pointers to a linked list (the left side of linear algebra).
I tried to keep the program snippet simple. In practice, I am also similarly passing two pointers to a different linked list (the right side of the equation).

Quote:

Originally Posted by ntubski

No, that is not my solution. Sorry for being unclear, I included just the parts of your program that were relevant, to keep things short. That is, I left out the omp parts because you don't need to change them.

You have a bug in your program, the omp library is fine. It just happens that the bug is hidden when you are only using one thread. It is not even a threading bug, it's just about the arithmetic you are doing on the thread ID number.

My solution is to put 0 instead of id or index in one of the highlighted places (choose one of the places, not both).

NevemTeve · 02-13-2024, 03:47 PM

You have a bug in your minimal sample program, so it is plausible that you have bugs in the actual application as well.

piobair · 02-13-2024, 07:10 PM

Please explain to me why I have a bug in my sample program.
My sample program works correctly for id = 1, 3, 5. For id == 1, 3, 5 it reads from both the v0 array and the v1 array correctly.
As I said in my initial posting, this code has worked correctly for the past ten years, having been recompiled at least once for each distribution upgrade.
The only thing that has changed is the version. In that program, I am passing the addresses of two structures in one linked list of structures, and the addresses of two structures in a parallel linked list of structures. My (poor old) CPU will handle up to 8 simultaneous threads.

Just to see what would happen, I changed line 23 to read:

Code:

subroutine(&v0[id%2], &v1[id], id);

(switching back and forth between v0[0] and v0[1])
I only got one correct answer.

Quote:

Originally Posted by NevemTeve

You have a bug in your minimal sample program, so it is plausible that you have bugs in the actual application as well.

EdGr · 02-13-2024, 09:19 PM

The v1 array is being indexed twice - once in the caller and again in the callee (as object). You need to index it once.

The buggy code works only for id==0.

Does the real program have the same bug?
Ed

ntubski · 02-13-2024, 10:19 PM

Quote:

Originally Posted by piobair

I tried to keep the program snippet simple. In practice, I am also similarly passing two pointers to a different linked list (the right side of the equation).

Then you have made it too simple. Please try again using linked lists.

Quote:

Originally Posted by piobair

Please explain to me why I have a bug in my sample program.

Okay, here is your program without omp but still with the same bug:

Code:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

void subroutine(float *subject, float *object, int index){
  object[index] -= subject[0];
}

int main(){
  /* purposely making v1 extra long, so that the program has deterministic behaviour */
  float v0[2] = {13579, 24680}, v1[10]={1, 3, 5, 7, 9};

  for (int id=0; id < 5; id++) { //#pragma omp parallel num_threads(5) private (id)
    printf("line xx id=%d\n", id);
    subroutine(&v0[1], &v1[id], id);
    for(int i=0; i<10; i++)printf("v1[%d] = %f\n", i, v1[i]);
  }
  return 0;
}

The output

Code:

line xx id=0
v1[0] = -24679.000000
v1[1] = 3.000000
v1[2] = 5.000000
v1[3] = 7.000000
v1[4] = 9.000000
v1[5] = 0.000000
v1[6] = 0.000000
v1[7] = 0.000000
v1[8] = 0.000000
v1[9] = 0.000000
line xx id=1
v1[0] = -24679.000000
v1[1] = 3.000000
v1[2] = -24675.000000
v1[3] = 7.000000
v1[4] = 9.000000
v1[5] = 0.000000
v1[6] = 0.000000
v1[7] = 0.000000
v1[8] = 0.000000
v1[9] = 0.000000
line xx id=2
v1[0] = -24679.000000
v1[1] = 3.000000
v1[2] = -24675.000000
v1[3] = 7.000000
v1[4] = -24671.000000
v1[5] = 0.000000
v1[6] = 0.000000
v1[7] = 0.000000
v1[8] = 0.000000
v1[9] = 0.000000
line xx id=3
v1[0] = -24679.000000
v1[1] = 3.000000
v1[2] = -24675.000000
v1[3] = 7.000000
v1[4] = -24671.000000
v1[5] = 0.000000
v1[6] = -24680.000000
v1[7] = 0.000000
v1[8] = 0.000000
v1[9] = 0.000000
line xx id=4
v1[0] = -24679.000000
v1[1] = 3.000000
v1[2] = -24675.000000
v1[3] = 7.000000
v1[4] = -24671.000000
v1[5] = 0.000000
v1[6] = -24680.000000
v1[7] = 0.000000
v1[8] = -24680.000000
v1[9] = 0.000000

I hope that is clear enough.

Quote:

My sample program works correctly for id = 1, 3, 5. For id == 1, 3, 5 it reads from both the v0 array and the v1 array correctly.

Your threads are numbered 0 to 4. You don't have any thread with id = 5. And you should be able to infer from the above that only id = 0 is working correctly. The only part I'm not sure about is why Valgrind didn't catch your writes past the end of the v1 array.

Quote:

As I said in my initial posting, this code has worked correctly for the past ten years, having been recompiled at least once for each distribution upgrade.

Perhaps you are thinking of some other post; you didn't actually say this in your initial posting.

wainamoinen · 02-14-2024, 09:24 AM

Just a piece of advice. I don't know if your original code is similar to the one you show here.

In the code you show, all threads are modifying v1 at the same time, this may logically be Ok, but in terms of hardware if several cores act in the same memory block you may have a "false sharing" performance issue. This is because each core has a copy of the block in a cache line, and if one modifies its copy the other cores go stall until the caches are synchronized.

How bad this is depends on the number of cache levels and the number of processors/cores you are using.

Better explained:
https://en.wikipedia.org/wiki/False_sharing
https://parallelcomputing2017.wordpr...false-sharing/
https://cpp-today.blogspot.com/2008/...its-again.html

piobair · 02-14-2024, 02:29 PM

[QUOTE=ntubski;6483437]Then you have made it too simple. Please try again using linked lists.

Code:

// parallel_test.c
// compile as gcc parallel_test.c -o parallel_test -lm -fopenmp

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

int omp_get_max_threads(void);
int omp_get_thread_num(void);

typedef struct snode snode;
struct snode {int id, vid, mark; float value; struct snode *next;} newnode;

int gaussian(struct snode *line1, struct snode *head){printf("line 14\n");
	float v; 
	struct snode *inode, *jnode, *knode;
	
	printf("line 18\n");
	inode = line1; printf("line 19 line1 = %d\n", line1->vid);
	jnode = head; printf("line 20 line[id] = %d; %f\n", head->vid, head->value); printf("line 20a value = %f\n", head->value);
	v = line1->value / head->value;
	for(;;){
		jnode->value = jnode->value *v -inode->value;
		if(!(inode->mark & 4)) if((inode->next)->vid > (jnode->next)->vid){ // add additional node (jnode->next)->vid = (inode->next)->vnode
			knode = malloc( sizeof( newnode));
			knode->next = jnode->next;
			jnode->next = knode;
			knode->id = (inode->next)->id;
			knode->vid = (inode->next)->vid;
			knode->mark = 0;;
			knode->value = 0;
		}
		while((jnode->next)->vid < (inode->next)->vid) jnode->next = (jnode->next)->next; // skip over unaligned nodes
		if(inode->mark & 4) return(0);
		inode = inode->next;
		jnode = jnode->next;
	}
}
	

int main(int argc, char* argv[]){
	int i, id, nt;
	float v;
	struct snode newnode, *node0, *inode, *jnode, *knode, *nnode, *endnode, **beginning_of_line, *end_of_line, *end_of_line1;
	int available, nthreads;
	
	available = omp_get_max_threads(); // determine number of threads available
	beginning_of_line = malloc(sizeof(snode *) *available);
	
	// The following is a sparce matrix in flat form.
	// For simplicity in this program, "value" has only one degree of freedom.
	// In a real-world problem, "value" would have six degrees of freedom
	// Each line in the matrix starts with beginning_of_line = inode; (inode->mark & 1)
	// node->id is the human readable node number
	// node->vid is virtual node number in the reordered matrix
	//  such that the inode->vid is sequential for lines where (inode->mark & 2)
	// The end of each line in the matrix is marked by (inode->mark & 4)
	
	// These values were taken from a real-world problem.
	// for this example program, only one line of gaussian elimination is performed.
	//  I.e. the first line is subtracted from the subsequent 3 lines.
	//  because the vid of those three lines is the same as the vid of the first line (== 3)
	
	node0 = inode = &newnode;
	inode->id = 15;
	inode->vid = 3;
	inode->value = 3345.225615;
	inode->mark = 3;
	
	for(i=0; i<23; i++){
		jnode = malloc(sizeof newnode);
		inode->next = jnode;
		inode = jnode;
		switch(i){
			case 0:
				jnode->id = 11;
				jnode->vid = 4;
				jnode->value = -410.425615;
				jnode->mark = 0;
				break;
			
			case 1:
				jnode->id = 16;
			  jnode->vid = 5;
				jnode->value = -2900.000000;
				jnode->mark = 0;
				break;
				
			case 2:
				jnode->id = 18;
				jnode->vid = 6;
				jnode->value = -34.800000;
				jnode->mark = 4; // end of algebraic line
				break;
				
			case 3:
				jnode->id = 15;
				jnode->vid = 3;
				jnode->value = -410.425615;
				jnode->mark = 1;
				break;
					
			case 4:
				jnode->id = 11;
				jnode->vid = 4;
				jnode->value = -410.425615;
				jnode->mark = 6;
				break;
					
			case 5:
				jnode->id = 15;
				jnode->vid = 3;
				jnode->value = -2900.000000;
				jnode->mark = 1;
				break;
				
			case 6:
				jnode->id = 16;
				jnode->vid = 5;
				jnode->value = 3345.225615;
				jnode->mark = 2;
				break;
				
			case 7:
				jnode->id = 12;
				jnode->vid = 7;
				jnode->value = -410.425615;
				jnode->mark = 0;
				break;
				
			case 8:
				jnode->id = 17;
				jnode->vid = 8;
				jnode->value = -34.800000;
				jnode->mark = 4; // end of algegraic line
				break;
			
			case 9:
				jnode->id = 15;
			  jnode->vid = 3;
			  jnode->value = -34.800000;
			  jnode->mark = 1;
				break;
				 
			case 10:
				jnode->id = 18;
				jnode->vid = 6;
				jnode->value = 3345.225615;
				jnode->mark = 2;
				break;
				 
			case 11:
				jnode->id = 17;
				jnode->vid = 8;
				jnode->value = -2900.000000;
				jnode->mark = 0;
				break;
				 
			case 12:
				jnode->id = 14;
				jnode->vid = 9;
				jnode->value = -410.425615;
				jnode->mark = 4; // end of algebraic line
				break;
				 
			case 13:
				jnode->id = 16;
				jnode->vid = 5;
				jnode->value = -410.425615;
				jnode->mark = 1;
				break;
				 
			case 14:
				jnode->id = 12;
				jnode->vid = 7;
				jnode->value = 410.425615;
				jnode->mark = 6; // end of algebraic line (mark & 4 is true)
				break;
			
			case 15:
				jnode->id = 16;
				jnode->vid = 5;
				jnode->value = -34.800000;
				jnode->mark = 1;
				break;
				 
			case 16:
				jnode->id = 18;
				jnode->vid = 6;
				jnode->value = -2900.000000;
				jnode->mark = 0;
				break;
				 
			case 17:
				jnode->id = 17;
				jnode->vid = 8;
				jnode->value = 3345.225615;
				jnode->mark = 2;
				break;
				 
			case 18:
				jnode->id = 13;
				jnode->vid = 10;
				jnode->value = -410.425615;
				jnode->mark = 4;
				break;
				 
			case 19:
				jnode->id = 18;
				jnode->vid = 6;
				jnode->value = -410.425615;
				jnode->mark = 1;
				break;
				 
			case 20:
				jnode->id = 14;
				jnode->vid = 9;
				jnode->value = 410.425615;
				jnode->mark = 6;
				break;
			
			case 21:
				jnode->id = 17;
				jnode->vid = 8;
				jnode->value = -410.425615;
				jnode->mark = 1;
				break;
				 
			case 22:
				jnode->id = 13;
				jnode->vid = 10;
				jnode->value = 410.425615;
				jnode->mark = 6;
				break;
		}			
	}
	
	endnode = jnode;
	jnode->next = NULL;
	
	// printout of the matrix
	inode = node0;
	while(inode != NULL){
		printf("%d,%d,%d   ",inode->vid, inode->id, inode->mark);
		if(inode->mark & 4) printf("\n");
		inode = inode->next;
		}

	available = omp_get_max_threads(); // determine number of threads available
	beginning_of_line = malloc(sizeof (snode *) *available);
	
	// find the end of the first line in the matrix
	end_of_line1 = node0;
	while(!(end_of_line1->mark & 4)) end_of_line1 = end_of_line1->next;
	
	// subtract the first matrix line from all subsequent matrix lines
	// as long as the vid of the first line == the vid of the first node of the subsequent line
	// find the beginning of each line subsequent to the first line
	end_of_line = end_of_line1;
	while(end_of_line->next != NULL){ 
		for(nt=0; nt<available; nt++){
			beginning_of_line[nt] = end_of_line->next;
			printf("line 263 node0=%d; beginning[%d]=%d\n", node0->vid, nt, beginning_of_line[nt]->vid);
			if(beginning_of_line[nt]->vid != node0->vid) break;
			end_of_line = beginning_of_line[nt];
			while(!(end_of_line->mark & 4)) end_of_line = end_of_line->next;
			// found the end of line [nt];
			if(end_of_line->vid < end_of_line1->vid){
				// add another node such that line ends match
				// this is necessary to avoid collisions while parallel processing
				inode = malloc(sizeof newnode);
				inode->next = end_of_line->next;
				end_of_line->next = inode;
				end_of_line->mark ^= 4; // node is no longer the end of the line
				inode->mark = 4;
				inode->id = end_of_line1->id;
				inode->vid = end_of_line1->vid;
				inode->value = 0;
				end_of_line = inode;
			}
			if(end_of_line->next == NULL)break;
			if((end_of_line->next)->vid > node0->vid) break;
		}
		printf("line 284 value = %f, %f\n", node0->value, beginning_of_line[0]->value);
		if(nt == 0){printf("line 285 \n"); break;}
		printf("line 286 nt=%d\n", nt);
		
		printf("threads available = %d; threads used = %d\n", available, nt);
		#pragma omp parallel num_threads(nt)
		{
			id = omp_get_thread_num(); printf("line 286 node0=%d>%d, %f; line=[%d] = %d, %f\n", node0->id, node0->vid, node0->value, id, beginning_of_line[id]->vid, beginning_of_line[id]->value);
			gaussian(node0, beginning_of_line[id]);
		}
	}printf("line 294 exiting\n");
}

PS:
In this example, newnode.value is a single floating variable. In practice, "value" is six dimensional (three orthogonal, and three rotational).

NevemTeve · 02-15-2024, 11:26 AM

What do you expect to see?

Code:

$ ./paralell_test
3,15,3   4,11,0   5,16,0   6,18,4
3,15,1   4,11,6
3,15,1   5,16,2   7,12,0   8,17,4
3,15,1   6,18,2   8,17,0   9,14,4
5,16,1   7,12,6
5,16,1   6,18,0   8,17,2   10,13,4
6,18,1   9,14,6
8,17,1   10,13,6
line 263 node0=3; beginning[0]=3
line 263 node0=3; beginning[1]=3
line 263 node0=3; beginning[2]=3
line 284 value = 3345.225586, -410.425629
line 286 nt=2
threads available = 4; threads used = 2
line 286 node0=15>3, 3345.225586; line=[0] = 3, -410.425629
line 286 node0=15>3, 3345.225586; line=[1] = 3, -2900.000000
line 14
line 18
line 19 line1 = 3
line 20 line[id] = 3; -2900.000000
line 20a value = -2900.000000
line 14
line 18
line 19 line1 = 3
line 20 line[id] = 3; 0.000000
line 20a value = 0.000000
line 263 node0=3; beginning[0]=5
line 284 value = 3345.225586, -410.425629
line 285
line 294 exiting

piobair · 02-15-2024, 06:08 PM

(vid)value (vid)value (vid)value (vid)value
line1: (3)3345.225586 (4)-410.425629 (5)-2900.000000 (6)-34.799999
head[0] (3)-410.425629 (4)-410.425629 (new node value = 0)
head[1] (3)-2900.000000 (5)3345.225586 (7)-410.425629 (8)-34.799999
head[2] (3)-34.799999 (6)3345.225586 (8)-2900.000000) (9)-410.425629)

The object is to reduce head[id] (vid == 3)value to zero (gaussian elimination)
Expected results:
For head[0], v = 3345.225586 / -410.425629; (= -8.15063)
(3) = (-410.4 * -8.1506) -3345.2 = 0;
(4) = (-410.4 * -8.1506) -(-410.4) = -2934.8;
(5) = ((blank -> 0) * -8.1505) -(-29000) = 29000;
(6) = ((blank -> 0) * -8.1505) -(-34.8) = 34.8;
for head[1], v = 3345.225586 / -29000; (= -0.11535)
(3) = (-29000 * -0.11535) -3345.2 = 0;
(4) = ((blank ->0) * -0.11535) -(-410.4) = 410.4;
(5) = (3345.2 * -0.11535) -(-29000) = 28614;
(6) = ((blank -> 0) * -0.11535) -(-34.8) = 34.8;
(7) = (-410.4 * -0.11535) -(blank -> 0) = 47.34;
(8) = (-34.8 * -0.11535) -(blank -> 0) = 4.014;
for head[2], v = 3345.225586 / -34.8 (= -96.127)
(3) = (-34.8 * -96.127) -3345.2 = 0;
(4) = ((blank -> 0) * -96.127) -(-410.4) = 410.4;
(5) = ((blank -> 0) * -96.127) -(-29000) = 29000;
(6) = (3345.2 * -96.127) -(-34.8) = 321529;
(7) = ((blank -> 0) * -96.127) -(blank ->0) = 0;
(8) = (-29000 * -96.127) -(blank ->0) = 2787683;
(9) = (-410.4 * -96.127) -(blank ->0) = 39451;

Actual results:

(3)-410.425629 (4)-410.425629 (6)0.000000
(3)-nan (5)-inf (7)inf (8)inf
(3)-34.799999 (6)3345.225586 (8)-2900.000000 (9)-410.425629

For gaussian forward elimination, nodes head[0], head[1] and head[2] would be freed from the matrix
line1 = end_of_line1->next; loop until (line1 == NULL);

Quote:

Originally Posted by NevemTeve

What do you expect to see?

ntubski · 02-20-2024, 10:09 PM

Quote:

Originally Posted by piobair

Actual results:

(3)-410.425629 (4)-410.425629 (6)0.000000
(3)-nan (5)-inf (7)inf (8)inf
(3)-34.799999 (6)3345.225586 (8)-2900.000000 (9)-410.425629

I don't understand what format you are using here, or how this relates to the actual output of the program.