+performance +gcc +elimination
I have triying to make some fast code, when I found out that gcc with O3 did not help med
I have an object r with a getDirInfo. It tells about a property the object has. It must be between 0 and 27. (s is another object) Now I loop over it and tells gcc to unroll loops (c500() should not be unrolled, since they are class members). Now I would except gcc to unroll the d-loop and generate 27 times the code and then in different code eat up nogood ifs (since they are always false) But either I am wrong or I am passing the wrong options to gcc. I use gcc version 3.3.2 and have tried with fun-roll-loop (and a lot of other things) I have not seen the asm code for this. I only know this to be a problem since I have tried experiments with d=26 and it takes somewhat longer than if just the code for d=26 was there. Hope somebody can help. for (int v=0;v<c500();v++) { for (int d=0;d<27;d++) { if (d==r.getDirInfo()) { for (int x=0;x<c500();x++) for (int y=0;y<c500();y++) { r.p.setX(x); r.p.setY(y); bool first = false; bool last = false; realnum hit1,hit2; bool b; if (d==0) { b= s.xmax()<r.p.x() || s.ymax()<r.p.y() || s.zmax()<r.p.z();} if (d==1) { b= s.xmax()<r.p.x() || s.ymax()<r.p.y() || s.zmin()>r.p.z();} if (d==2) { b= s.xmax()<r.p.x() || s.ymin()>r.p.y() || s.zmax()<r.p.z();} if (d==3) { b= s.xmin()> r.p.x() || s.ymin()>r.p.y() || s.zmax()<r.p.z();} //..... and lots of others up to 27. if (b) do something. } } } } |
If I understand what you want -
You do know that getting information about a directory requires disk I/O - which is very slow. Unrolling loops will not help all that much. I can't see enough of your code to know if you're calling a function to get directory information or not. Plus, using code tags would help us read your code. If your program runs very slowly, try compiling for profiling ( -g -p) and then use a profiler like gprof to find where your problem is. |
You must have misunderstood. I am not doing any I/O. I do math.
I have a vector (a,b,c). This vector can have 27 forms a is positive, a is negative and a is zero (same for b and c (= 3*3*3 = 27 possibilities)) Now I want to check something (s) with that vector called r in my code. The code will do that a lot. So it must be real fast. So I create my normal loops so that they can not be unrolled (that means they do not have a known endpoint at compiletime c500() =500 but unknown at compiletime). But the one with the constant 27 I want that one unrolled right away. By that I mean that I want the compiler to write that code 27 times with integer-constants instead of d. Then I want the compiler to realize that 26 of the 27 ifs in the code will always be false and forget about them. There should then only be one true if-statement per unrolled value. But I can see on execution time that that is not happening. It should be so, but I I don't think it is possible (with gcc) :( I can work around this in several ways but unless I use macros (with no chance to debug) or maintain 27 different 'almost the same' code (Or make some code to create code). And I don't like any of the above. Maybe I should contact some of these gcc-guys. |
Okay. I understand - you want to execute one comparison, if (d==?) then do one set of operations per loop, and NOT do the other 26 compares?
Correct? use switch() Code:
switch(d){ |
Another option - (assuming all 27 operations are unique) create 27 functions that handle your operations. Then create an array of function pointers:
Code:
typedef int (*pt2Function)(struct r, int, int); |
Also: heuristics play part in this solution, not just brute force programming
Code:
b= s.xmin()> r.p.x() || s.ymin()>r.p.y() || s.zmax()<r.p.z(); It quits evaluation when it finds a logical true. If you have a priori knowledge that for 40% of the cases Code:
s.zmax()<r.p.z() Also, Code:
s.zmax()<r.p.z() You REALLY should profile this before you spend time worrying about if() statements. if(d==3) takes like 3 opcodes. Each of those class functions takes at least 20 opcodes for housekeeping. |
Thanks.
The case: actually made it much faster. However not as fast as if the compiler had expanded the loop correctly and skiped all the other cases (or if-statements) except the one that was needed. I have tried pointer to funktion and virtuel functions as well, but the call of a function is also expensive. And I really don't want 27 classes or functionpointers. I know the other techniques you tell about. (s.xmax(), r.p.x() (and others) are inlined. In facf they only return a private member variable) Also I don't know anything about what will be more likely. but gcc could improve here. This should be a common way to optimize you code without writting almost the same x (for my case 27) times. C++Boar |
All times are GMT -5. The time now is 03:29 PM. |