LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   -   +performance +gcc +elimination (http://www.linuxquestions.org/questions/programming-9/performance-gcc-elimination-223975/)

C++Boar 08-29-2004 02:54 PM

+performance +gcc +elimination
 
I have triying to make some fast code, when I found out that gcc with O3 did not help med
I have an object r with a getDirInfo. It tells about a property the object has. It must be between 0 and 27. (s is another object)

Now I loop over it and tells gcc to unroll loops (c500() should not be unrolled, since they are class members).

Now I would except gcc to unroll the d-loop and generate 27 times the
code and then in different code eat up nogood ifs (since they are always false)

But either I am wrong or I am passing the wrong options to gcc. I
use gcc version 3.3.2 and have tried with fun-roll-loop
(and a lot of other things)

I have not seen the asm code for this. I only know this to be a problem since I have tried
experiments with d=26 and it takes somewhat longer than if just the code for d=26 was there.

Hope somebody can help.



for (int v=0;v<c500();v++)
{
for (int d=0;d<27;d++)
{
if (d==r.getDirInfo())
{
for (int x=0;x<c500();x++)
for (int y=0;y<c500();y++)
{
r.p.setX(x);
r.p.setY(y);
bool first = false;
bool last = false;
realnum hit1,hit2;

bool b;
if (d==0) { b= s.xmax()<r.p.x() || s.ymax()<r.p.y() || s.zmax()<r.p.z();}
if (d==1) { b= s.xmax()<r.p.x() || s.ymax()<r.p.y() || s.zmin()>r.p.z();}
if (d==2) { b= s.xmax()<r.p.x() || s.ymin()>r.p.y() || s.zmax()<r.p.z();}
if (d==3) { b= s.xmin()> r.p.x() || s.ymin()>r.p.y() || s.zmax()<r.p.z();}
//..... and lots of others up to 27.

if (b)
do something.

}
}
}
}

jim mcnamara 08-30-2004 11:56 AM

If I understand what you want -

You do know that getting information about a directory requires disk I/O -
which is very slow. Unrolling loops will not help all that much. I can't see enough of your code to know if you're calling a function to get directory information or not.

Plus, using code tags would help us read your code.

If your program runs very slowly, try compiling for profiling ( -g -p) and then use a profiler like gprof to find where your problem is.

C++Boar 08-30-2004 01:15 PM

You must have misunderstood. I am not doing any I/O. I do math.
I have a vector (a,b,c). This vector can have 27 forms
a is positive, a is negative and a is zero (same for b and c (= 3*3*3 = 27 possibilities))

Now I want to check something (s) with that vector called r in my code.
The code will do that a lot. So it must be real fast.

So I create my normal loops so that they can not be unrolled (that means they do not have a known endpoint at compiletime c500() =500 but unknown at compiletime). But the one with the constant 27 I want that one unrolled right away.

By that I mean that I want the compiler to write that code 27 times with integer-constants instead of d. Then I want the compiler to realize that 26 of the 27 ifs in the code will always be false and forget about them. There should then only be one true if-statement per unrolled value.
But I can see on execution time that that is not happening.

It should be so, but I I don't think it is possible (with gcc) :(

I can work around this in several ways but unless I use macros (with no chance to debug) or maintain 27 different 'almost the same' code (Or make some code to create code). And I don't like any of the above.

Maybe I should contact some of these gcc-guys.

jim mcnamara 08-30-2004 03:47 PM

Okay. I understand - you want to execute one comparison, if (d==?) then do one set of operations per loop, and NOT do the other 26 compares?

Correct?

use switch()
Code:

switch(d){
    case 0: { b= s.xmax()<r.p.x() || s.ymax()<r.p.y() || s.zmax()<r.p.z(); break;}
    case 1: { b= s.xmax()<r.p.x() || s.ymax()<r.p.y() || s.zmin()>r.p.z(); break;}
    case 2: { b= s.xmax()<r.p.x() || s.ymin()>r.p.y() || s.zmax()<r.p.z(); break;}
    case 3: { b= s.xmin()> r.p.x() || s.ymin()>r.p.y() || s.zmax()<r.p.z();break;}
    case ...
    case 27:{.........}
    default:
          break;
}


jim mcnamara 08-30-2004 03:55 PM

Another option - (assuming all 27 operations are unique) create 27 functions that handle your operations. Then create an array of function pointers:
Code:

typedef int (*pt2Function)(struct r, int, int);
pt2Function funcs[27];
...............
for(d=0;d<27){
...............
  func[d](arg1,arg2,arg3);
..........
}


jim mcnamara 08-30-2004 04:15 PM

Also: heuristics play part in this solution, not just brute force programming
Code:

b= s.xmin()> r.p.x() || s.ymin()>r.p.y() || s.zmax()<r.p.z();
The compiler evaluates logical or expressions from left to right.
It quits evaluation when it finds a logical true. If you have a priori knowledge that for 40% of the cases
Code:

s.zmax()<r.p.z()
is more likely to evaluate true, put it first. If you know that it is more unlikely, put it last.

Also,
Code:

s.zmax()<r.p.z()
is calling functions, you definitely want to get each of those f(x)'s inlined. Big time.

You REALLY should profile this before you spend time worrying about if() statements. if(d==3) takes like 3 opcodes. Each of those class functions takes at least 20 opcodes for housekeeping.

C++Boar 09-03-2004 12:48 AM

Thanks.

The case: actually made it much faster. However not as fast as if the compiler had expanded the loop correctly and skiped all the other cases (or if-statements) except the one that was needed.

I have tried pointer to funktion and virtuel functions as well, but the call of a function is also expensive. And I really don't want 27 classes or functionpointers.

I know the other techniques you tell about. (s.xmax(), r.p.x() (and others) are inlined. In facf they only return a private member variable) Also I don't know anything about what will be more likely.

but gcc could improve here. This should be a common way to optimize you code without writting almost the same x (for my case 27) times.

C++Boar


All times are GMT -5. The time now is 03:11 PM.