Strange performance

Marius2 · 12-27-2004, 07:32 AM

Your image data doesn't happen to lie in the graphics adapter memory? Reading
over the AGP bus is much slower than from RAM, I think four times as slow is
a plausible value here.

Marius2 · 12-27-2004, 07:37 AM

Oops. Or does the image happen to lie to the graphics adapter :-)? I meant:
... doesn't happen to reside in the graphics adapter memory?

leonidg · 12-27-2004, 07:41 AM

Quote:

Originally posted by Marius2
Your image data doesn't happen to lie in the graphics adapter memory? Reading
over the AGP bus is much slower than from RAM, I think four times as slow is
a plausible value here.

To tell You the trugh I have no idea. is there an option to check it? I am using some libraries like sdl and openCV but the data is created with simple malloc

wapcaplet · 12-27-2004, 12:57 PM

Quote:

Originally posted by leonidg
To tell You the trugh I have no idea. is there an option to check it? I am using some libraries like sdl and openCV but the data is created with simple malloc

If it's created with malloc, then it probably resides in main memory (on the heap, to be more specific), unless you're using functions to specifically utilize memory in the graphics adapter (which isn't likely, since nowadays most of that stuff seems to be handled by the video driver).

Quote:

Friends, wapcaplet, I have another hint that might guide You helping me with my problem. I have tried to calculate only the time for the for-loop, it is indeed requires 4 msec. Now what I tried is copying my arrays to another temporary arrays and then making the difference on the temporary arrays and the performance of the for loop became 1 msec. So now I believe it is for sure a problem with my arrays locations.

Maybe you should show some code on this one... I'm not too sure what you mean by copying the arrays to temporary locations. You mean you're making complete copies of each of the three arrays before entering the for loop?

Quote:

It has a simple function cvSub which makes what I need. And what can I say, in all the places where I tested it it works less than 1 msec, and here it works almost 4

So what does the cvSub function do differently? How is it different from your function? Find the answer to that, and I think you'll find out why yours is slower.

I tried a revised version of itsme86's test code:

Code:

#include <stdio.h>
#include <sys/time.h>

#define MAX 1000

int x[640*480], y[640*480], z[640*480];
void tvdiff(struct timeval *tv1, struct timeval *tv2)
{
  tv2->tv_sec -= tv1->tv_sec;
  tv2->tv_usec -= tv1->tv_usec;
  if(tv2->tv_usec < 0)
  {
    tv2->tv_usec += 1000000;
    tv2->tv_sec++;
  }
}

void way_one()
{
  int i;

  int *a = &x[0];
  int *b = &y[0];
  int *c = &z[0];
  for(i = 0;i < 640*480;i++)
    a[i] = b[i] - c[i];
}

void way_two()
{
  int i;
  int *a = &x[0];
  int *b = &y[0];
  int *c = &z[0];

  for(i = 0;i < 640*480;i++, a++, b++, c++)
    *a = *b - *c;
}

int main(void)
{
  struct timeval start, end, start_empty, end_empty;
  int i;
  double one_time;
  
  gettimeofday(&start, NULL);
  for ( i = 0; i < MAX; i++ )
    {
      way_one();
    }
  gettimeofday(&end, NULL);
  tvdiff(&start, &end);
  one_time = 1.0 * ( end.tv_sec * 1000000 + end.tv_usec ) / MAX; 
  printf("way one diff: %f microsec\n", one_time);

  gettimeofday(&start, NULL);
  for ( i = 0; i < MAX; i++ )
    {
      way_two();
    }
  gettimeofday(&end, NULL);
  tvdiff(&start, &end);
  one_time = 1.0 * ( end.tv_sec * 1000000 + end.tv_usec ) / MAX; 
  printf("way one diff: %f microsec\n", one_time);

  return 0;
}

This example seems a little closer to what you are doing (since the arrays are created outside the function, and only referenced inside the function, and since it compares array-style indexing with pointer arithmetic).

Output, compiling with -O2 (no optimization) and 1000 iterations:

Code:

way one diff: 7559.681000 microsec
way one diff: 6582.963000 microsec

way one diff: 7419.159000 microsec
way one diff: 7998.531000 microsec

way one diff: 6973.006000 microsec
way one diff: 7994.543000 microsec

With 5000 iterations:

Code:

way one diff: 5659.415800 microsec
way one diff: 6140.025200 microsec

So it looks like it can vary quite a bit, but I am getting the same general range as you, with 5-8 milliseconds per iteration. Again, look at how other programs do it, and see what they are doing differently. I'm convinced that there is a better, more efficient way to do it, but it may take some experimenting to find it. Try some different approaches to doing blocks of the array at once (I tried doing one line at a time, copying the intermediate results to a temporary array, but it ended up taking a little longer). See if you can find any way to combine pixels that are the same color. Try using XOR or other bitwise operations instead of subtraction. Try doing an in-place difference (i.e., difference between A and B, putting the results back in A). There are a lot of ways to approach it, and some are bound to be better than others.

Edit: in-place difference seems to be marginally faster; I'm getting about a 1-2msec reduction by doing a[i] = a[i] - b[i]. If you can rearrange things so this is possible, it may save you some time.

itsme86 · 12-27-2004, 01:26 PM

Just for curiosity's sake I took a peek at the cvSub() source code which I found here: http://www.itu.dk/stud/projects_f200...c/cvarithm.cpp

It looks like they do:

Code:

{
  const float* src1data = (const float*)(src1->data.ptr);
  const float* src2data = (const float*)(src2->data.ptr);
  float* dstdata = (float*)(dst->data.ptr);
                
  do
  {
    dstdata[size.width-1] = (float)
      (src1data[size.width-1] - src2data[size.width-1]);
  }
  while( --size.width );

  EXIT;
}

I'm guessing? Maybe that's helpful to you?

wapcaplet · 12-27-2004, 05:07 PM

Quote:

Originally posted by itsme86
Just for curiosity's sake I took a peek at the cvSub() source code which I found here:

I tried finding that with no success. Interesting, though; it starts at the end and works its way to the beginning. Good idea, using 0 as the exit condition, since it probably saves a comparison for every loop iteration. Instead of doing a comparison (which, in assembly, could translate to two fetches, a subtraction, and then a zero-test), it's just a zero-test (which means just one fetch, then a zero-test). That may be the kind of thing that an optimizing compiler could do for you, but it's a good example of how understanding the lower-level operations can help you improve efficiency.

Testing this approach with the code above, though, it takes a little bit longer--about 3-5 msec longer than the alternative. So I guess my theory is not true in this case

wapcaplet · 12-27-2004, 05:22 PM

I just found a really good optimization for you. It may be really good or not, depending on the contents of your arrays. It relies upon my earlier suggestion of doing in-place difference, but check this out:

Code:

void way_one()
{
  int i;
  int *a = &x[0];
  int *b = &y[0];
  int *c = &z[0];

  for(i = 0;i < 640*480;i++)
    a[i] = b[i] - c[i];
}

void way_two()
{
  int i;
  int *a = &x[0];
  int *b = &y[0];
  int *c = &z[0];

  for(i = 0;i < 640*480;i++)
    {
      if ( c[i] )
        a[i] -= c[i];
    }
}

And the output (for 5000 iterations):

Code:

way one diff: 5867.575600 microsec
way one diff: 1998.949400 microsec

Here, there's no point in subtracting c[i] from a[i] if c[i] is zero. So, c[i] is tested for zeroness first. It's an extra test condition, but it appears to run faster, especially on an array of all zeros (which these arrays happen to be). Even with an array that is say, half full of zeros, it could give you a big speedup. Testing it on arrays with a bunch of numbers in them (half the elements 0, half of them something else), I get these results:

Code:

way one diff: 8216.811000 microsec
way one diff: 6675.310200 microsec

Still, a decent speed-increase. If you can't do in-place difference (that is, you have to do a = b - c), then you could include some extra conditions to avoid unnecessary calculation. Here, I've accounted for a few different possibilities:

Code:

void way_two()
{
  int i;
  int *a = &x[0];
  int *b = &y[0];
  int *c = &z[0];

  for(i = 0;i < 640*480;i++)
    {
      if ( c[i] )
        if ( b[i] )
          a[i] = b[i] - c[i];
        else
          a[i] = -c[i];
      else
        a[i] = b[i];
    }
}

It looks like a lot of extra code, but using some tests to avoid unnecessary calculation (and then handling it appropriately), a little time is saved. Here are my results with this approach:

Code:

way one diff: 8656.376400 microsec
way one diff: 7822.348200 microsec

Again, a minor speedup. You might try experimenting with this sort of thing, to see if it helps you.

leonidg · 12-28-2004, 01:30 AM

Hello, friends, thank You for such nice replies, I see that I really have a lot to learn from You, but I think we turned a little from my question. I believe that I have a problem with my array and not the code of the function, so even if we find now the best optimized function and it would solve the problem, the problem may appear again in another place, so much more important for me to find in what array is the problem and how to repair it.
I am sorry that I have not given You more details and that is the reason I see some suggestions that I cannot really implement, it is just a bit secret project and I preferred not to talk about the details, but let me give You some more information:
The two arrays we are making difference (one from another) are two different captures from camera, so there is almost no chance that will be much zero values. The two arrays should not be changed during the whole loop of the program, for possible other processings with them, so in-place difference is not an issue as well. The difference array is an important array which is created in the problematic function and is also used during the whole problem.
So as You can understand the 3 arrays, are allocated in the beginning of the program and just refilled each time there is new data from camera, that is what frightens me, because if one of them has a problem it will have it during the whole program.
Above are also the reasons why I cannot check your code for 1000 iterations, because the difference should be done once for a loop, and if it is done in another place it would as fast as I need, but would not answer my question. What I can do and did, is calculating the function time for more than one loop and making average.
Last issue, regarding the new arrays that I have created, was just allocating new arrays, and copying all the data with memcpy, to the new arrays and then testing the function with them, and as expected it took 1 msec. but the problem that allocating and copying 3 arrays takes too much time so it can not be a solution, just a hint to a problem.
By the way, in a chance I would like to ask You some questions about gpu, cause it is one of the issues I want to try next, and have no idea about how to use it.
Thank You a lot for your patience,
Leonid

leonidg · 12-28-2004, 01:39 AM

One issue I have not answered was about the cvSub function. I will try now the function You have found, but it is for sure not the function I was talking about, You can find my function only by downloading openCV, which is easy to find, and looking in it's code. But their code is not the best for changing, maybe I should try it once, but as I see it now, a lot of strange code lines that I cannot understand why we need them, and obviously that is what giving the fast performance, but again, the same function with my problematic arrays gave more than 3 msec, which is bad for me.

itsme86 · 12-28-2004, 01:39 PM

When you mentioned memcpy() it made me remember how much more efficient x86 processors are at dealing with ints than dealing with other data types. So I thought maybe I could trick it and only work with ints so I came up with this:

Code:

#define R1(x) ((x) & 0xFFFF)
#define R2(x) (((x) & 0xFFFF0000) >> 16)

#define L1(x) (x)
#define L2(x) ((x) << 16)

void way_four(unsigned short *a, unsigned short *b, unsigned short *c)
{
  int *d = (int *)a, *e = (int *)b, *f = (int *)c;
  int i;

  for(i = 0;i < 640*480/2;i++, d++, e++, f++)
    *f = L1(R1(*d) - R1(*e)) | L2(R2(*d) - R2(*e));
}

It works but, sadly, it's about 50% less efficient than other methods posted. So to anyone out there that thought of doing the same thing, good luck with it

wapcaplet · 12-28-2004, 05:32 PM

Quote:

Originally posted by leonidg
So as You can understand the 3 arrays, are allocated in the beginning of the program and just refilled each time there is new data from camera, that is what frightens me, because if one of them has a problem it will have it during the whole program.

OK, that's a legitimate concern. There may still be some optimizations to be had in the algorithm itself, though; those can benefit you regardless of other problems that may be present in storing the arrays themselves.

Quote:

Last issue, regarding the new arrays that I have created, was just allocating new arrays, and copying all the data with memcpy, to the new arrays and then testing the function with them, and as expected it took 1 msec. but the problem that allocating and copying 3 arrays takes too much time so it can not be a solution, just a hint to a problem.

True, but I don't know that this necessarily points to a problem with how the arrays are allocated. Since you are utilizing a direct pointer to the arrays' addresses, it should be just as efficient to use that pointer as to use a local (to that function) copy of the array data. It's conceivable that the local copies are somehow easier to access than the global arrays, but I suspect that the faster access has more to do with how well the CPU cache is being used. When you make a complete copy of the three arrays, you're probably causing most of the array data to be pulled into the CPU cache, where it can be accessed more quickly.

One way to test this theory is to test your for-loop twice; see if the second run goes faster than the first. If they are about the same, then you can probably rule out caching as the reason for the speedup.

You have mentioned that an earlier version of your program was able to execute in 1msec, but I don't think you specified what differences there are between the 4msec version and the 1msec version. If you are consistently getting slower results with the newer version, then something in your code is causing it. Find the differences in your code, and you will find the possible causes for the slowdown. Use the 'diff' utility to help you compare different versions of the source code. Are the arrays being allocated differently? Are they being filled with different data, or different amounts of data? You can throw all the tests you want at the code, but finding the differences, in the code, between the 1msec version and the 4msec version is the only foolproof way to find out why one is slower than the other.

Quote:

By the way, in a chance I would like to ask You some questions about gpu, cause it is one of the issues I want to try next, and have no idea about how to use it.

I'm afraid I don't know anything about GPUs. And honestly, what I know about CPUs, registers, and caches is very basic and general; my best suggestion for learning about these issues on your own is to study assembly language.

leonidg · 12-29-2004, 01:30 AM

Hello, wapcaplet, friends,

Quote:

One way to test this theory is to test your for-loop twice; see if the second run goes faster than the first

I have tested it, in the same function, I have wrote 2 loops, one is taking 1 msec (temp arrays), the second 4 msec. More interesting I have tried to give to that function another global arrays, it took 1 msec. I am thinking now about the worst solution of making two copies of everything in the beginning of the program and, making the system test itself sometimes and if some parts are wrong to change copy.

Quote:

Find the differences in your code, and you will find the possible causes for the slowdown

As I have mentioned earlier, the difference between the version is in functions that have no connection to that function, maybe except the fact that they are in the same class. I have added some functions to the class, for dealing with other, color 24 bit images, and also added a lot of comments, but that is I believe for sure should not be a problem.
All the arrays are allocated in the same way. These 3 arrays are class variables.

Quote:

When you make a complete copy of the three arrays, you're probably causing most of the array data to be pulled into the CPU cache

Is there a chance that because of some changes in other parts of program the same arrays are not pulled into CPU cache because of the changes?

Quote:

I'm afraid I don't know anything about GPUs

What I ment is programming using openGL or something similar, which as I have heard makes some parts of code much faster, but anyway I believe it would be a good idea to open a new thread for that task.
Thanks,
Leonid

Marius2 · 12-29-2004, 02:43 AM

Quote:

Originally posted by leonidg
Hello, wapcaplet, friends,

I have tested it, in the same function, I have wrote 2 loops, one is taking 1 msec (temp arrays), the second 4 msec. More interesting I have tried to give to that function another global arrays, it took 1 msec. I am thinking now about the worst solution of making two copies of everything in the beginning of the program and, making the system test itself sometimes and if some parts are wrong to change copy.

I think wapcaplet meant something like this:
int c=0;
for(c=0;c<2;c++){
//Initialize pointers to arrays
.
//get a msec accurate time now
t_start=...
//Do the 4ms version
//get msec accurate time again
t_end=...
printf("Execution #%d took %d ms\n",c,(t_end-t_start));
}

If the second run is significantly faster than the first run, this would probably point to the cpu cache
as an explanation of the 3ms gap. BTW my personal (wild) guess is that the problem is within
the code the compiler generates, I suspect it somehow f**** up the addressing of the arrays.
My suggestion: write the inner loop of your function with the inline assembler, use something like
mov ESI, ptr_src_array1
mov EAX,ptr_src_array2
mov EDI,ptr_dst_array
mov EDX,element_counter
loop1:mov EBX,[ESI]
mov ECX[EAX]
sub EBX,ECX
mov [EDI],EBX
add ESI,2
add EAX,2
dec EDX
jne loop1

Marius2 · 12-29-2004, 02:50 AM

If at first you don't succeed... you're a programmer. Of course it should read

mov ESI, ptr_src_array1
mov EAX,ptr_src_array2
mov EDI,ptr_dst_array
mov EDX,element_counter
loop1:mov EBX,[ESI]
mov ECX[EAX]
sub EBX,ECX
mov [EDI],EBX
add ESI,4
add EAX,4
add EDI,4
dec EDX
jne loop1

or read into BX and CX and add 2 to ESI,EAX,EDI for sixteen bpp.

leonidg · 12-29-2004, 03:17 AM

Dear Marius2,
Thanks for your effort to help. I believe that assembler may solve a lot of problems, the problem is that your code is the first code in assembler I ever seen, and unfortunately, I have no idea what to do with it. In what place to copy and how to compile and/or run. I would really like to learn more about assembler, but it is future plans, for now I don't have time for that. But from other side I really liked your explanation about "f**** up", I think it is the reason, and I wonder if there is a possibility to check it somehow.
Thanks
Leonid