Strange performance

leonidg · 12-26-2004, 04:36 AM

Hello.
I am using fedora core 1, programming a large system in c++.
Recently I have noticed that one part of my program is taking too much time, and I have no idea how to fix it. The part is responsible for making difference image between 2 images.
The code is:
-------------------------------------------------------------------------
void
Frame::MakeDiff() {
//substract frame1 from frame 0
int w,h,bpp;
int i,len;
unsigned short *s0,*s1, *sd;

mFr0->GetSizes(w,h,bpp);//w=640, h=480, bpp=12
len = w*h;
s0 = (unsigned short*)mFr0->GetPtr();//pointer to im1 data
s1 = (unsigned short*)mFr1->GetPtr(); pointer to im2 data
sd = (unsigned short*)mFrD->GetPtr();pointer to im dif data
for (i=0; i<len; i++,s0++,s1++,sd++)
*sd = *s0 - *s1;
}
-------------------------------------------------------------------------
running the code takes 10 msec.
If I change last 2 lines of the function to:
-------------------------------------------------------------------------
for(i=0; i<len; i++)
sd[i] = s0[i] - s1[i];
-------------------------------------------------------------------------
the function takes 4 msec. It is strange for me too, because seems that the same job is done.
Originally the function should take and took 1 msec on my system. I believe that maybe the problem is in adding other parts of my system which are not connected to that function but I have no idea how to find and deal it.
Please help me if You have any ideas what could cause such problems and/or You know how to deal it, or maybe any programs with memory which will guide me to the root of the problem.
Thanks in Advance,
Leonid

aluser · 12-26-2004, 02:10 PM

Are you using any optimization flags? It could be that adding -O2 or -O3 to your compile will cause the two to compile to the same thing. Your second way is more readable anyway, so if it's at least as fast I'd stick to it.

I almost want to say that there are x86 instructions which take advantage of using an array plus offset, which would explain why the second way is faster -- I'm not sure if I remember that correctly though.

In any case, if you aren't using optimization flags now you may find that -O3 and/or -funroll-loops makes this sort of code a lot faster.

wapcaplet · 12-26-2004, 02:17 PM

If the function used to take 1msec, what have you added or changed since then? A loop going through 640*480 iterations is going to take a little while.

The reason this:

Code:

for(i=0; i<len; i++)
  sd[i] = s0[i] - s1[i];

is faster than this:

Code:

for (i=0; i<len; i++,s0++,s1++,sd++)
  *sd = *s0 - *s1;

is probably because you are doing 6 operations per loop iteration in the second case (i++, s0++, s1++, sd++, *s0-*s1, and an assignment), whereas in the first case you are doing only 3 operations per loop iteration (i++, s0[i]-s1[i], and an assignment).

aluser · 12-26-2004, 02:31 PM

I would think that s0[i] - s1[i] should be considered more than "one operation" because it's equivalent to *(s0 + i) - *(s1 + i), yes?

wapcaplet · 12-26-2004, 03:08 PM

Yeah, that's true. I don't know the details of how the compiler would handle that, or whether there's a way to optimize it; even so, it may be the case that the operations involved in array indexing are less costly than the operation of incrementing several pointers.

When using an array index, the compiler may be able to condense *(pointer + offset) into a single assembly instruction, thus making it equivalent, in terms of number of instructions, to using *(pointer) (and probably also equivalent in terms of execution time). Then, the only difference between the two approaches would be the number of other operations done, which is greater in the direct-pointer-arithmetic approach in this situation.

The cache cost of maintaining four incrementing variables (i, s0, s1, and sd) instead of only one (i) may also have an impact; using four, the CPU may not be able to keep all of them in registers, which would mean more I/O from the CPU cache. With only one (i), it can most likely stay in a register, and save time by not going out to the cache as frequently.

leonidg · 12-27-2004, 01:26 AM

seems like your reason, wapcaplet, but still I really want it to work 1 msec. I have other functions with the image, like thresholding which take 1msec, and it doesn't look nice when that function needs much more. Some remarks, I was using -O2 -funroll-loops optimization. I have tried now -O3 it doesn't change that part. The problem is that when I am testing an older version of my program I have there performance of 1msec for that code and I have no idea how to find what causes that change. Another strange thing that I have noticed is if I change only one pointer (not all as I showed) for example:
-------------------------------------
for (i=0; i<len; i++,s1++,sd++)
*sd = s0[i] - *s1;
-------------------------------------
The performance changes from 10 to 4 msec as well. I believe wapcaplet explains it well. But how can I make t back working 1 msec.

Thanks for Your help
Leonid

leonidg · 12-27-2004, 01:34 AM

Quote:

Originally posted by wapcaplet
If the function used to take 1msec, what have you added or changed since then? A loop going through 640*480 iterations is going to take a little while.

The changes I have done have no connection to that part of code. Mostly it is comments which made my project twice the size of the earlier, some changes on gui, and addition to processing functions for other images (color images), but it is not done in the same time so I believe and hope it should not affect.
As I mentioned processing of the same image in other part of the code takes 1 msec (thresholding).
Leonid

itsme86 · 12-27-2004, 01:38 AM

Why not do:

Code:

for (i=0; i<len; i++)
  *sd++ = *s0++ - *s1++;

?

leonidg · 12-27-2004, 01:41 AM

Quote:

Originally posted by itsme86
Why not do:

Code:

for (i=0; i<len; i++) *sd++ = *s0++ - *s1++;

?

I have tried, the same performance of 4 msec. I believe the reason should be in other place, because that code worked faster and works faster now in previous version of my project. My question is have to find what causes the problem, if it is possible.
Thanks,
Leonid

itsme86 · 12-27-2004, 01:51 AM

Here's a little test program I made to test the speed difference:

Code:

itsme@dreams:~/C$ cat spleen.c
#include <stdio.h>
#include <sys/time.h>

void tvdiff(struct timeval *tv1, struct timeval *tv2)
{
  tv2->tv_sec -= tv1->tv_sec;
  tv2->tv_usec -= tv1->tv_usec;
  if(tv2->tv_usec < 0)
  {
    tv2->tv_usec += 1000000;
    tv2->tv_sec++;
  }
}

void way_one(int *a, int *b, int *c)
{
  int i;

  for(i = 0;i < 640*480;i++)
    *a++ = *b++ - *c++;
}

void way_two(int *a, int *b, int *c)
{
  int i;

  for(i = 0;i < 640*480;i++, a++, b++, c++)
    *a = *b - *c;
}

int main(void)
{
  int a[640*480], b[640*480], c[640*480];
  struct timeval start, end;

  gettimeofday(&start, NULL);
  way_one(a, b, c);
  gettimeofday(&end, NULL);
  tvdiff(&start, &end);
  printf("way one diff: %d.%06d\n", end.tv_sec, end.tv_usec);

  gettimeofday(&start, NULL);
  way_two(a, b, c);
  gettimeofday(&end, NULL);
  tvdiff(&start, &end);
  printf("way two diff: %d.%06d\n", end.tv_sec, end.tv_usec);

  return 0;
}

Code:

itsme@dreams:~/C$ ./spleen
way one diff: 0.022265
way two diff: 0.014265

So you can see that the second way (the way I suggested in my previous post) is nearly twice as fast.

EDIT: After compiling with -O3 and -funroll-loops I get:

Code:

way one diff: 0.014055
way two diff: 0.011169

leonidg · 12-27-2004, 02:00 AM

Again, Thanks for your effort to help, but I believe the problem is not in the code, I have tried all the variants, and the best I can get is 4 msec, when all that variants in previous version give 1 msec. I just need to know if it is possible to find the reason, maybe using some testing programs.
Thanks,
Leonid

leonidg · 12-27-2004, 03:43 AM

wapcaplet, I noticed that You are familiar with the registers issue. Do You think maybe they are not empty before I begin running the code because of the new added functions to the project, and that is why the program needs more time. Is there an option to check it and maybe to empty them.
By the way itsme86, i have tested your code, the result is about 1600 which mean if I understand right less than 2msec which is good enough for me, but unfortunately it does not happen in my program.
Thanks,
Leonid

wapcaplet · 12-27-2004, 06:14 AM

Quote:

Originally posted by leonidg
Again, Thanks for your effort to help, but I believe the problem is not in the code, I have tried all the variants, and the best I can get is 4 msec, when all that variants in previous version give 1 msec. I just need to know if it is possible to find the reason, maybe using some testing programs.

Well, the problem has to be in the code, though. If that function is exactly the same, then try looking for changes elsewhere that might have affected this function. You should reason each part of the code through to see whether it might be causing the delay; you've obviously already found the for-loop in question that can be optimized. Look at all the other function calls that are occurring in that function, to see if they could be contributing significantly to the time; do any of them have for loops in them? There's no data being passed to this function, but in general you should also look for changes in the function arguments or its input. Variables are...variable

You may also be able to find a more efficient way to approach the problem of doing an image difference. I wouldn't know for sure, but it's possible that doing a simple subtraction on every pixel is not the most efficient way. Maybe it would work faster if you somehow did blocks of pixels, or found a way to skip over pixels that are the same color as previous ones. Try looking at the code for other open-source programs that do this (like the GIMP or ImageMagick), to see if they have a better way to do it.

One way to get more accurate timing for an individual command is something I learned in a college course; put the code you want to test inside a for loop for say, 1000000 iterations. Time it, then time it again with an empty for loop (to find out how much overhead is introduced by the loop itself). Divide them by 1000000 to find averages. Subtract the second from the first; the result is a good estimate of how long the code in question takes, down to much greater resolution than timing only one execution of it. You might discover that, on average, the function takes less than 4msec; the only way to be sure is to do lots of iterations.

To answer your question about registers, their state before entering the function would have little effect on its execution time, since they would be filled appropriately (if your compiler is good at optimization, which gcc certainly is). Filling the registers with needed values one time is very fast, but in my earlier hypothesis, some register is having its data overwritten and re-fetched from cache for every iteration of the loop. Fetching from cache is fast, but not nearly as fast as just using a register (and, if there's some chance your CPU cache is full, it might even go out to RAM, which is like going on vacation for two weeks compared to the speed of the registers). At any rate, there's not much you as the programmer can do to improve register efficiency; gcc is quite good at optimizing that kind of thing, so unless you have a very detailed understanding of how to improve that efficiency (and want to put chunks of assembly language in your code), it is probably not worth your while at this point. With a mission-critical application, like the Linux kernel, say, then it may be worthwhile to get down to that level, but for image compositing, it's probably not going to make much difference (at least in terms of how the end user of your program perceives its performance).

leonidg · 12-27-2004, 06:52 AM

wapcaplet, thanks for your letter, hope I am not waisting too much your time, and who knows, maybe some time I would be able to help You too

.
Some remarks: first of all I am not calculating the time of one run but make average as You said. I have a real time system in which one of the initial tasks is to make difference image between 2 images, so I just calculate for a few seconds time for that loop and then make average, there are no mistakes in it for sure, because that issue was tested and works well in other parts. Also I have decided to try using open-source program named OpenCV (I am using it a lot, and recommend for everyone who deals with imaging). It has a simple function cvSub which makes what I need. And what can I say, in all the places where I tested it it works less than 1 msec, and here it works almost 4

Is there a chance that somehow the arrays are saved in wrong places in the memory (not sure that there is a point in my words)?
Thanks again for your time.
Leonid

leonidg · 12-27-2004, 07:18 AM

Friends, wapcaplet, I have another hint that might guide You helping me with my problem. I have tried to calculate only the time for the for-loop, it is indeed requires 4 msec. Now what I tried is copying my arrays to another temporary arrays and then making the difference on the temporary arrays and the performance of the for loop became 1 msec. So now I believe it is for sure a problem with my arrays locations.
Any ideas how can I deal it?
thanks