LinuxQuestions.org
Latest LQ Deal: Latest LQ Deals
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 12-26-2004, 04:36 AM   #1
leonidg
Member
 
Registered: Dec 2004
Posts: 73

Rep: Reputation: 15
Strange performance


Hello.
I am using fedora core 1, programming a large system in c++.
Recently I have noticed that one part of my program is taking too much time, and I have no idea how to fix it. The part is responsible for making difference image between 2 images.
The code is:
-------------------------------------------------------------------------
void
Frame::MakeDiff() {
//substract frame1 from frame 0
int w,h,bpp;
int i,len;
unsigned short *s0,*s1, *sd;

mFr0->GetSizes(w,h,bpp);//w=640, h=480, bpp=12
len = w*h;
s0 = (unsigned short*)mFr0->GetPtr();//pointer to im1 data
s1 = (unsigned short*)mFr1->GetPtr(); pointer to im2 data
sd = (unsigned short*)mFrD->GetPtr();pointer to im dif data
for (i=0; i<len; i++,s0++,s1++,sd++)
*sd = *s0 - *s1;
}
-------------------------------------------------------------------------
running the code takes 10 msec.
If I change last 2 lines of the function to:
-------------------------------------------------------------------------
for(i=0; i<len; i++)
sd[i] = s0[i] - s1[i];
-------------------------------------------------------------------------
the function takes 4 msec. It is strange for me too, because seems that the same job is done.
Originally the function should take and took 1 msec on my system. I believe that maybe the problem is in adding other parts of my system which are not connected to that function but I have no idea how to find and deal it.
Please help me if You have any ideas what could cause such problems and/or You know how to deal it, or maybe any programs with memory which will guide me to the root of the problem.
Thanks in Advance,
Leonid
 
Old 12-26-2004, 02:10 PM   #2
aluser
Member
 
Registered: Mar 2004
Location: Massachusetts
Distribution: Debian
Posts: 557

Rep: Reputation: 43
Are you using any optimization flags? It could be that adding -O2 or -O3 to your compile will cause the two to compile to the same thing. Your second way is more readable anyway, so if it's at least as fast I'd stick to it.

I almost want to say that there are x86 instructions which take advantage of using an array plus offset, which would explain why the second way is faster -- I'm not sure if I remember that correctly though.

In any case, if you aren't using optimization flags now you may find that -O3 and/or -funroll-loops makes this sort of code a lot faster.
 
Old 12-26-2004, 02:17 PM   #3
wapcaplet
LQ Guru
 
Registered: Feb 2003
Location: Colorado Springs, CO
Distribution: Gentoo
Posts: 2,018

Rep: Reputation: 48
If the function used to take 1msec, what have you added or changed since then? A loop going through 640*480 iterations is going to take a little while.

The reason this:

Code:
for(i=0; i<len; i++)
  sd[i] = s0[i] - s1[i];
is faster than this:

Code:
for (i=0; i<len; i++,s0++,s1++,sd++)
  *sd = *s0 - *s1;
is probably because you are doing 6 operations per loop iteration in the second case (i++, s0++, s1++, sd++, *s0-*s1, and an assignment), whereas in the first case you are doing only 3 operations per loop iteration (i++, s0[i]-s1[i], and an assignment).

Last edited by wapcaplet; 12-26-2004 at 02:22 PM.
 
Old 12-26-2004, 02:31 PM   #4
aluser
Member
 
Registered: Mar 2004
Location: Massachusetts
Distribution: Debian
Posts: 557

Rep: Reputation: 43
I would think that s0[i] - s1[i] should be considered more than "one operation" because it's equivalent to *(s0 + i) - *(s1 + i), yes?
 
Old 12-26-2004, 03:08 PM   #5
wapcaplet
LQ Guru
 
Registered: Feb 2003
Location: Colorado Springs, CO
Distribution: Gentoo
Posts: 2,018

Rep: Reputation: 48
Yeah, that's true. I don't know the details of how the compiler would handle that, or whether there's a way to optimize it; even so, it may be the case that the operations involved in array indexing are less costly than the operation of incrementing several pointers.

When using an array index, the compiler may be able to condense *(pointer + offset) into a single assembly instruction, thus making it equivalent, in terms of number of instructions, to using *(pointer) (and probably also equivalent in terms of execution time). Then, the only difference between the two approaches would be the number of other operations done, which is greater in the direct-pointer-arithmetic approach in this situation.

The cache cost of maintaining four incrementing variables (i, s0, s1, and sd) instead of only one (i) may also have an impact; using four, the CPU may not be able to keep all of them in registers, which would mean more I/O from the CPU cache. With only one (i), it can most likely stay in a register, and save time by not going out to the cache as frequently.
 
Old 12-27-2004, 01:26 AM   #6
leonidg
Member
 
Registered: Dec 2004
Posts: 73

Original Poster
Rep: Reputation: 15
seems like your reason, wapcaplet, but still I really want it to work 1 msec. I have other functions with the image, like thresholding which take 1msec, and it doesn't look nice when that function needs much more. Some remarks, I was using -O2 -funroll-loops optimization. I have tried now -O3 it doesn't change that part. The problem is that when I am testing an older version of my program I have there performance of 1msec for that code and I have no idea how to find what causes that change. Another strange thing that I have noticed is if I change only one pointer (not all as I showed) for example:
-------------------------------------
for (i=0; i<len; i++,s1++,sd++)
*sd = s0[i] - *s1;
-------------------------------------
The performance changes from 10 to 4 msec as well. I believe wapcaplet explains it well. But how can I make t back working 1 msec.
Thanks for Your help
Leonid
 
Old 12-27-2004, 01:34 AM   #7
leonidg
Member
 
Registered: Dec 2004
Posts: 73

Original Poster
Rep: Reputation: 15
Quote:
Originally posted by wapcaplet
If the function used to take 1msec, what have you added or changed since then? A loop going through 640*480 iterations is going to take a little while.
The changes I have done have no connection to that part of code. Mostly it is comments which made my project twice the size of the earlier, some changes on gui, and addition to processing functions for other images (color images), but it is not done in the same time so I believe and hope it should not affect.
As I mentioned processing of the same image in other part of the code takes 1 msec (thresholding).
Leonid
 
Old 12-27-2004, 01:38 AM   #8
itsme86
Senior Member
 
Registered: Jan 2004
Location: Oregon, USA
Distribution: Slackware
Posts: 1,246

Rep: Reputation: 59
Why not do:
Code:
for (i=0; i<len; i++)
  *sd++ = *s0++ - *s1++;
?
 
Old 12-27-2004, 01:41 AM   #9
leonidg
Member
 
Registered: Dec 2004
Posts: 73

Original Poster
Rep: Reputation: 15
Quote:
Originally posted by itsme86
Why not do:
Code:
for (i=0; i<len; i++)
  *sd++ = *s0++ - *s1++;
?
I have tried, the same performance of 4 msec. I believe the reason should be in other place, because that code worked faster and works faster now in previous version of my project. My question is have to find what causes the problem, if it is possible.
Thanks,
Leonid
 
Old 12-27-2004, 01:51 AM   #10
itsme86
Senior Member
 
Registered: Jan 2004
Location: Oregon, USA
Distribution: Slackware
Posts: 1,246

Rep: Reputation: 59
Here's a little test program I made to test the speed difference:
Code:
itsme@dreams:~/C$ cat spleen.c
#include <stdio.h>
#include <sys/time.h>

void tvdiff(struct timeval *tv1, struct timeval *tv2)
{
  tv2->tv_sec -= tv1->tv_sec;
  tv2->tv_usec -= tv1->tv_usec;
  if(tv2->tv_usec < 0)
  {
    tv2->tv_usec += 1000000;
    tv2->tv_sec++;
  }
}

void way_one(int *a, int *b, int *c)
{
  int i;

  for(i = 0;i < 640*480;i++)
    *a++ = *b++ - *c++;
}

void way_two(int *a, int *b, int *c)
{
  int i;

  for(i = 0;i < 640*480;i++, a++, b++, c++)
    *a = *b - *c;
}

int main(void)
{
  int a[640*480], b[640*480], c[640*480];
  struct timeval start, end;

  gettimeofday(&start, NULL);
  way_one(a, b, c);
  gettimeofday(&end, NULL);
  tvdiff(&start, &end);
  printf("way one diff: %d.%06d\n", end.tv_sec, end.tv_usec);

  gettimeofday(&start, NULL);
  way_two(a, b, c);
  gettimeofday(&end, NULL);
  tvdiff(&start, &end);
  printf("way two diff: %d.%06d\n", end.tv_sec, end.tv_usec);

  return 0;
}
Code:
itsme@dreams:~/C$ ./spleen
way one diff: 0.022265
way two diff: 0.014265
So you can see that the second way (the way I suggested in my previous post) is nearly twice as fast.

EDIT: After compiling with -O3 and -funroll-loops I get:
Code:
way one diff: 0.014055
way two diff: 0.011169

Last edited by itsme86; 12-27-2004 at 01:53 AM.
 
Old 12-27-2004, 02:00 AM   #11
leonidg
Member
 
Registered: Dec 2004
Posts: 73

Original Poster
Rep: Reputation: 15
Again, Thanks for your effort to help, but I believe the problem is not in the code, I have tried all the variants, and the best I can get is 4 msec, when all that variants in previous version give 1 msec. I just need to know if it is possible to find the reason, maybe using some testing programs.
Thanks,
Leonid
 
Old 12-27-2004, 03:43 AM   #12
leonidg
Member
 
Registered: Dec 2004
Posts: 73

Original Poster
Rep: Reputation: 15
wapcaplet, I noticed that You are familiar with the registers issue. Do You think maybe they are not empty before I begin running the code because of the new added functions to the project, and that is why the program needs more time. Is there an option to check it and maybe to empty them.
By the way itsme86, i have tested your code, the result is about 1600 which mean if I understand right less than 2msec which is good enough for me, but unfortunately it does not happen in my program.
Thanks,
Leonid
 
Old 12-27-2004, 06:14 AM   #13
wapcaplet
LQ Guru
 
Registered: Feb 2003
Location: Colorado Springs, CO
Distribution: Gentoo
Posts: 2,018

Rep: Reputation: 48
Quote:
Originally posted by leonidg
Again, Thanks for your effort to help, but I believe the problem is not in the code, I have tried all the variants, and the best I can get is 4 msec, when all that variants in previous version give 1 msec. I just need to know if it is possible to find the reason, maybe using some testing programs.
Well, the problem has to be in the code, though. If that function is exactly the same, then try looking for changes elsewhere that might have affected this function. You should reason each part of the code through to see whether it might be causing the delay; you've obviously already found the for-loop in question that can be optimized. Look at all the other function calls that are occurring in that function, to see if they could be contributing significantly to the time; do any of them have for loops in them? There's no data being passed to this function, but in general you should also look for changes in the function arguments or its input. Variables are...variable

You may also be able to find a more efficient way to approach the problem of doing an image difference. I wouldn't know for sure, but it's possible that doing a simple subtraction on every pixel is not the most efficient way. Maybe it would work faster if you somehow did blocks of pixels, or found a way to skip over pixels that are the same color as previous ones. Try looking at the code for other open-source programs that do this (like the GIMP or ImageMagick), to see if they have a better way to do it.

One way to get more accurate timing for an individual command is something I learned in a college course; put the code you want to test inside a for loop for say, 1000000 iterations. Time it, then time it again with an empty for loop (to find out how much overhead is introduced by the loop itself). Divide them by 1000000 to find averages. Subtract the second from the first; the result is a good estimate of how long the code in question takes, down to much greater resolution than timing only one execution of it. You might discover that, on average, the function takes less than 4msec; the only way to be sure is to do lots of iterations.

To answer your question about registers, their state before entering the function would have little effect on its execution time, since they would be filled appropriately (if your compiler is good at optimization, which gcc certainly is). Filling the registers with needed values one time is very fast, but in my earlier hypothesis, some register is having its data overwritten and re-fetched from cache for every iteration of the loop. Fetching from cache is fast, but not nearly as fast as just using a register (and, if there's some chance your CPU cache is full, it might even go out to RAM, which is like going on vacation for two weeks compared to the speed of the registers). At any rate, there's not much you as the programmer can do to improve register efficiency; gcc is quite good at optimizing that kind of thing, so unless you have a very detailed understanding of how to improve that efficiency (and want to put chunks of assembly language in your code), it is probably not worth your while at this point. With a mission-critical application, like the Linux kernel, say, then it may be worthwhile to get down to that level, but for image compositing, it's probably not going to make much difference (at least in terms of how the end user of your program perceives its performance).

Last edited by wapcaplet; 12-27-2004 at 06:18 AM.
 
Old 12-27-2004, 06:52 AM   #14
leonidg
Member
 
Registered: Dec 2004
Posts: 73

Original Poster
Rep: Reputation: 15
wapcaplet, thanks for your letter, hope I am not waisting too much your time, and who knows, maybe some time I would be able to help You too.
Some remarks: first of all I am not calculating the time of one run but make average as You said. I have a real time system in which one of the initial tasks is to make difference image between 2 images, so I just calculate for a few seconds time for that loop and then make average, there are no mistakes in it for sure, because that issue was tested and works well in other parts. Also I have decided to try using open-source program named OpenCV (I am using it a lot, and recommend for everyone who deals with imaging). It has a simple function cvSub which makes what I need. And what can I say, in all the places where I tested it it works less than 1 msec, and here it works almost 4
Is there a chance that somehow the arrays are saved in wrong places in the memory (not sure that there is a point in my words)?
Thanks again for your time.
Leonid
 
Old 12-27-2004, 07:18 AM   #15
leonidg
Member
 
Registered: Dec 2004
Posts: 73

Original Poster
Rep: Reputation: 15
Friends, wapcaplet, I have another hint that might guide You helping me with my problem. I have tried to calculate only the time for the for-loop, it is indeed requires 4 msec. Now what I tried is copying my arrays to another temporary arrays and then making the difference on the temporary arrays and the performance of the for loop became 1 msec. So now I believe it is for sure a problem with my arrays locations.
Any ideas how can I deal it?
thanks
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Strange Distro / PC performance problem Clemente Linux - General 2 11-24-2005 02:01 PM
performance exvor Programming 12 08-01-2005 04:41 PM
Strange Cups Performance jtshaw *BSD 1 02-03-2005 12:57 PM
strange, strange alsa problem: sound is grainy/pixellated? fenderman11111 Linux - Software 1 11-01-2004 05:16 PM
Sound Issues with XMMS/ mpg123 strange (strange noises) thegreatbob Linux - Software 0 06-25-2004 03:18 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 04:02 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration