I understand how to benchmark an algorithm with the time function, but what if the algorithm is very fast? Faster than the resolution of the clock? For example, suppose I want to benchmark the following instruction:
int x;
int y;
int z;
x = y + z;
I can't use the standard time function call in C because the resolution of the clock is not fine enough. If I have a 2 GHz CPU this instruction should take around 4 clock ticks (4 x 0.5 ns = 2 ns); I figure with a 2 GHz CPU a single clock tick takes 1/2,000,000,000 s = 0.5 ns. These are just back-of-the-envelope calculations. The logical solution would be to run the instruction in a loop and call that run time LOOP_WITH_INST. I would then run the same loop with no instruction and call that run time LOOP. If I subtract LOOP from LOOP_WITH_INST I get just the run time of the instruction, call it RUN_TIME, executed X times. Using the following equation I can calculate the number of ns that single instruction takes:
(RUN_TIME / X) * 1,000,000,000 = number of ns for that single instruction.
When I try this on my AMD 2400+ running Slackware 10 I don't get anywhere near 2 ns; my final answer is 0.0125 ns. I know I can expect some performance improvement due to the cache, but I didn't expect this much. Any ideas? The code I used is listed below:
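(A minimal sketch of the loop-and-subtract approach described above, not the exact listing from the post; the use of gettimeofday and the iteration count are assumptions.)

#include <stdio.h>
#include <sys/time.h>

#define ITERS 100000000UL           /* X in the formula above (assumed value) */

/* Wall-clock seconds as a double; gettimeofday resolves to about 1 us. */
static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    volatile int x = 0, y = 1, z = 2;   /* volatile so the add isn't optimized away */
    unsigned long i;
    double t0, loop, loop_with_inst;

    t0 = now();
    for (i = 0; i < ITERS; i++)
        ;                               /* empty loop: LOOP */
    loop = now() - t0;

    t0 = now();
    for (i = 0; i < ITERS; i++)
        x = y + z;                      /* loop with the instruction: LOOP_WITH_INST */
    loop_with_inst = now() - t0;

    /* (RUN_TIME / X) * 1,000,000,000 */
    printf("ns per instruction: %f\n", (loop_with_inst - loop) / ITERS * 1e9);
    return 0;
}

Compile with gcc -O0 so the empty loop isn't deleted outright.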
I'm not sure why you think an add will take 4 clock ticks. In fact, if you do enough back-to-back adds it should average out to more like 1 clock tick, because of the pipelining going on in the processor.
Also... make sure you run gcc -S on that to see what the assembly output is, and make sure the compiler isn't optimizing all those instructions out.
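For example (assuming the source file is named bench.c; the name is just for illustration):

gcc -S -O0 bench.c     # writes the generated assembly to bench.s
grep addl bench.s      # on 32-bit x86 the add should show up as an addl

If the add has been optimized out, it simply won't appear in the assembly.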
It takes four clock ticks because:
1 clock tick to load y
1 clock tick to load z
1 clock tick to add
1 clock tick to put result in x
I could be wrong; it has been a while since I took a computer architecture class. Even if the entire operation takes a single clock tick, the resulting calculation should be 0.5 ns, not 0.0125 ns. The operation is too fast; that is why I think I have something wrong. Thanks for the help.
1 clock tick to load y
1 clock tick to load z
1 clock tick to add
1 clock tick to put result in x
Well, you're partially right. The loads both happen at the same time. However, from an architecture standpoint, it still only takes 1 tick to do the add, because there isn't just one instruction in the pipeline at a time: there's one at fetch, one at decode, and so on. So over the course of a million instructions, provided you can keep the pipeline full (and you don't get pre-empted), you should see an average of about 1 add per clock cycle.
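To put numbers on that, using the 2 GHz figure from the original post: one clock cycle is 1/2,000,000,000 s = 0.5 ns, so 1,000,000 fully pipelined adds take about 1,000,000 x 0.5 ns = 0.5 ms, i.e. roughly 0.5 ns per add on average. A measured 0.0125 ns per add would mean 40 adds per cycle, which is a strong hint that the compiler removed most of them.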
I had to do some benchmarking on an XScale last fall, and the way we did it was with the high-resolution timer: we did all the measurements from within a kernel module while holding all the locks. This ensured that nobody except the interrupt handlers could kick us off the processor.
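(A rough sketch of that kind of in-kernel measurement; get_cycles(), local_irq_save() and the module boilerplate are assumptions here, not the original XScale code.)

#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/irqflags.h>
#include <asm/timex.h>          /* get_cycles(), cycles_t */

static int __init bench_init(void)
{
    unsigned long flags;
    cycles_t start, end;
    volatile int x = 0, y = 1, z = 2;

    local_irq_save(flags);      /* interrupts off: nothing preempts the measurement */
    start = get_cycles();
    x = y + z;                  /* the statement under test */
    end = get_cycles();
    local_irq_restore(flags);

    printk(KERN_INFO "cycles: %llu\n", (unsigned long long)(end - start));
    return 0;
}

static void __exit bench_exit(void) { }

module_init(bench_init);
module_exit(bench_exit);
MODULE_LICENSE("GPL");

Note that the resolution of get_cycles() depends on the architecture (on x86 it reads the timestamp counter).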
Also, compile your code into assembly and make sure the compiler isn't doing any voodoo magic :-P (the -S option on gcc spits out the assembly). Whenever I do benchmarks I specify -O0 and make all the important vars volatile.
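A small illustration of the difference volatile makes (hypothetical code, not from the original post):

#include <stdio.h>

int main(void)
{
    long i;

    /* Without volatile, the optimizer may fold y + z to a constant
       or delete the loop body entirely, since nothing observes x. */
    int x = 0, y = 1, z = 2;
    for (i = 0; i < 1000000; i++)
        x = y + z;

    /* With volatile, every iteration must really load vy and vz and
       store vx, so the add survives even at higher -O levels. */
    volatile int vx = 0, vy = 1, vz = 2;
    for (i = 0; i < 1000000; i++)
        vx = vy + vz;

    printf("%d %d\n", x, (int)vx);
    return 0;
}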