I understand how to benchmark an algorithm with the time function, but what if the algorithm is very fast? Faster than the resolution of the clock? For example, suppose I want to benchmark the following instruction:
int x;
int y;
int z;
x = y + z;
I can't use the standard time function call in C because the resolution of the clock is not fine enough. If I have a 2 GHz CPU this instruction should take around 4 clock ticks (4 x 0.5 ns = 2 ns); I figure with a 2 GHz CPU a single clock tick takes 1/2,000,000,000 s = 0.5 ns. These are just back-of-the-envelope calculations. The logical solution would be to run the instruction in a loop and call that run time LOOP_WITH_INST. I would then run the same loop with no instruction and call that run time LOOP. If I subtract LOOP from LOOP_WITH_INST I get just the run time of the instruction, call it RUN_TIME, executed X times. Using the following equation I can calculate the number of ns that single instruction takes:
(RUN_TIME / X) * 1,000,000,000 = number of ns for that single instruction.
When I try this on my AMD 2400+ running Slackware 10 I don't get anywhere near 2 ns; my final answer is 0.0125 ns. I know I can expect some performance improvement due to the cache, but I didn't expect this much. Any ideas? The code I used is listed below:
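(A minimal sketch of the loop-and-subtract approach described above, not the exact listing from the post; the use of gettimeofday and the iteration count are assumptions.)

#include <stdio.h>
#include <sys/time.h>

#define ITERS 100000000UL           /* X in the formula above (assumed value) */

/* Wall-clock seconds as a double; gettimeofday resolves to about 1 us. */
static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    volatile int x = 0, y = 1, z = 2;   /* volatile so the add isn't optimized away */
    unsigned long i;
    double t0, loop, loop_with_inst;

    t0 = now();
    for (i = 0; i < ITERS; i++)
        ;                               /* empty loop: LOOP */
    loop = now() - t0;

    t0 = now();
    for (i = 0; i < ITERS; i++)
        x = y + z;                      /* loop with the instruction: LOOP_WITH_INST */
    loop_with_inst = now() - t0;

    /* (RUN_TIME / X) * 1,000,000,000 */
    printf("ns per instruction: %f\n", (loop_with_inst - loop) / ITERS * 1e9);
    return 0;
}

Compile with gcc -O0 so the empty loop isn't deleted outright.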
I'm not sure why you think an add will take 4 clock ticks. In fact, if you do enough back-to-back adds it should average out to more like 1 clock tick, because of the pipelining going on in the processor.
Also... make sure you run gcc -S on that to see what the assembly output is, and make sure the compiler isn't optimizing all those instructions out.
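For example (assuming the source file is named bench.c; the name is just for illustration):

gcc -S -O0 bench.c     # writes the generated assembly to bench.s
grep addl bench.s      # on 32-bit x86 the add should show up as an addl

If the add has been optimized out, it simply won't appear in the assembly.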
It takes four clock ticks because:
1 clock tick to load y
1 clock tick to load z
1 clock tick to add
1 clock tick to put result in x
I could be wrong; it has been a while since I took a computer architecture class. Even if the entire operation takes a single clock tick, the resulting calculation should be 0.5 ns, not 0.0125 ns. The operation is too fast; that is why I think I have something wrong. Thanks for the help.
1 clock tick to load y
1 clock tick to load z
1 clock tick to add
1 clock tick to put result in x
Well, you're partially right. The loads both happen at the same time. However, from an architecture standpoint, it still only takes 1 tick to do the add, because there isn't just one instruction in the pipeline at a time: there's one at fetch, one at decode, and so on. So over the course of a million instructions, provided you can keep the pipeline full (and you don't get pre-empted), you should see an average of about 1 add per clock cycle.
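To put numbers on that, using the 2 GHz figure from the original post: one clock cycle is 1/2,000,000,000 s = 0.5 ns, so 1,000,000 fully pipelined adds take about 1,000,000 x 0.5 ns = 0.5 ms, i.e. roughly 0.5 ns per add on average. A measured 0.0125 ns per add would mean 40 adds per cycle, which is a strong hint that the compiler removed most of them.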
I had to do some benchmarking on an XScale last fall, and the way we did it was with the high-resolution timer: we did all the measurements from within a kernel module while holding all the locks. This ensured that nobody except the interrupt handlers could kick us off the processor.
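(A rough sketch of that kind of in-kernel measurement; get_cycles(), local_irq_save() and the module boilerplate are assumptions here, not the original XScale code.)

#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/irqflags.h>
#include <asm/timex.h>          /* get_cycles(), cycles_t */

static int __init bench_init(void)
{
    unsigned long flags;
    cycles_t start, end;
    volatile int x = 0, y = 1, z = 2;

    local_irq_save(flags);      /* interrupts off: nothing preempts the measurement */
    start = get_cycles();
    x = y + z;                  /* the statement under test */
    end = get_cycles();
    local_irq_restore(flags);

    printk(KERN_INFO "cycles: %llu\n", (unsigned long long)(end - start));
    return 0;
}

static void __exit bench_exit(void) { }

module_init(bench_init);
module_exit(bench_exit);
MODULE_LICENSE("GPL");

Note that the resolution of get_cycles() depends on the architecture (on x86 it reads the timestamp counter).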
Also, compile your code into assembly and make sure the compiler isn't doing any voodoo magic :-P (the -S option on gcc spits out the assembly). Whenever I do benchmarks I specify -O0 and make all the important vars volatile.
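A small illustration of the difference volatile makes (hypothetical code, not from the original post):

#include <stdio.h>

int main(void)
{
    long i;

    /* Without volatile, the optimizer may fold y + z to a constant
       or delete the loop body entirely, since nothing observes x. */
    int x = 0, y = 1, z = 2;
    for (i = 0; i < 1000000; i++)
        x = y + z;

    /* With volatile, every iteration must really load vy and vz and
       store vx, so the add survives even at higher -O levels. */
    volatile int vx = 0, vy = 1, vz = 2;
    for (i = 0; i < 1000000; i++)
        vx = vy + vz;

    printf("%d %d\n", x, (int)vx);
    return 0;
}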