I needed to find the time it takes to execute a for loop...
I decided to use the gettimeofday() function and wrote the following code:
#define _GNU_SOURCE   // needed for sched_getaffinity()/sched_setaffinity()
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sched.h>

void test(void);

int main(void) {
    unsigned long mask;
    unsigned int len = sizeof(mask);

    // Get the affinity of the current process
    if (sched_getaffinity(0, len, (cpu_set_t *)&mask) == 0) {
        printf("\nThe affinity mask for current process is: %08lx", mask);
        // Set the affinity of the current process to CPU 1 (mask = 2)
        mask = 2;
        if (sched_setaffinity(0, len, (cpu_set_t *)&mask) == 0) {
            printf("\nChecking that affinity mask has been set to 2 ...");
            // Check that the affinity of this process is now CPU 1
            sched_getaffinity(0, len, (cpu_set_t *)&mask);
            if (mask == 2) {
                printf(" yes (new affinity is: %08lx)\n", mask);
            } else {
                printf(" no (affinity is still: %08lx)\n", mask);
            }
            test();
        } else {
            perror("sched_setaffinity");
        }
    }
    return 0;
}
I ran it on my laptop with Fedora 14 (1 GB memory, Core 2 Duo T7250 @ 2.0 GHz, two cores).
Code:
The affinity mask for current process is: 00000003
Checking that affinity mask has been set to 2 ... yes (new affinity is: 00000002)
Counting to 1000 took: 10 microseconds
Counting to 1000 took: 10 microseconds
Counting to 1000 took: 10 microseconds
Counting to 1000 took: 10 microseconds
Counting to 1000 took: 10 microseconds
Counting to 1000 took: 10 microseconds
Counting to 1000 took: 10 microseconds
Counting to 1000 took: 11 microseconds
Counting to 1000 took: 10 microseconds
Counting to 1000 took: 10 microseconds
Counting to 1000 took: 10 microseconds
Counting to 1000 took: 10 microseconds
Counting to 1000 took: 10 microseconds
Counting to 1000 took: 10 microseconds
Counting to 1000 took: 11 microseconds
Average time taken: 10.133 microseconds
I didn't reboot, didn't disconnect from the network, have about 8 applications running and 30 tabs open in Firefox. And my system is pretty unloaded as shown by the uptime command:
02:07:11 up 6 days, 10:12, 4 users, load average: 0.03, 0.11, 0.13
I wonder if you have something running that's swapping memory.
Does /proc/meminfo show SwapCached as 0KB and SwapTotal == SwapFree?
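To check this quickly, the relevant fields can be pulled straight out of /proc/meminfo:

```shell
grep -E 'SwapCached|SwapTotal|SwapFree' /proc/meminfo
```

If SwapCached is 0 kB and SwapTotal equals SwapFree, nothing has been swapped out.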
There are lots of things that can cause variation. Is your system "tick-less" (ie: CONFIG_NO_HZ=y)? Does your CPU have lots of SMIs? How are you getting your time information (where does gettimeofday get its info from)?
One quick fix I'd make to your program is to use clock_gettime() with CLOCK_MONOTONIC from the rt library (link with -lrt). With gettimeofday(), if some task (say ntpd) is updating your timing source, the information might be bad.
One other thing to check:
you seem to have a lot of rescheduling on core #1. Perhaps set the scheduling policy to SCHED_FIFO / SCHED_RR with a priority of 77 (I just picked that out of thin air - "most prioritized" is 99). That will ensure that your software isn't getting preempted by any kind of software interrupt handler.
Last edited by orgcandman; 12-30-2010 at 12:10 PM.
Sorry for the delay. Now to answer your questions...
@KenJackson
/proc/meminfo shows:
SwapCached: 0kB
and yes SwapTotal is equal to SwapFree
SwapTotal: 2698880 kB
SwapFree: 2698880 kB
@orgcandman
Seems like my system is 'tick-less' (i.e. CONFIG_NO_HZ=y).
grep HZ /boot/config-`uname -r` gives:
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_MACHZ_WDT=m
I tried clock_gettime() as you suggested. This is the body of the new test() (I also account for tv_sec, in case the interval crosses a second boundary):
int i, j;
struct timespec begin_time, end_time;
long long elapsed;
double total = 0;

for (i = 0; i < 15; i++) {
    clock_gettime(CLOCK_MONOTONIC, &begin_time);
    for (j = 0; j < 1000; j++);
    clock_gettime(CLOCK_MONOTONIC, &end_time);
    elapsed = (end_time.tv_sec - begin_time.tv_sec) * 1000000000LL
            + (end_time.tv_nsec - begin_time.tv_nsec);
    printf("\nCounting to 1000 took: %lld nanoseconds", elapsed);
    total += elapsed;
}
printf("\nAverage time taken: %.3f nanoseconds \n", total / 15);
but again the output varies
Counting to 1000 took: 9219 nanoseconds
Counting to 1000 took: 9079 nanoseconds
Counting to 1000 took: 8940 nanoseconds
Counting to 1000 took: 19486 nanoseconds
Counting to 1000 took: 19486 nanoseconds
Counting to 1000 took: 19556 nanoseconds
Counting to 1000 took: 9010 nanoseconds
Counting to 1000 took: 9080 nanoseconds
Counting to 1000 took: 9079 nanoseconds
Counting to 1000 took: 19276 nanoseconds
Counting to 1000 took: 20115 nanoseconds
Counting to 1000 took: 19486 nanoseconds
Counting to 1000 took: 9010 nanoseconds
Counting to 1000 took: 9079 nanoseconds
Counting to 1000 took: 9009 nanoseconds
Average time taken: 13260.667 nanoseconds
I read somewhere that the CONFIG_PREEMPT option in the 2.6 kernel can help achieve soft real-time performance. Apparently, it is not set...
grep PREE /boot/config-`uname -r` gives:
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
Btw, uname -a gives:
Linux HAIL 2.6.26-2-686 #1 SMP Mon Aug 30 07:01:57 UTC 2010 i686 GNU/Linux
I am really running out of ideas... any suggestions?
I decided to run the code on a desktop with Fedora 13 and got much better output:
[sana@localhost Benchmarking-Example]$ ./application
The affinity mask for current process is: 00000003
Checking that affinity mask has been set to 2 ... yes (new affinity is: 00000002)
Counting to 1000 took: 4 microseconds
Counting to 1000 took: 3 microseconds
Counting to 1000 took: 4 microseconds
Counting to 1000 took: 3 microseconds
Counting to 1000 took: 3 microseconds
Counting to 1000 took: 4 microseconds
Counting to 1000 took: 4 microseconds
Counting to 1000 took: 4 microseconds
Counting to 1000 took: 4 microseconds
Counting to 1000 took: 4 microseconds
Counting to 1000 took: 3 microseconds
Counting to 1000 took: 3 microseconds
Counting to 1000 took: 3 microseconds
Counting to 1000 took: 3 microseconds
Counting to 1000 took: 3 microseconds
Average time taken: 3.467 microseconds
[sana@localhost Benchmarking-Example]$
The biggest difference I can see is that CONFIG_HZ is set to 1000 on the desktop and 250 on the laptop. Could this be the cause of the variation?
[sana@localhost ~] $ grep HZ /boot/config-2.6.27.41-170.2.117.fc10.i686
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_MACHZ_WDT=m
[sana@localhost ~]$
Is it possible to change the value of CONFIG_HZ on the laptop to 1000?
Or better still... can I set CONFIG_NO_HZ to n? Would that reduce the variation even more? I am not sure how this is to be done... any ideas?
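For reference, rebuilding the laptop kernel with a 1000 Hz tick and dynamic ticks disabled looks roughly like this (paths and menu labels are from a typical 2.6 source tree; treat this as a sketch, not exact instructions for any particular distro):

```shell
cd /usr/src/linux                       # wherever the kernel source lives
cp /boot/config-$(uname -r) .config     # start from the running kernel's config
make menuconfig
#   Processor type and features --->
#     [ ] Tickless System (Dynamic Ticks)    <- turn off for CONFIG_NO_HZ=n
#     Timer frequency (1000 HZ) --->         <- selects CONFIG_HZ=1000
make && make modules_install && make install
```

CONFIG_HZ and CONFIG_NO_HZ are compile-time options, so there is no way to change them on a running kernel; a rebuild and reboot is required.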
Assuming you compiled with -O0 (no optimization; otherwise the loop may be optimized away), it's just scheduling noise and cache latencies.
Wait for a timer tick (by waiting until user time in getrusage() changes), before starting the measurement to minimize scheduling noise.
Try the following code:
Code:
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>

static inline double toseconds(const struct timeval t)
{
    return (double)t.tv_sec + (double)t.tv_usec / 1000000.0;
}

static inline unsigned long long cycles(void)
{
    unsigned int lo, hi;
    asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    /* Use unsigned long long so the shift is valid on 32-bit (i686) too */
    return ((unsigned long long)hi << 32) | (unsigned long long)lo;
}

volatile int dummy = 0;

int main(int argc, char **argv)
{
    int arg;
    unsigned long long c;
    long i;
    double ssys, suser, swall;
    struct timeval begin, end;
    struct rusage before, after;

    for (arg = 1; arg < argc; arg++) {
        char ignore;
        if (sscanf(argv[arg], "%ld %c", &i, &ignore) != 1) {
            fprintf(stderr, "%s: Invalid number of loop cycles.\n", argv[arg]);
            continue;
        }
        if (i < 1L) {
            fprintf(stderr, "%s: Number of loop cycles is too small.\n", argv[arg]);
            continue;
        }
        printf("%ld loops:\n", i);
        fflush(stdout);

        /* Wait until change at user time in usecs (or secs) */
        getrusage(RUSAGE_SELF, &after);
        do {
            gettimeofday(&begin, NULL);
            getrusage(RUSAGE_SELF, &before);
        } while (before.ru_utime.tv_sec == after.ru_utime.tv_sec &&
                 before.ru_utime.tv_usec == after.ru_utime.tv_usec);
        gettimeofday(&begin, NULL);
        c = cycles();

        /* Work loop -- volatile assignment cannot be optimized away. */
        while (i-- > 0L)
            dummy = 0;

        c = cycles() - c;
        gettimeofday(&end, NULL);
        getrusage(RUSAGE_SELF, &after);

        swall = toseconds(end) - toseconds(begin);
        suser = toseconds(after.ru_utime) - toseconds(before.ru_utime);
        ssys = toseconds(after.ru_stime) - toseconds(before.ru_stime);

        printf("Total cycles: %llu\n", c);
        printf("User time: %.6f seconds\n", suser);
        printf("System time: %.6f seconds\n", ssys);
        printf("Total CPU time: %.6f seconds\n", suser + ssys);
        printf("Wall clock: %.6f seconds\n", swall);
        printf("\n");
        fflush(stdout);
    }
    return 0;
}
You can compile this with or without optimizations (-O0 to -O3); the loop repeatedly zeroes a volatile variable - an operation which the compiler cannot optimize away. Supply the number(s) of loop cycles you wish to test as command line parameters.
The number of clock cycles will vary a bit due to cache effects.
First run will always have a higher clock cycle count, because the code and data will not be in cache. Afterwards, caches are "hot", unless evicted by the kernel (process scheduler), perhaps because another process was run for a while on the same CPU.
Clocks normally change at tick intervals.
I believe Linux gettimeofday() on x86 uses the clock cycle count (rdtsc) internally to achieve better than tick precision.
When measuring user or system time via getrusage(), wait until the struct timeval changes before starting the measurement. Any measured interval is then correct to within one timer tick interval; the result may be one tick too short, but never too long. Measure and add half a tick for best possible resolution: +-0.5 timer tick interval.
If you start measuring elapsed time near the end of the time slice given to this process, you will not only include the scheduler but also other processes run on this core, in the wall clock time; also, the number of cycles taken will vary wildly because the caches may or may not be evicted in the middle of the testing run.
You can check the timer tick (resolution) by calling getrusage() repeatedly, and calculating the delta when the user time changes.
Wall clock includes kernel overhead and other processes run on the same CPU -- and you don't normally want to include those.
User time is the time spent running userspace code, including any libraries.
System time is the time spent in the kernel by this process, for example in reading/writing files or sockets.
CPU time (user + system time) is the total CPU time taken by this process. When you do computation and I/O, this is the most interesting measurement.
Use clock cycle count (rdtsc) to measure pure CPU and CPU+memory workloads; sort algorithms, math, et cetera. Repeat the measurement and check the statistics on the results. If using I/O, ignore or separate those clock cycles.
Hope this helps,
Nominal Animal
Last edited by Nominal Animal; 03-21-2011 at 01:48 AM.
sana@HAIL:~/test$ grep HZ /boot/config-2.6.26
# CONFIG_NO_HZ is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_MACHZ_WDT=m
sana@HAIL:~/test$
I ran the program (the one in my 1st post in this thread) twice:
sana@HAIL:~/test$ ./application
The affinity mask for current process is: 00000001
Checking that affinity mask has been set to 2 ... yes (new affinity is: 00000002)
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 8 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Average time taken: 8.933 microseconds
sana@HAIL:~/test$ ./application
The affinity mask for current process is: 00000001
Checking that affinity mask has been set to 2 ... yes (new affinity is: 00000002)
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 10 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 8 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Counting to 1000 took: 9 microseconds
Average time taken: 9.000 microseconds
sana@HAIL:~/test$
There seems to be much less variation now...
Btw, I chose the Preemption Model as 'Preemptible kernel (Low-Latency Desktop)' during menuconfig. So, now:
sana@HAIL:~/test$ grep PREE /boot/config-2.6.26
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
# CONFIG_PREEMPT_RCU is not set
CONFIG_DEBUG_PREEMPT=y
sana@HAIL:~/test$
Any idea whether choosing 'Voluntary Kernel Preemption' instead would improve results?
Your results won't improve much beyond what they are now. Tickless systems are awful for scheduling latency since they go to sleep (which also means they're great for battery life), as you've noticed. You'll also notice things like select() waits taking much longer than they should on a tickless system.