Hm. Okay, that seems extremely helpful, thanks!
With that in mind, I get reasonably accurate results without having to resort to anything particularly complicated. My findings:
(Hopefully this is helpful to someone; I couldn't find this information elsewhere, so I thought I'd post it. I'm including everything that I, as a relative newbie, found interesting in this exercise.) If anyone has any comments or suggestions, I'd much appreciate them.
flops.c
Code:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    double float_increment = 0.0000000000019346;
    double start = 57.240000;
    double floating_point = start;
    long long i;
    long long operations = 1000000000;

    /* one double-precision addition per iteration */
    for (i = 0; i < operations; ++i) {
        floating_point += float_increment;
    }

    /* print the sum so the result is actually used */
    printf("%lf\n", floating_point);
    return EXIT_SUCCESS;
}
Outputs 57.241935, so no large numbers. I note that I'm measuring double precision FLOPS.
Notes on compilation flags:
Code:
* --unroll-loops makes no difference if -O2 or -O3 is set
* -O2/-O3 etc.: if float_increment is very small (i.e. < 10^-13ish), the
  program executes in 0.001s regardless of the value of
  "operations" (be it 10^6 or 10^18). I assume the compiler
  correctly decides that adding a lot of zeroes (to "double"
  accuracy) to a number will not change the number.
* -O increases speed similarly to -O2 or -O3 for float_increment >
  10^-13ish, but for smaller values:
      operations == 10^8  --> time = 0.02s
      operations == 10^9  --> time = 0.2s
      operations == 10^10 --> time = 20s, a phenomenon which I can't explain
  so I stuck with -O3, where no such anomaly occurs.
Reading the assembly was the most valuable part. (By the way, I'm quite amazed at how human-readable x86 assembly is. It sort of makes sense, even though I only started learning it five minutes ago.)
flops.s
Code:
.file "flops.c"
.section .rodata
.LC2:
.string "%lf\n"
.text
.globl main
.type main, @function
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $64, %esp
fldl .LC0
fstpl 40(%esp)
fldl .LC1
fstpl 32(%esp)
fldl 32(%esp)
fstpl 56(%esp)
movl $1000000000, 24(%esp)
movl $0, 28(%esp)
movl $0, 48(%esp)
movl $0, 52(%esp)
jmp .L2
.L3:
fldl 56(%esp)
faddl 40(%esp)
fstpl 56(%esp)
addl $1, 48(%esp)
adcl $0, 52(%esp)
.L2:
movl 48(%esp), %eax
movl 52(%esp), %edx
cmpl 28(%esp), %edx
jl .L3
cmpl 28(%esp), %edx
jg .L5
cmpl 24(%esp), %eax
jb .L3
.L5:
movl $.LC2, %eax
fldl 56(%esp)
fstpl 4(%esp)
movl %eax, (%esp)
call printf
movl $0, %eax
leave
ret
.size main, .-main
.section .rodata
.align 8
.LC0:
.long -672592136
.long 1024837658
.align 8
.LC1:
.long 1374389535
.long 1078763192
.ident "GCC: (GNU) 4.5.1"
.section .note.GNU-stack,"",@progbits
As I understand it, the program basically runs as follows:
* main runs its setup (15 instructions) and jumps to .L2.
* .L2 executes 3 instructions, then jumps to .L3 if i < 1000000000 (see the constant earlier in the code).
* .L3 executes 5 instructions (the addl/adcl pair is where i gets incremented) and falls through to .L2 again.
* Eventually "jl .L3" will be false, so we get to "jg .L5", and .L5 executes 8 instructions and exits.
So we execute more or less [length(.L3 + {pre-jump .L2}) == 9] * [big constant == 1000000000] == 9000000000 instructions, most of which operate on double-precision floating-point values, so just call all of them FLOPs.
Code:
$ gcc -O3 -o flops flops.c
$ time ./flops
57.241935
real 0m3.026s
user 0m3.023s
sys 0m0.000s
Hence FLOPS = 9000000000 / 3.026 = about 3 GDFLOPS (double precision), which I'm counting as 6 GFLOPS. Just as required, more or less.
A bit big. The CPU was 2x Intel Atom N280 at 1.66GHz.
If I'm correct that:
* movl is a stack (i.e. non-float) operation
* cmpl, addl, adcl, j[l|g] are integer operations
* fldl, faddl, fstpl are our float operations,
then we have only 3000000000 double-float operations, so 3000000000 / 3.026 = about 1 GDFLOPS --> 2 GFLOPS. Makes little difference, but can anyone confirm/correct my reading of the assembly?