Old 09-25-2010, 08:48 AM   #1
duskxiv
LQ Newbie
 
Registered: Mar 2010
Posts: 6

Rep: Reputation: 0
Ways of measuring FLOPS


I've found various programs and methods to measure FLOPS, but I don't seem to understand them. What is wrong with the following?

test-flops.py
Code:
#!/usr/bin/env python
# Python 2 (xrange); each loop iteration performs one floating-point addition
float_increment = 1.03  # arbitrary value
start = 57.24           # arbitrary value
floating_point = start

for i in xrange(10 ** 6):
    floating_point += float_increment
Code:
PROMPT$ time ./test-flops.py
As far as I can see, this times a program that performs 1 million floating-point operations (one addition per iteration).

This takes about 1.2 seconds on my (reasonably fast) machine, giving roughly 840 kFLOPS, which is much too low.

Edit: This post may be in the wrong section; a C translation improves the result by three orders of magnitude (ie, Python is the bottleneck), but the figure is still considerably lower (~7-fold) than published values.

Last edited by duskxiv; 09-25-2010 at 08:59 AM.
 
Old 09-26-2010, 03:05 AM   #2
neonsignal
Senior Member
 
Registered: Jan 2005
Location: Melbourne, Australia
Distribution: Debian Bookworm (Fluxbox WM)
Posts: 1,391
Blog Entries: 54

Rep: Reputation: 360
Benchmarking is a non-trivial exercise. There are many factors (especially for floating point units) that can impact the measurements - data paths, the choice of operations, etc.

Still, if you want to continue on with your own implementation (and it is a good learning exercise), here are some quick thoughts:

* use a compiled language rather than an interpreted one for tight loops and time-critical code (this is why your C implementation runs a lot faster than the Python one)

* make sure you turn on optimization (eg for gcc use '-O3', and there are other flags that can also help), otherwise the floating point values will get loaded and unloaded from the floating point processor each time you perform the operation

* unroll the loop, so that loop overheads do not significantly impact the measurement, eg your example code would become
Code:
for i in xrange(10 ** 5):
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
(or perhaps better, get the compiler to do it for you using '-funroll-loops', since if you unroll too far it can change the way the instruction cache works)

* have a look at the generated code (use the gcc '-S' flag to generate an assembler file) so that you can see what instructions you are actually benchmarking; you don't need to understand all the assembler, just the part around the loop (there will be a label and a jump back to it)

* be careful about special cases; for example, you are repeatedly adding a small number to a growing sum, and once the sum is large enough relative to the increment, the addition can be partially or completely rounded away (see the sketch after this list)

* be aware of the different floating point types (eg double precision vs single precision)
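
To illustrate the special-cases point (a toy example of my own, not part of the benchmark): a double carries roughly 16 significant decimal digits, so an increment far smaller than the sum's least significant digit is rounded away entirely.
Code:
#include <stdio.h>

int main(void) {
    double sum = 1.0e9;
    double tiny = 1.0e-8;  /* far below the last representable digit of sum */
    printf("%.17g\n", sum + tiny);  /* prints 1000000000: the addition is lost */
    return 0;
}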

Last edited by neonsignal; 09-26-2010 at 03:35 AM.
 
Old 09-26-2010, 08:15 AM   #3
duskxiv
LQ Newbie
 
Registered: Mar 2010
Posts: 6

Original Poster
Rep: Reputation: 0
Hm. Okay, that seems extremely helpful, thanks!

With that in mind, I get reasonably accurate results without having to resort to anything particularly complicated. My findings:

(Hopefully this is helpful to someone; I couldn't find this information elsewhere, so I thought I'd post it. I'm including everything that I, as a relative newbie, found interesting in this exercise.) If anyone has comments or suggestions, I'd much appreciate them.

flops.c
Code:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    double float_increment = 0.0000000000019346; /* small, arbitrary increment */
    double start = 57.240000;
    double floating_point = start;
    long long i;
    long long operations = 1000000000; /* 10^9 additions */
    for (i = 0; i < operations; ++i) {
        floating_point += float_increment;
    }
    /* print the result so the compiler cannot discard the loop */
    printf("%lf\n", floating_point);

    return EXIT_SUCCESS;
}
Outputs 57.241935 (57.24 + 10^9 × 1.9346×10^-12 ≈ 57.2419346), so the sum never gets large. I note that I'm measuring double-precision FLOPS.
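
Aside: the timing can also be done inside the program itself, which keeps process startup out of the measurement. Below is a sketch using the POSIX clock_gettime call (this is an illustration, not one of the runs reported here; on older glibc you need to link with -lrt):
Code:
#include <stdio.h>
#include <time.h>

int main(void) {
    double float_increment = 0.0000000000019346;
    double floating_point = 57.24;
    long long i;
    long long operations = 1000000000;
    struct timespec t0, t1;
    double elapsed;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < operations; ++i) {
        floating_point += float_increment;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%lf\n", floating_point); /* keep the result live */
    printf("%.0f additions/second\n", operations / elapsed);
    return 0;
}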

Notes on compilation flags:

* the --unroll-loops flag does nothing if -O2 or -O3 is set

* with -O2/-O3, if float_increment is very small (below ~10^-13) the program executes in 0.001s regardless of the value of "operations" (be it 10^6 or 10^18). I assume the compiler correctly decides that adding a number too small (to "double" accuracy) to change the sum is a no-op, and removes the loop.

* -O increases speed similarly to -O2/-O3 for float_increment above ~10^-13, but for smaller values the loop still runs:
Code:
operations == 10^8  --> time = 0.02s
operations == 10^9  --> time = 0.2s
operations == 10^10 --> time = 20s
The last case is ten times slower than the linear trend suggests, which I can't explain, so I stuck with -O3, where no such anomaly occurs.
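
A standard way to stop the compiler from discarding the loop is to declare the accumulator volatile, which forces a real load and store on every iteration (so it measures memory traffic as well as the additions). A sketch of this variant, which I have not benchmarked above:
Code:
#include <stdio.h>

int main(void) {
    /* volatile forces the compiler to perform every addition,
       at the cost of a load and a store per iteration */
    volatile double floating_point = 57.24;
    double float_increment = 0.0000000000019346;
    long long i;

    for (i = 0; i < 1000000000LL; ++i) {
        floating_point += float_increment;
    }
    printf("%lf\n", floating_point);
    return 0;
}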
Reading the assembly was the most valuable part. (By the way, I'm quite amazed at how human-readable x86 assembly is. It sort of makes sense, and I started learning it five minutes ago.)

flops.s
Code:
	.file	"flops.c"
	.section	.rodata
.LC2:
	.string	"%lf\n"
	.text
.globl main
	.type	main, @function
main:
	pushl	%ebp
	movl	%esp, %ebp
	andl	$-16, %esp
	subl	$64, %esp
	fldl	.LC0
	fstpl	40(%esp)
	fldl	.LC1
	fstpl	32(%esp)
	fldl	32(%esp)
	fstpl	56(%esp)
	movl	$1000000000, 24(%esp)
	movl	$0, 28(%esp)
	movl	$0, 48(%esp)
	movl	$0, 52(%esp)
	jmp	.L2
.L3:
	fldl	56(%esp)
	faddl	40(%esp)
	fstpl	56(%esp)
	addl	$1, 48(%esp)
	adcl	$0, 52(%esp)
.L2:
	movl	48(%esp), %eax
	movl	52(%esp), %edx
	cmpl	28(%esp), %edx
	jl	.L3
	cmpl	28(%esp), %edx
	jg	.L5
	cmpl	24(%esp), %eax
	jb	.L3
.L5:
	movl	$.LC2, %eax
	fldl	56(%esp)
	fstpl	4(%esp)
	movl	%eax, (%esp)
	call	printf
	movl	$0, %eax
	leave
	ret
	.size	main, .-main
	.section	.rodata
	.align 8
.LC0:
	.long	-672592136
	.long	1024837658
	.align 8
.LC1:
	.long	1374389535
	.long	1078763192
	.ident	"GCC: (GNU) 4.5.1"
	.section	.note.GNU-stack,"",@progbits
As I understand this, the program basically runs as follows:
* run main (19 instructions); we are now at .L2
* .L2 executes 3 instructions, then jumps back to .L3 if i < 1000000000 (see the constant earlier in the code); we are now at .L3
* .L3 executes 5 instructions (somewhere in here i gets incremented), and we fall through to .L2
* eventually the 'jl .L3' (and 'jb .L3') branches stop being taken, so we reach .L5, which executes 8-odd instructions and exits

So we more or less execute [length(.L3 + pre-jump .L2) == 9] * [iterations == 1000000000] == 9000000000 instructions, most of which operate on double-precision floating point values, so just say all of them are FLOPs.

Code:
$ gcc -O3 -o flops flops.c
$ time ./flops
57.241935

real	0m3.026s
user	0m3.023s
sys	0m0.000s
Hence FLOPS = 9000000000 / 3.026 ≈ 3 G double-precision FLOPS (call it 6 GFLOPS if each double-precision operation is counted as two single-precision ones). Just as required, more or less, if a bit big. The CPU was 2x Intel Atom N280 at 1.66GHz.

If I'm correct that:
* movl is a stack transfer (ie, a non-float operation)
* cmpl, addl, adcl, and j[l|g] are integer operations
* fldl, faddl, and fstpl are our float operations,
then we have only 3000000000 double-float operations --> 2 GFLOPS. Makes little difference, but can anyone confirm/correct my assembly?
 
Old 09-26-2010, 09:43 AM   #4
neonsignal
Senior Member
 
Registered: Jan 2005
Location: Melbourne, Australia
Distribution: Debian Bookworm (Fluxbox WM)
Posts: 1,391
Blog Entries: 54

Rep: Reputation: 360
Thanks for posting back.

The flag is actually '-funroll-loops', not '--unroll-loops'. If you produce the assembler from your code with 'gcc -O3 -funroll-loops -S', the loop section should end up looking something like this:

Code:
.L2:
	fadd	%st, %st(1)
	addl	$8, %eax
	cmpl	$1000000000, %eax
	fadd	%st, %st(1)
	fadd	%st, %st(1)
	fadd	%st, %st(1)
	fadd	%st, %st(1)
	fadd	%st, %st(1)
	fadd	%st, %st(1)
	fadd	%st, %st(1)
	jne	.L2
You are correct that the fadd etc instructions are the floating point ones. fldl is a load and fstpl is a store (and pop); these just transfer values to and from the FPU stack. You will notice that the loop above contains no fldl or fstpl, because the optimization removes them (they happen once, before and after the loop). And the compiler has unrolled the loop so that 8 iterations are done at a time.

Last edited by neonsignal; 09-26-2010 at 10:11 AM.
 
Old 09-26-2010, 04:45 PM   #5
neonsignal
Senior Member
 
Registered: Jan 2005
Location: Melbourne, Australia
Distribution: Debian Bookworm (Fluxbox WM)
Posts: 1,391
Blog Entries: 54

Rep: Reputation: 360
Quote:
Originally Posted by duskxiv
* fldl, faddl, fstpl are our float operations,
then we have only 3000000000 double-float operations --> 2 GFLOPS. Makes little difference, but can anyone confirm/correct my assembly?
You wouldn't include fldl and fstpl in the calculation, because these do no useful work. So the loop has really only performed 1 billion operations, which means a rate of about 331 MFLOPS.

Also, a double precision operation does not necessarily take twice as long as a single precision one. On some of the Intel FPUs, it will actually be the same (depending on the operation being performed). The 331 MFLOPS value is a realistic one under some circumstances.

However, a modern FPU will also allow the pipelining of operations, meaning that it can process multiple operations simultaneously provided that they are working on different operands. So something interesting to try is to add a second accumulation to your loop, eg

Code:
for (i = 0; i < operations; ++i) {
        floating_point += float_increment;
        floating_point2 += float_increment2;
}
Such calculations are quite common in real-world problems; for example, the inner loop of a fast Fourier transform involves multiple independent operands and operations.
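
For anyone who wants to try it, a complete version of that experiment might look like the sketch below (my expansion of the fragment above; the second increment and starting value are arbitrary). Comparing its running time against the single-accumulator version shows whether the FPU is overlapping the two dependency chains.
Code:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    double float_increment = 0.0000000000019346;
    double float_increment2 = 0.0000000000023817; /* arbitrary second increment */
    double floating_point = 57.24;
    double floating_point2 = 13.37;               /* arbitrary second accumulator */
    long long i;
    long long operations = 1000000000;

    /* the two additions are independent, so a pipelined FPU can
       overlap them instead of waiting for each result in turn */
    for (i = 0; i < operations; ++i) {
        floating_point += float_increment;
        floating_point2 += float_increment2;
    }

    printf("%lf %lf\n", floating_point, floating_point2); /* keep both results live */
    return EXIT_SUCCESS;
}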

Last edited by neonsignal; 09-26-2010 at 04:46 PM.
 
Old 09-27-2010, 07:57 AM   #6
MTK358
LQ 5k Club
 
Registered: Sep 2009
Posts: 6,443
Blog Entries: 3

Rep: Reputation: 723
Benchmarking with Python sure won't work: because it's an interpreted language, most of the computing power goes to interpreting the Python code, not doing the computations.
 
  

