Old 09-25-2010, 08:48 AM   #1
duskxiv
LQ Newbie
 
Registered: Mar 2010
Posts: 6

Rep: Reputation: 0
Ways of measuring FLOPS


I've found various programs and methods to measure FLOPS, but I don't seem to understand them. What is wrong with the following?

test-flops.py
Code:
#!/usr/bin/env python
# Python 2 (xrange); each loop iteration performs one floating-point addition
float_increment = 1.03  # arbitrary value
start = 57.24           # arbitrary value
floating_point = start

for i in xrange(10 ** 6):
    floating_point += float_increment
Code:
PROMPT$ time ./test-flops.py
As far as I can see, this times a program that performs 1 million floating-point operations (one addition per iteration).

This takes about 1.2 seconds on my (reasonably fast) machine, giving roughly 840 kFLOPS, which is much too low.

Edit: This post may be in the wrong section; a C translation improves the result by three orders of magnitude (ie, Python is the bottleneck), but the figure is still considerably lower (~7-fold) than published values.

Last edited by duskxiv; 09-25-2010 at 08:59 AM.
 
Old 09-26-2010, 03:05 AM   #2
neonsignal
Senior Member
 
Registered: Jan 2005
Location: Melbourne, Australia
Distribution: Debian Bookworm (Fluxbox WM)
Posts: 1,391
Blog Entries: 54

Rep: Reputation: 360
Benchmarking is a non-trivial exercise. There are many factors (especially for floating point units) that can impact the measurements - data paths, the choice of operations, etc.

Still, if you want to continue on with your own implementation (and it is a good learning exercise), here are some quick thoughts:

* use a compiled language rather than an interpreted one for tight loops and time-critical code (this is why your C implementation runs a lot faster than the Python one)

* make sure you turn on optimization (eg for gcc use '-O3', and there are other flags that can also help), otherwise the floating point values will get loaded and unloaded from the floating point processor each time you perform the operation

* unroll the loop, so that loop overheads do not significantly impact the measurement, eg your example code would become
Code:
for i in xrange(10 ** 5):
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
    floating_point += float_increment
(or perhaps better, get the compiler to do it for you using '-funroll-loops', since if you unroll too far it can change the way the instruction cache works)

* have a look at the generated code (use the gcc '-S' flag to generate an assembler file) so that you can see what instructions you are actually benchmarking; you don't need to understand all the assembler, just the part around the loop (there will be a label and a jump back to it)

* be careful about special cases; for example, you are repeatedly adding a small number to a growing sum, and once the sum is large enough relative to the increment, the addition can be partially or completely rounded away (see the sketch after this list)

* be aware of the different floating point types (eg double precision vs single precision)
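
To illustrate the special-cases point (a toy example of my own, not part of the benchmark): a double carries roughly 16 significant decimal digits, so an increment far smaller than the sum's least significant digit is rounded away entirely.
Code:
#include <stdio.h>

int main(void) {
    double sum = 1.0e9;
    double tiny = 1.0e-8;  /* far below the last representable digit of sum */
    printf("%.17g\n", sum + tiny);  /* prints 1000000000: the addition is lost */
    return 0;
}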

Last edited by neonsignal; 09-26-2010 at 03:35 AM.
 
Old 09-26-2010, 08:15 AM   #3
duskxiv
LQ Newbie
 
Registered: Mar 2010
Posts: 6

Original Poster
Rep: Reputation: 0
Hm. Okay, that seems extremely helpful, thanks!

With that in mind, I get reasonably accurate results without having to resort to anything particularly complicated. My findings:

(Hopefully this is helpful to someone; I couldn't find this information elsewhere, so I thought I'd post it. I'm including everything that I, as a relative newbie, found interesting in this exercise.) If anyone has comments or suggestions, I'd much appreciate them.

flops.c
Code:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    double float_increment = 0.0000000000019346; /* small, arbitrary increment */
    double start = 57.240000;
    double floating_point = start;
    long long i;
    long long operations = 1000000000; /* 10^9 additions */
    for (i = 0; i < operations; ++i) {
        floating_point += float_increment;
    }
    /* print the result so the compiler cannot discard the loop */
    printf("%lf\n", floating_point);

    return EXIT_SUCCESS;
}
Outputs 57.241935 (57.24 + 10^9 × 1.9346×10^-12 ≈ 57.2419346), so the sum never gets large. I note that I'm measuring double-precision FLOPS.
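
Aside: the timing can also be done inside the program itself, which keeps process startup out of the measurement. Below is a sketch using the POSIX clock_gettime call (this is an illustration, not one of the runs reported here; on older glibc you need to link with -lrt):
Code:
#include <stdio.h>
#include <time.h>

int main(void) {
    double float_increment = 0.0000000000019346;
    double floating_point = 57.24;
    long long i;
    long long operations = 1000000000;
    struct timespec t0, t1;
    double elapsed;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < operations; ++i) {
        floating_point += float_increment;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%lf\n", floating_point); /* keep the result live */
    printf("%.0f additions/second\n", operations / elapsed);
    return 0;
}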

Notes on compilation flags:

* the --unroll-loops flag does nothing if -O2 or -O3 is set

* with -O2/-O3, if float_increment is very small (below ~10^-13) the program executes in 0.001s regardless of the value of "operations" (be it 10^6 or 10^18). I assume the compiler correctly decides that adding a number too small (to "double" accuracy) to change the sum is a no-op, and removes the loop.

* -O increases speed similarly to -O2/-O3 for float_increment above ~10^-13, but for smaller values the loop still runs:
Code:
operations == 10^8  --> time = 0.02s
operations == 10^9  --> time = 0.2s
operations == 10^10 --> time = 20s
The last case is ten times slower than the linear trend suggests, which I can't explain, so I stuck with -O3, where no such anomaly occurs.
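
A standard way to stop the compiler from discarding the loop is to declare the accumulator volatile, which forces a real load and store on every iteration (so it measures memory traffic as well as the additions). A sketch of this variant, which I have not benchmarked above:
Code:
#include <stdio.h>

int main(void) {
    /* volatile forces the compiler to perform every addition,
       at the cost of a load and a store per iteration */
    volatile double floating_point = 57.24;
    double float_increment = 0.0000000000019346;
    long long i;

    for (i = 0; i < 1000000000LL; ++i) {
        floating_point += float_increment;
    }
    printf("%lf\n", floating_point);
    return 0;
}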
Reading the assembly was the most valuable part. (By the way, I'm quite amazed at how human-readable x86 assembly is. It sort of makes sense, and I started learning it five minutes ago.)

flops.s
Code:
	.file	"flops.c"
	.section	.rodata
.LC2:
	.string	"%lf\n"
	.text
.globl main
	.type	main, @function
main:
	pushl	%ebp
	movl	%esp, %ebp
	andl	$-16, %esp
	subl	$64, %esp
	fldl	.LC0
	fstpl	40(%esp)
	fldl	.LC1
	fstpl	32(%esp)
	fldl	32(%esp)
	fstpl	56(%esp)
	movl	$1000000000, 24(%esp)
	movl	$0, 28(%esp)
	movl	$0, 48(%esp)
	movl	$0, 52(%esp)
	jmp	.L2
.L3:
	fldl	56(%esp)
	faddl	40(%esp)
	fstpl	56(%esp)
	addl	$1, 48(%esp)
	adcl	$0, 52(%esp)
.L2:
	movl	48(%esp), %eax
	movl	52(%esp), %edx
	cmpl	28(%esp), %edx
	jl	.L3
	cmpl	28(%esp), %edx
	jg	.L5
	cmpl	24(%esp), %eax
	jb	.L3
.L5:
	movl	$.LC2, %eax
	fldl	56(%esp)
	fstpl	4(%esp)
	movl	%eax, (%esp)
	call	printf
	movl	$0, %eax
	leave
	ret
	.size	main, .-main
	.section	.rodata
	.align 8
.LC0:
	.long	-672592136
	.long	1024837658
	.align 8
.LC1:
	.long	1374389535
	.long	1078763192
	.ident	"GCC: (GNU) 4.5.1"
	.section	.note.GNU-stack,"",@progbits
As I understand this, the program basically runs as follows:
* run main (19 instructions); we are now at .L2
* .L2 executes 3 instructions, then jumps back to .L3 if i < 1000000000 (see the constant earlier in the code); we are now at .L3
* .L3 executes 5 instructions (somewhere in here i gets incremented), and we fall through to .L2
* eventually the 'jl .L3' (and 'jb .L3') branches stop being taken, so we reach .L5, which executes 8-odd instructions and exits

So we more or less execute [length(.L3 + pre-jump .L2) == 9] * [iterations == 1000000000] == 9000000000 instructions, most of which operate on double-precision floating point values, so just say all of them are FLOPs.

Code:
$ gcc -O3 -o flops flops.c
$ time ./flops
57.241935

real	0m3.026s
user	0m3.023s
sys	0m0.000s
Hence FLOPS = 9000000000 / 3.026 ≈ 3 G double-precision FLOPS (call it 6 GFLOPS if each double-precision operation is counted as two single-precision ones). Just as required, more or less, if a bit big. The CPU was 2x Intel Atom N280 at 1.66GHz.

If I'm correct that:
* movl is a stack transfer (ie, a non-float operation)
* cmpl, addl, adcl, and j[l|g] are integer operations
* fldl, faddl, and fstpl are our float operations,
then we have only 3000000000 double-float operations --> 2 GFLOPS. Makes little difference, but can anyone confirm/correct my assembly?
 
Old 09-26-2010, 09:43 AM   #4
neonsignal
Senior Member
 
Registered: Jan 2005
Location: Melbourne, Australia
Distribution: Debian Bookworm (Fluxbox WM)
Posts: 1,391
Blog Entries: 54

Rep: Reputation: 360
Thanks for posting back.

The flag is actually '-funroll-loops', not '--unroll-loops'. If you produce the assembler from your code with 'gcc -O3 -funroll-loops -S', the loop section should end up looking something like this:

Code:
.L2:
	fadd	%st, %st(1)
	addl	$8, %eax
	cmpl	$1000000000, %eax
	fadd	%st, %st(1)
	fadd	%st, %st(1)
	fadd	%st, %st(1)
	fadd	%st, %st(1)
	fadd	%st, %st(1)
	fadd	%st, %st(1)
	fadd	%st, %st(1)
	jne	.L2
You are correct that the fadd etc instructions are the floating point ones. fldl is a load and fstpl is a store (and pop); these just transfer values to and from the FPU stack. You will notice that the loop above contains no fldl or fstpl, because the optimization removes them (they happen once, before and after the loop). And the compiler has unrolled the loop so that 8 iterations are done at a time.

Last edited by neonsignal; 09-26-2010 at 10:11 AM.
 
Old 09-26-2010, 04:45 PM   #5
neonsignal
Senior Member
 
Registered: Jan 2005
Location: Melbourne, Australia
Distribution: Debian Bookworm (Fluxbox WM)
Posts: 1,391
Blog Entries: 54

Rep: Reputation: 360
Quote:
Originally Posted by duskxiv
* fldl, faddl, fstpl are our float operations,
then we have only 3000000000 double-float operations --> 2 GFLOPS. Makes little difference, but can anyone confirm/correct my assembly?
You wouldn't include fldl and fstpl in the calculation, because these do no useful work. So the loop has really only performed 1 billion operations, which means a rate of about 331 MFLOPS.

Also, a double precision operation does not necessarily take twice as long as a single precision one. On some of the Intel FPUs, it will actually be the same (depending on the operation being performed). The 331 MFLOPS value is a realistic one under some circumstances.

However, a modern FPU will also allow the pipelining of operations, meaning that it can process multiple operations simultaneously provided that they are working on different operands. So something interesting to try is to add a second accumulation to your loop, eg

Code:
for (i = 0; i < operations; ++i) {
        floating_point += float_increment;
        floating_point2 += float_increment2;
}
Such calculations are quite common in real-world problems; for example, the inner loop of a fast Fourier transform involves multiple independent operands and operations.
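
For anyone who wants to try it, a complete version of that experiment might look like the sketch below (my expansion of the fragment above; the second increment and starting value are arbitrary). Comparing its running time against the single-accumulator version shows whether the FPU is overlapping the two dependency chains.
Code:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    double float_increment = 0.0000000000019346;
    double float_increment2 = 0.0000000000023817; /* arbitrary second increment */
    double floating_point = 57.24;
    double floating_point2 = 13.37;               /* arbitrary second accumulator */
    long long i;
    long long operations = 1000000000;

    /* the two additions are independent, so a pipelined FPU can
       overlap them instead of waiting for each result in turn */
    for (i = 0; i < operations; ++i) {
        floating_point += float_increment;
        floating_point2 += float_increment2;
    }

    printf("%lf %lf\n", floating_point, floating_point2); /* keep both results live */
    return EXIT_SUCCESS;
}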

Last edited by neonsignal; 09-26-2010 at 04:46 PM.
 
Old 09-27-2010, 07:57 AM   #6
MTK358
LQ 5k Club
 
Registered: Sep 2009
Posts: 6,443
Blog Entries: 3

Rep: Reputation: 723
Benchmarking with Python sure won't work: because it's an interpreted language, most of the computing power goes to interpreting the Python code, not doing the computations.
 
  

