How to use kernel_fpu_begin and kernel_fpu_end correctly?

ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Welcome to LinuxQuestions.org, a friendly and active Linux Community.

You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!

Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.

If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.

Having a problem logging in? Please visit this page to clear all LQ-related cookies.

Introduction to Linux - A Hands on Guide

This guide was created as an overview of the Linux Operating System, geared toward new users as an exploration tour and getting started guide, with exercises at the end of each chapter.
For more advanced trainees it can be a desktop reference, and a collection of the base knowledge needed to proceed with system and network administration. This book contains many real life examples derived from the author's experience as a Linux system and network administrator, trainer and consultant. They hope these examples will help you to get a better understanding of the Linux system and that you feel encouraged to try out things on your own.

Click Here to receive this Complete Guide absolutely free.

Above, << 8U is unsigned multiplication by 256, and >> 8U is unsigned division by 256, so the expression is essentially (256 * [I]avgRTT[I] - avgRTT + rtt)/256 but with explicit casting to make sure the expression cannot overflow, and using a bit shift to avoid explicit 64-bit multiplication and division. It should work fine on all 32-bit and 64-bit architectures.

using a bit shift to avoid explicit 64-bit multiplication and division.

I was under the impression that the compiler can be trusted to do this kind of optimization, but I just checked with gcc 4.5.2 (32bit), and it only avoids the division. The version with multiplication does use fewer instructions, perhaps gcc decided it would be faster?

I was under the impression that the compiler can be trusted to do this kind of optimization, but I just checked with gcc 4.5.2 (32bit), and it only avoids the division. The version with multiplication does use fewer instructions, perhaps gcc decided it would be faster?

In normal userspace application code I never "optimize" multiplications to shifts and additions/subtractions (since profiling never points to that kind of code to be a bottleneck). The compilers do a good enough job, and readability is so much more important there.

In kernel and microcontroller code this tends to matter, and we can use more complicated expressions, requiring more skill from developers.

On x86-64 (Athlon II X4 640), using gcc -W -Wall -O3 -fomit-frame-pointer, the two functions

Adding -m32 makes gcc-4.6.3 stupid; it compiles the equivalent only for the multiplication case. gcc-4.4 -m32 -W -Wall -O3 -fomit-frame-pointer compiles the shift case to an explicit shift:

with both running in the exact same time (67 cycles including the function call overhead) for random 32-bit unsigned inputs. (Note that gcc-4.6.3 generates consistently one or two clock cycles slower code!)

Now, avr2 is an 8-bit instruction set, which should explain why the assembly is so long. Multi-register shifts are done in the __ashldi3()/__lshldi3() and __ashrdi3() gcc-built-in functions. Even without going into the details, the code with the explicit shifts (red) is shorter and simpler: just 20 conditional jumps. In comparison, the code with the explicit multiplication (blue) is a lot longer and more complex, something like 36 conditional jumps.

Using hand assembly on an 8-bit architecture, all you really need is the logical equivalent of

Code:

# avg[0] is the low byte of avg, avg[3] is the highest byte
# val[0] is the low byte of val, val[3] is the highest byte
# res[0] is the low byte of result, res[3] is the highest byte
set a0 = 0
set a1 = avg[0]
set a2 = avg[1]
set a3 = avg[2]
set a4 = avg[3]
add res[0] to a0
add res[1] and carry to a1
add res[2] and carry to a2
add res[3] and carry to a3
add 0 and carry to a4
subtract avg[0] from a0
subtract avg[1] and carry from a1
subtract avg[2] and carry from a2
subtract avg[3] and carry from a3
subtract 0 and carry from a4
set res[0] = a1
set res[1] = a2
set res[2] = a3
set res[3] = a4

where carry is the carry flag. All CPUs and microcontrollers do have a carry flag; addition and subtraction with or without carry are usually separate mnemonics. Using avr2 MCU, the above code takes about twenty lines or so, and is at least an order of magnitude faster than the C code.

It is such a drastic improvement compared to the C code, that I'm wondering if I should have used the explicit multiplication only after all: on any limited architectures the code is better written in hand assembly (perhaps inlined, in C source files).

To be honest, I've only ever seen the division case being handled stupidly by the compiler, never the multiplication case. (If you ever run into a reference to __udivdi3(), you'll know what I mean.) That is why I used the explicit shift. Now I think I probably should not have, since the multiplication code is much more readable, and just as good for the generic cases. On the difficult architectures like avr2, writing it in inline assembly gives you an order of magnitude simpler and faster code, so tweaking the little bits is insignificant in comparison.

In other words, a very good point, ntubski. I guess it was an ill-advised premature optimization on my part.

The floating-point hardware of the CPU is fast and reliable: if in your estimation you need to use it in kernel-level code, do so.

Here's why kernel_fpu_begin/end calls are necessary: because the operating system is "lazy" about storing and restoring the floating-point context. It's a slightly expensive thing to do, and ... gosh darn it ... in a useful number of cases, the system might "bounce" into kernel mode, do something (that, as it happens, doesn't involve any floating point at all), and then "bounce" right back to running the same thread/process as before. No floating-point work was done. Therefore, the floating-point context could simply have remained exactly as it was throughout. And so, it's a usefully beneficial optimization for this to be the case. "Floating point kernel code" happens to be the exception rather than the rule. But, if you need it, "just do it."

Last edited by sundialsvcs; 06-25-2012 at 10:31 PM.

It works for 32-bit unsigned integers, or for any fixed-point unsigned 32-bit type. You can use

Code:

avgRTT = (avgRTT * 255 + rtt) >> 8;

for 24-bit signed or unsigned integers if the variables are of u32/i32 type, and for 56-bit signed or unsigned integers if the variables are of u64/i64 type.

For example, if you know you need six decimal digits on the right side, i.e. x.yyyyyy, you can simply multiply the value by a million when storing it in the integer. In the 24-bit case, your unsigned range is [0, 16.777215], and signed range is [-8.388608, 8.388607]. In the 56-bit case, your unsigned range is [0, 72057594037.927935], and signed range [-36028797018.963968, 36028797018.963967]. To read a fixed-point value from a string, you read the two integers on either side of the decimal point, x.y, except only up to six initial digits for y. If you get fewer than six digits for y, append zeros (multiply by ten), until you used six. Then,

Code:

value = x * 1000000 + y;

gives you the fixed-point value. To output it, you can use (a variation of)

You can use any multiplier you choose, even one that is only known at runtime. Powers of two are very common, but for times, powers of ten are much simpler. In the kernel, nanoseconds (nine decimals) seem to be used quite widely; the 56-bit variant gives you a 833-day range. Full 64 bits gives 580 years or so at nanosecond precision, but remember the shift needs the extra eight bits. Microseconds (six decimals) seem to be used in networking; the 24-bit variant gives a range of about 16 seconds. For milliseconds (three decimals), the 24-bit variant gives a range of about four hours forty minutes.

Parsing non-powers of ten fixed-point values gets a bit tedious, unless you accept some error in the least significant digit. In practice, I just parse as if the scaling was the next power of then, multiply by multiplier, and divide by the power of ten I used. It introduces a bit of rounding error, though. And often requires 64-bit division (by a power of ten), using __udivdi3(), if that says anything to you.

The reason I'd prefer you to avoid floating point is that such code is extremely difficult to get into upstream kernels. The first thing you'll be told is to convert the math to fixed point. If you refuse, your code will only run on a limited number of architectures -- a lot of the embedded ones have no hardware floating-point support at all, and just cannot use floating-point kernel code. If you work with kernel code, why not do it in a way that gives you the skills necessary to be a Linux kernel programmer? Anybody can add their own code to the Linux kernel, but not everybody can get their code accepted upstream.

Kernel coding is not about writing code that works, it is about writing code that works and is acceptable to the maintainers/community.

LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.