LinuxQuestions.org - Performance of switch cases in C

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - Performance of switch cases in C (https://www.linuxquestions.org/questions/programming-9/performance-of-switch-cases-in-c-4175516251/)

Performance of switch cases in C

Part 1) An interesting finding, not a question.

I did a test to compare the performance on the following statements in C:

Fall-Through

Code:

switch (loc) {

case 0: ...; /* fall-through */

case 1: ...; /* fall-through */

case ...

case 15: ...;

    break;

case 16: ...; /* fall-through */

case 17: ...; /* fall-through */

case ...

case 31: ...;

    break;

...

case 255: ...;

    break;

}

case-break-case-break

Code:

switch (loc) {

case 0: ...; ...; ...; ...; ...  ...; ...; break;

case 1: ...; ...; ...; ...; ...  ...; break;

case ... /* you get the idea */

case 15: ...; break;



case 16: ...; ...; ...; ...; ...  ...; ...; break;

case 17: ...; ...; ...; ...; ...  ...; break;

case ...

case 31: ...; break;



...

case 255: ...; break;

}

Result shows that the case-break-case-break statement is more efficient than the Fall-Through.

Part 2) The question

Now since I strongly believe in my finding in Part 1), I'm going to use this case-break-case-break approach for maximum efficiency. If there is no fall-through, then the case-groups can be re-arranged in any order I want. My question is, does the position of the case-group in the switch-statement affect performance when it is called very often?

Say the variable 'loc' is in normal distribution ranges from 0 to 255, then if should I put case 127: case 128: first and case 0: case 255: last for maximum efficiency?

fyi.
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52)

I think you will never get a "this or that is better" answer, because it always depend on (at least two different) things:
1. it depends on the functionality, what is implemented in those cases. Your second example needs usually more space, but
2. it also depends on the optimization, and that may completely reorganize the flow of the program.

My very personal opinion is the following: optimization will do a lot of things and probably knows some tricks better (especially on low level - assembly), but we can have a much better view on the code and functionality and therefore sometimes we can make much better high level optimizations.

Is this what you mean?

Quote:

Originally Posted by Jerry Mcguire (Post 5226671)

Fall-Through

Code:

switch (loc) {

case 0: ...; /* fall-through */

case 1: ...; /* fall-through */

case ...

case-break-case-break

Code:

switch (loc) {

case 0: ...; < case 1 etc. actions .... >; break;

case 1: ...; < case 2 etc. actions .... >; break;

case ...

If this is the case then I don't see case-break-case-break doing anything other than adding more code.

Anything else and the actions of the two pieces of code won't be the same.

Duplicating code is bad. Very bad. Don't do it.
Even if you gain 0.01 percent in speed, it doesn't worth it.

Thanks for replying :)

The gain is about 3%. The repeated code is written in the preprocessor to keep it manageable.
I gained access to a large amount of processing power recently so I re-did the Eternity II search for fun, achieved over 1 peta-legitimate moves per day.

Any insight on Part 2 the question?

Have you tried examining the assembly code that's produced in each case?

My guess is that the first one checks the condition, then jumps to the code in the default block, while the second one checks the condition, then continues to the very next instruction that has been inlined after it. The second one would not only save a jump, but also cache much better.

Do check the assembler to confirm or refute that.

Good read on the subject:

http://741mhz.com/switch/

a jump table , something like array[255] , is very fast
and something like this might be generated for the non fall through part
it helps the compiler if you cases are ordered,
the same is valid for if, else if, (I read this some time ago)

I once saw program, searching char,
also with a lot of fall through
the fastest was a char[255] and have the search values on the location
the case was then something like *(&array + val)
was faster than if and case,
but if there is code in the case it might be different, would be interesting to test, if you do it, please tell the results

I don't know a bit about assembly but instinct says the same as your view. I'm guessing each fall-through involves a jump of execution, so on average the 16 cases (15 fall-throughs) adds 7.5 jumps per iteration. The source code is just an example, not the one that saves 3%.

:) But again, I need opinion on the position of the case-group. Instinct also says the position doesn't matter, since it's 1 jump any way. But instinct doesn't tell about the efficiency of switch(loc) on 256 cases. I have no idea how this is implemented in the assembly code. If it is a if-else-if like expansion, I would say moving the case 127 to the front would likely help... I don't know.

you should read the link Guttorm has proided, it contains all info you search and eve more

Quote:

Originally Posted by Jerry Mcguire (Post 5226724)

I'm guessing each fall-through involves a jump of execution, so on average the 16 cases (15 fall-throughs) adds 7.5 jumps per iteration.

I would have thought the opposite - mainly that a fall-through means "don't jump" - conditionally or not.

What optimization options are you using? I'm surprised that gcc isn't inlining the first case, and if is then the two should actually be identical.

Anyway:

As mentioned, the cost of a jump is not the jump instruction itself (which is negligible), it's the potential for cache misses (which isn't). CPU's fetch data and instructions into the cache according to the principle of locality; when an instruction is executed, the next few instructions are fetched into the cache. If the instructions are not in the cache, then they must be fetched from main memory, which, according to the following graphic, is 100 times slower:

http://www.eecs.berkeley.edu/~rcs/re...e_latency.html

Let's say that fetching an instruction from the cache takes 1ns, and fetching a block of instructions from main memory takes 100ns.

The instructions are laid out consecutively when they're in main memory.

1. instruction 1
2. instruction 2
3. instruction 3
...
n. instruction n

And if the CPU instruction cache has room for eight instructions, then they're fetched in consecutive blocks of 8. With no jumps, it would be: instructions one through eight, then instructions 9 through 16, etc.

To start, the CPU fetches the first eight instructions from main memory into the cache. It fetches instruction 1 from the cache, executes it, fetches instruction 2 from the cache, executes it, and so on. Each instruction takes 1ns to fetch from the cache.

But what if there's a jump? The CPU fetches instruction 3 from the cache, and the instruction is JUMP TO INSTRUCTION 20. Instruction 20 is not in the cache, so the CPU first clears the cache, then fetches instructions #17 through #24 from main memory into the cache. That takes an extra 100ns! It then executes instruction 20. Because of this, a jump-plus-instruction pair is 100 times more expensive than just the instruction itself.

That's very oversimplified, of course (I am not a CPU engineer), but it should make it clear where the delay is probably coming from.

Quote:

Originally Posted by psionl0 (Post 5226755)

I would have thought the opposite - mainly that a fall-through means "don't jump" - conditionally or not.

Fall-through should mean "unconditional jump", shouldn't it?

If you are looking into optimization, the easiest way to optimize is trial and error. You can also mess with gcc compile options.

Switches tend to be easy for the compiler to optimize, and gcc will optimize them as best it can. If you want to see what it is doing you can use the '-save-temps' option and look at the assembly. You don't really have to know assembly to guess at what is going on. Lots of testing will still be required for maximum performance.

Keep it simple and logical. Compilers are written to be able to understand and optimize logical constructs as assembly instructions. The easier your code is to understand by the compiler, the faster it will be.

This post was originally some examples of the assembler that would be produced in each case, but I've given up trying to come up with meaningful examples and compare them. I suggest doing the investigations yourself:

Using GCC to produce readable assembly?

You cannot determine why one is faster without actually looking at and understanding the generated assembly.

I recommend Programming from the Ground Up if you need an assembly primer.

Also:

Quote:

Outsmarting a production quality compiler these days is a nearly impossible task. A programmer should not even try to optimize the code that is not proven to be a bottleneck by carefully profiling the whole program. If a piece of code is proven to be slow and there is an obvious optimization that compiler has failed to perform, shop for a better compiler. Start by optimizing the logic and not the code — doing less steps where possible, avoiding chaotic dynamic memory manipulations, using well designed data structures — will pay off more than any micro-optimization.

Or, more simply:

Quote:

Originally Posted by The Elements of Programming Style (Kernighan & Plauger):

Don't "diddle" code to make it faster ... find a better algorithm.