LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
Linux - Newbie: This Linux forum is for members that are new to Linux. Just starting out and have a question? If it is not in the man pages or the how-tos, this is the place!
Old 01-28-2011, 08:50 AM   #1
coexistance
Member
 
Registered: Dec 2010
Location: Earth planet
Distribution: Debian GNU/Linux 6 (Squeeze) - AMD64
Posts: 50

Rep: Reputation: 9
Question: What can I do to optimize compilation for an Intel Pentium Dual-Core T3400 (mobile)?


Hello!

Preface

I'm starting to compile software from source (including code I write myself).
I already know how to compile software,
but I understand nothing about optimizations,
so I need some help with it.
I use Debian GNU/Linux Lenny and Linux Mint 10.
I'm using Linux Mint because it has a more modern GCC compiler.
Host system

I'm currently using an Acer Aspire 5737Z.

The host distro is Linux Mint 10 - x86_64 edition.
(Linux kernel 2.6.35, GCC 4.4.5)


The processor specifications:
The CPU is a "Merom-2M" (65 nm) Intel Pentium Dual-Core mobile T3400 (2.17 GHz), with 1 MB of L2 cache and a 667 MHz FSB, with support for MMX, SSE, SSE2, SSE3, SSSE3, Enhanced Intel SpeedStep Technology (EIST), Intel 64, and the XD bit (an NX bit implementation).

More at:
From Intel

From Wikipedia
Objective
For some software I would like to make the maximum performance optimizations possible for my processor (computer).

For other software, resource-saving optimizations (because of the battery).
Note
I'm compiling mostly C and C++ code in the GNOME/GTK+ environment.
The size of the code ranges from a few bytes to a maximum of around 50 MB.
My compiler settings:
Code:
$ gcc -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.4.4-14ubuntu5' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.4 --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --disable-werror --with-arch-32=i686 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5)
Thanks in advance!
Cheers!

Last edited by coexistance; 01-28-2011 at 11:33 AM. Reason: minor error pointing the cache
 
Old 01-28-2011, 09:18 AM   #2
johnsfine
Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,075

Rep: Reputation: 1109
Quote:
Originally Posted by coexistance View Post
For some software I would like to make the maximum performance optimizations possible for my processor (computer).
For x86_64 architecture, processor specific optimizations are currently nearly useless. General optimization is useful. Optimizing differently for the specific CPU model generates no significant improvement.

If you want better optimized code, start with full optimization but with debugging symbols, then run the program under a tool, such as oprofile, that does low level, non intrusive, sampling of where the time is spent. Then hand optimize the critical functions.

Note that if you want to improve the algorithm, rather than the code, you may be better off with a more traditional profiler that measures more intrusively (so much less accurately) but effectively measures the whole call tree, so you see which functions call the frequently used functions, rather than just a one dimensional view of where the time is used.

When "hand optimizing" key routines, an important part of the task is providing optimizer hints. So the code might be unchanged, but the compiler optimizer is told more about the code so it can do a better job. Read about the "restrict" attribute on pointers for the best example of that kind of info (in the source code for benefit of the optimizer).

For x86_64 specific hand optimizations (beyond ordinary optimization that you would do for any architecture), one very important area is index variables. Consider very common code looking like
Code:
for (datatype N = 0; N < S; N++) {
   ...
   ... X[N] ... X[N+1] ...
   ...
}
Assume N holds an unsigned value that fits in 31 bits. That is very common. What datatype should N be?

In 32 bit x86, N could be int or unsigned int or long or std::size_t, interchangeable with zero impact on performance (since the value fits in 31 bits unsigned, there is also zero impact on correctness). Most programmers use int, because they do that by default in situations where any one of those four types would give equally correct results.

In x86_64, int is usually the worst choice for performance in that situation and unsigned int is usually the best. The difference is usually tiny, so profile first and think about this issue only for the most time consuming loops.

The loop control itself is equally good with int or unsigned int, and only a little worse with long or std::size_t.

The X[N] expression depends on context and on the nature of X. But int (for N) is almost always a little worse than any of the other choices. Unsigned int is typically a little better than long, but rarely may be a little worse. std::size_t has the same performance as long.

The X[N+1] expression, if you don't tweak it, is usually better with long or std::size_t than with int or unsigned int, because (X+1)[N] is usually more efficient than X[N+1]. In this situation, the programmer knows (X+1)[N] has the same meaning as X[N+1], but the compiler only knows that if N is 64 bit.

So I often write X[N+1L] so the compiler may optimize it to (X+1)[N] so that unsigned int N will be more efficient than long or std::size_t. You can trust the compiler to know whether (X+1)[N] is better or worse than X[N+1L] and to switch between them as appropriate. But when N is 32 bit, the compiler can't know that (X+1)[N] is the same as X[N+1] so it can't switch.

The 1 is just an example. This logic applies whenever the added or subtracted value is a compile time constant. But if that value is a run time variable, it is usually best for that variable and N both to be unsigned int.
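The indexing advice above can be sketched as follows (the function and names here are illustrative, not from the thread): N is unsigned int for cheap loop control on x86_64, and the constant offset is written 1L so the compiler sees a 64-bit addition it may rewrite as (X + 1)[N].

```cpp
#include <cassert>

// Hypothetical sketch of the advice above: unsigned int loop control,
// with a long (64-bit) constant offset so the compiler is free to
// transform X[N + 1L] into (X + 1)[N] when that form is cheaper.
long pair_sum(const long *X, unsigned int S) {
    long total = 0;
    for (unsigned int N = 0; N < S; N++) {
        total += X[N] + X[N + 1L];  // reads X[0..S], so X needs S+1 elements
    }
    return total;
}
```

Whether this matters at all should be confirmed by profiling first, as the post says; the difference is usually tiny.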

Last edited by johnsfine; 01-28-2011 at 09:38 AM.
 
1 members found this post helpful.
Old 01-28-2011, 09:54 AM   #3
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
Use these CFLAGS:

Code:
-march=native -O2 -pipe -fPIC
If you want low CPU usage, use the JFS filesystem plus the deadline I/O scheduler.

Last edited by H_TeXMeX_H; 01-28-2011 at 09:55 AM.
 
1 members found this post helpful.
Old 01-28-2011, 12:19 PM   #4
coexistance
Member
 
Registered: Dec 2010
Location: Earth planet
Distribution: Debian GNU/Linux 6 (Squeeze) - AMD64
Posts: 50

Original Poster
Rep: Reputation: 9
Thumbs up thanks!

Quote:
Originally Posted by johnsfine View Post
For x86_64 architecture, processor specific optimizations are currently nearly useless.
Thanks johnsfine, but I have a question:
Does that mean that if I use the compiler from my 32-bit-only Debian Lenny (GCC 4.3) instead of the one from Linux Mint,
I could get CPU-specific performance boosts?

Also, thanks for pointing out the oprofile tool; it is very useful.

Quote:
Originally Posted by H_TeXMeX_H View Post
If you want low CPU usage, use JFS filesystem, plus deadline I/O scheduler.
Thanks H_TeXMex_H, but...
Can I just create and mount
a new JFS partition and do my compilations there,
or do I have to format the entire disk (and install the system again) to get the benefits of the journaled filesystem?

Asking because I can make a whole-disk backup easily... so formatting won't be much of a deal.

Thanks for the posts, they were really helpful!
cheers!
 
Old 01-28-2011, 01:09 PM   #5
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
You would have to run the system with JFS as the filesystem, i.e. back up and reinstall with JFS.

I found out about JFS when I wanted something to run on my laptop; I wanted performance but low CPU usage... and this fit. Also, I've never had any problems with it.
 
Old 01-28-2011, 02:33 PM   #6
johnsfine
Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,075

Rep: Reputation: 1109
Quote:
Originally Posted by coexistance View Post
Does that mean that if I use the compiler from my 32-bit-only Debian Lenny (GCC 4.3) instead of the one from Linux Mint,
I could get CPU-specific performance boosts?
That certainly isn't what I meant. But I also don't really know what your question means.

I assume the 32-bit compiler you mean can only build 32-bit code. 32-bit code often runs a tiny bit faster than the same source compiled for 64-bit, so maybe you want to compare a 32-bit compile vs. a 64-bit one. Both versions should be runnable on a 64-bit OS (you might need to install some extra 32-bit .so files).

If you want a 32 bit compile on a 64 bit system, you should use -m32 on a 64 bit compiler, rather than using a 32 bit compiler. (A 32 bit compiler could work, but has extra issues for no extra benefit).

When you use -m32 or a 32 bit compiler, you should specify the CPU model, because 32 bit x86 defaults to supporting a wide range of older models. The constraint of supporting older models will reduce the performance on the current model.

So specifying the model for 32-bit makes a bigger difference because it drops support for some older models, not because it has great insight into the specified model. Specifying the model within x86_64 does little because there is no old x86_64 model lame enough to be worth dropping support for.

In general, I think 32 bit x86 GCC is pretty lame. I would not focus on getting best results from it. If your program happens to run a little faster in 32 bit than in 64 bit (as many do), I still would not push forward with 32 bit. I would just investigate the reasons (usually cache misses) that make 64 bit slower and fix things (maybe pool allocation of certain data structures) to take away the 64 bit disadvantage.

Quote:
Originally Posted by H_TeXMeX_H View Post
Use these CFLAGS:

Code:
-march=native -O2 -pipe -fPIC
If you want low CPU usage, use JFS filesystem, plus deadline I/O scheduler.
I don't know what -pipe does (will look up later). -fPIC makes the code slower. Don't use it if you don't need it. -O2 often generates better code than -O3, but it is worth trying both and seeing what results you get. Other specific optimization switches might help. But hand optimizing the key code usually does a lot more.

I'm assuming the program to be optimized uses a lot of user mode CPU time (otherwise the compiler oriented optimization question makes no sense). So the JFS etc. portion of the answer makes no sense. At best that might optimize the non user portion of CPU time (for which compiler oriented optimization questions wouldn't be asked).

Last edited by johnsfine; 01-28-2011 at 02:39 PM.
 
1 members found this post helpful.
Old 01-28-2011, 02:45 PM   #7
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
It's better to use PIC for x86_64. I don't use -O3 because it sometimes produces unstable code (quite often from what I've seen, and it's not much faster).

For JFS I was responding to:

Quote:
For other software, resource-saving optimizations (because of the battery).
I would think JFS qualifies; it uses less CPU than any other filesystem I've tried.
 
1 members found this post helpful.
Old 01-28-2011, 07:30 PM   #8
coexistance
Member
 
Registered: Dec 2010
Location: Earth planet
Distribution: Debian GNU/Linux 6 (Squeeze) - AMD64
Posts: 50

Original Poster
Rep: Reputation: 9
Thumbs up okay, thanks

Alright, thanks johnsfine and H_TeXMeX_H

Conclusion
I will use the following settings then:
Code:
-march=native -O2 -pipe
I will also format and create a JFS filesystem, with the deadline I/O scheduler.

From what I've understood, "hand" optimizations are more reliable than compiler ones,
and the processor-specific invocations are mostly useless.

Most of the code is loops with pointers... (sadly I'm not very good at handling them right now)

I'll try to improve my skills, thanks for the posts!
Cheers!

Last edited by coexistance; 01-29-2011 at 06:17 AM.
 
Old 01-29-2011, 07:42 AM   #9
johnsfine
Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,075

Rep: Reputation: 1109
Quote:
Originally Posted by coexistance View Post
From what I've understood, "hand" optimizations are more reliable than compiler ones.
Only after you use oprofile or similar to find the hot spots.

Quote:

Most of the code are loops with pointers... (sadly I'm not very good handling it right now)
For pointers in loops, the restrict feature is often very powerful at getting the compiler to see optimizations it wouldn't ordinarily see.

Here is a link to documentation of the syntax for restrict. You need to look elsewhere (maybe C99 documentation) to get a detailed understanding of the meaning:

http://gcc.gnu.org/onlinedocs/gcc/Re...-Pointers.html

Roughly: a restricted pointer is a promise by the programmer that any object read or written through that pointer is not read or written in the same section of code in any other way (directly or via another pointer).

If you wrote *p=*q; *p+=*r; the optimizer normally could not change that to the more efficient *p=*q+*r; because it must allow for the possibility that p and r point to the same object. Restricting either p or r ought to make the compiler able to optimize that code. (In my experience, you often need to restrict both p and r to get the compiler to see it).

I used an example in which simply writing the code in the more obvious way in the first place would have made the optimization unnecessary, because only that kind of example is simple enough to highlight just the action of the restrict. There are plenty of more common, slightly more complicated, cases in which restrict lets the compiler see a less obvious optimization.
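That aliasing example can be sketched as follows (a minimal illustration, using g++'s __restrict__ spelling since plain restrict is C99, not C++; the function name is made up for the sketch):

```cpp
#include <cassert>

// Sketch of the aliasing example above. Marking the pointers
// __restrict__ promises the compiler that the object behind p is not
// also reached through q or r in this function, so the optimizer may
// fold  *p = *q; *p += *r;  into a single store of *q + *r.
void combine(int *__restrict__ p, const int *__restrict__ q,
             const int *__restrict__ r) {
    *p = *q;
    *p += *r;
}
```

Calling it with three distinct objects keeps the restrict promise; calling it with p aliasing q or r would be undefined behavior.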

Last edited by johnsfine; 01-29-2011 at 07:55 AM.
 
1 members found this post helpful.
Old 01-30-2011, 03:23 PM   #10
coexistance
Member
 
Registered: Dec 2010
Location: Earth planet
Distribution: Debian GNU/Linux 6 (Squeeze) - AMD64
Posts: 50

Original Poster
Rep: Reputation: 9
Okay, thank you johnsfine and H_TeXMeX_H for the help.

I will try to start using the restrict keyword in "C" code from now on; thanks for the tip.
From what I understood, it only works in C code, because restrict is a keyword from the C language (C99) only.

I will also read the GCC manual fully, as it might give me some new ideas for writing better code.
(Sorry johnsfine, I probably made you look for the instruction when I could have found it myself.)

Also, thanks for the explanation; from what I've understood, the "restrict" keyword says something like this:
So you want to restrict these variables, then I'll save and lock them so that I can make use of them...

Well, thanks again guys.
Cheers!

Last edited by coexistance; 01-30-2011 at 03:24 PM. Reason: minor word replace
 
Old 01-30-2011, 03:42 PM   #11
johnsfine
Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,075

Rep: Reputation: 1109
Quote:
Originally Posted by coexistance View Post
I will try to start using the restrict keyword in "C" code from now on; thanks for the tip.
From what I understood, it only works in C code, because restrict is a keyword from the C language (C99) only.
But G++ (Gnu C++) supports that C99 feature.

G++ does not support the restrict keyword spelled restrict, but it does support it spelled __restrict__.

The usual (I think best) way to deal with that is to use restrict in your C++ code and have some project-wide .hpp file (included by all .cpp files in your project) that tests which compiler is in use and #defines restrict as either __restrict__ or nothing. Then you get the benefits of the feature when compiling with appropriate compilers (such as g++), but your code loses only optimization, not correctness, when porting to some other compiler.
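That project-wide header idea can be sketched like this (the header name and the dot function are hypothetical, just to show the mechanism):

```cpp
#include <cassert>

// Sketch of a project-wide "restrict.hpp": map the C99 keyword onto
// each compiler's spelling. Under g++ the optimization is kept; under
// a compiler with no equivalent, the macro expands to nothing, so the
// code stays correct and only loses the optimization.
#if defined(__GNUC__)
#  define restrict __restrict__
#else
#  define restrict
#endif

// The macro then lets C-style restrict appear in C++ signatures:
double dot(const double *restrict a, const double *restrict b,
           unsigned int n) {
    double s = 0.0;
    for (unsigned int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```

(In C, restrict is a real keyword, so such a header should only define the macro when compiling as C++ or with a pre-C99 compiler.)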

Quote:
Originally Posted by coexistance View Post
from what I've understood, the "restrict" keyword says something like this:
So you want to restrict these variables, then I'll save and lock them so that I can make use of them...
I have no clue what you mean by that. I hope you understand or figure out what restrict actually means. If you don't know assembler programming and you don't know compiler internals, it may be very hard to get a good idea what use the compiler gets from restrict.

But you don't really need to understand that. Consider restrict only in terms of the promise it implies from the programmer to the compiler (the things accessed here through this pointer are accessed here only through this pointer). The only part of that which may be hard to understand well is the meaning of "here" in that sentence. Almost always, the things accessed here through that pointer will be accessed somewhere else by some other method. That doesn't matter to restrict.

As with almost all hand optimization, it is only worthwhile after a tool such as oprofile identifies the hot spots.

BTW, I also looked up the -pipe option to gcc. It may significantly reduce the time needed for gcc to compile your program, but it has no effect on the generated code. I assume you wanted to make your program execute faster and/or use less battery power; compiling faster is a different topic.

Last edited by johnsfine; 01-30-2011 at 04:00 PM.
 
1 members found this post helpful.
Old 01-30-2011, 09:02 PM   #12
coexistance
Member
 
Registered: Dec 2010
Location: Earth planet
Distribution: Debian GNU/Linux 6 (Squeeze) - AMD64
Posts: 50

Original Poster
Rep: Reputation: 9
Thumbs up thanks for the patience

Thanks for your time and patience, johnsfine.

Quote:
The usual (I think best) way to deal with that is to use in restrict your C++ code and have some project wide .hpp file (included by all .cpp files in your project) that tests which compiler is in use and #defines restrict as either __restrict__ or nothing. Then you get the benefits of the feature when compiling with appropriate compilers (such as g++) but your code loses only optimization, not correctness, when porting to some other compiler.
I will use the restrict and __restrict__ keywords with the preprocessor,
as suggested in your tip (thanks a lot).

Quote:
But you don't really need to understand that. Consider restrict only in terms of the promise it implies from the programmer to the compiler (the things accessed here through this pointer are accessed here only through this pointer). The only part of that which may be hard to understand well is the meaning of "here" in that sentence. Almost always, the things accessed here through that pointer will be accessed somewhere else by some other method. That doesn't matter to restrict.
I see, the restrict keyword is like a feature
that is used by a tool (the compiler in this case),
so that it can be interpreted and used by it.

I must confess I only gave the GCC manual a quick look; as I mentioned, I do not understand very much about optimizations.
Once I have time I will read it all.

Quote:
I have no clue what you mean by that. I hope you understand or figure out what restrict actually means. If you don't know assembler programming and you don't know compiler internals, it may be very hard to get a good idea what use the compiler gets from restrict.
From what I understood, knowing the compiler internals and assembly is very important for better optimizations.
That means I should take into account trying to learn the GCC API and i386 assembly.

I'm probably still very much a newbie at programming too... it's all very recent (I'm retrying).
But I want to give it my best, and learn the most I can.

When I was younger, I used to learn almost a whole language in 3 days or so...
One week later, all the code I'd made just looked like it was from another galaxy.

So now... nice and easy.

I must say that your posts were indeed very helpful;
I've already learned a lot just from reading them (I hope I didn't waste your time).
Thanks very much again!

Cheers!
 
  


Tags
cflags, compilation, optimization