[SOLVED] GCC and -march=native optimization

towheedm · 12-11-2017, 06:49 PM

After reading a lot about the optimization gains from GCC's -march=native option, I decided to give it a try.

I'm on Debian Stretch and decided to re-build some of the packages as a test bed. The packages built, installed and ran without a hitch.

However, after some further reading I came across what I think might be a conflict between the flags set by -march=native and the actual flags reported for my particular CPU.

From GCC's manual 3.18.55 x86 options:

Quote:

-march=cpu-type
Generate instructions for the machine type cpu-type. In contrast to -mtune=cpu-type, which merely tunes the generated code for the specified cpu-type, -march=cpu-type allows GCC to generate code that may not run at all on processors other than the one indicated. Specifying -march=cpu-type implies -mtune=cpu-type.

And from the gcc command:

Code:

$ gcc -march=native -E -v - </dev/null 2>&1 | grep cc1

 /usr/lib/gcc/x86_64-linux-gnu/6/cc1 -E -quiet -v -imultiarch x86_64-linux-gnu - -march=haswell -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-clwb -mno-mwaitx -mno-clzero -mno-pku --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=15360 -mtune=haswell
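The cc1 line above is easier to read once it is split into the switches GCC turned on and the ones it turned off. A minimal sketch, assuming the cc1 line arrives on stdin; the helper name split_isa_flags is made up for illustration and is not part of GCC:

```shell
# Hypothetical helper: read a cc1 command line on stdin and list the
# -march/-mtune choice (=) plus every -m switch, marked on (+) or off (-).
split_isa_flags() {
    tr ' ' '\n' | awk '
        /^-march=/ || /^-mtune=/ { print "= " substr($0, 2); next }
        /^-mno-/                 { print "- " substr($0, 6); next }
        /^-m/                    { print "+ " substr($0, 3); next }
    '
}

# Demo on a shortened copy of the output quoted above.
echo "cc1 -E -quiet - -march=haswell -msse3 -mssse3 -mno-sse4a -mavx2 -mtune=haswell" |
    split_isa_flags
```

On a real machine the same helper could be appended to the original pipeline: gcc -march=native -E -v - </dev/null 2>&1 | grep cc1 | split_isa_flags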

And the flags for my processor, an Intel Core i7 5820K:

Code:

$ grep -m1 ^flag /proc/cpuinfo | tr ' ' '\n' | sort
abm
acpi
aes
aperfmperf
apic
arat
arch_perfmon
avx
avx2
bmi1
bmi2
bts
clflush
cmov
constant_tsc
cqm
cqm_llc
cqm_occup_llc
cx16
cx8
dca
de
ds_cpl
dtes64
dtherm
dts
eagerfpu
epb
ept
erms
est
f16c
flags           :
flexpriority
fma
fpu
fsgsbase
fxsr
ht
ida
invpcid
lahf_lm
lm
mca
mce
mmx
monitor
movbe
msr
mtrr
nonstop_tsc
nopl
nx
pae
pat
pbe
pcid
pclmulqdq
pdcm
pdpe1gb
pebs
pge
pln
pni
popcnt
pse
pse36
pts
rdrand
rdtscp
rep_good
sdbg
sep
smep
ss
sse
sse2
sse4_1
sse4_2
ssse3
syscall
tm
tm2
tpr_shadow
tsc
tsc_adjust
tsc_deadline_timer
vme
vmx
vnmi
vpid
x2apic
xsave
xsaveopt
xtopology
xtpr

There is no SSE3 flag for my processor, which I believe is Haswell-E. However, for -march=native, GCC sets the processor to Haswell and turns on the SSE3 flag.

My question is, should I also pass the '-mno-sse3' option to disable the SSE3 instruction set? Further down the same GCC manual page says:

Quote:

GCC depresses SSEx instructions when -mavx is used. Instead, it generates new AVX instructions or AVX equivalence for all SSEx instructions when needed.

Not sure what that means.

So should I use '-march=native -mno-sse3' or just '-march=native'?
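One way to see what a given set of options really enables is to dump GCC's predefined macros (gcc -dM -E) and look for the instruction-set ones such as __SSE3__ or __AVX__. A minimal sketch, assuming the macro dump is piped in on stdin; the helper name isa_macros is hypothetical, and the demo input is a hand-written sample, not real compiler output:

```shell
# Hypothetical helper: filter a "gcc -dM -E" macro dump (on stdin) down to
# the MMX/SSE/AVX feature macros and print them in lowercase, sorted.
isa_macros() {
    grep -E '^#define __(MMX|SSE[0-9_]*|SSSE3|AVX2?)__ ' |
        awk '{ gsub(/^__|__$/, "", $2); print tolower($2) }' | sort
}

# Demo on sample input (illustrative lines only).
printf '%s\n' '#define __SSE2__ 1' '#define __SSE3__ 1' \
    '#define __AVX__ 1' '#define __SSE_MATH__ 1' | isa_macros
```

To compare the two candidates, run gcc -march=native -dM -E - </dev/null | isa_macros and then the same command with -mno-sse3 added. My understanding is that turning SSE3 off also turns off the extensions that build on it (SSSE3, SSE4.x, AVX and so on), but the comparison is the way to confirm that on a particular GCC version.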

Any help in understanding this further is greatly appreciated.

Thanks.

sundialsvcs · 12-12-2017, 08:34 AM

When you specify -march=native, you are asking gcc to auto-detect the architecture of the build processor and to adapt its behavior to produce code that is optimized for the machine upon which the compiler is now running. I would therefore anticipate that it can see whether this-or-that feature exists, without further clues from you.

However, if you know that a particular feature isn't there, I would think that there is no harm in being more specific if you want to. If you know that SSE3 really isn't there, and fear that the compiler might think that it is (which would surprise me ... those guys are good at what they do), then you could certainly specify that, and see what happens.

pan64 · 12-12-2017, 11:04 AM

Haswell is about 4 years old. If there had been an error like that, it would already have been detected/reported and also fixed.
I think using a non-existent instruction (set) will lead to SIGILL.

sundialsvcs · 12-12-2017, 11:21 AM

Quote:

Originally Posted by pan64

Haswell is about 4 years old. If there had been an error like that, it would already have been detected/reported and also fixed.

I quite-frankly agree. Maybe the man-page is what is out of date on this very-small point. If you ask gcc to "adapt itself to the architecture of the host upon which it now finds itself," I'll betcha that it will do so correctly – no matter what the man-page says.

Also – "now that even 'run-of-the-mill' microprocessors are running billions(!) of ops-per-second," how much does any of this really matter anymore?

Can anyone, today, actually "hear you scream?"

_roman_ · 12-12-2017, 11:49 AM

Hello from a gentoo user.

First I want to say that this topic has been discussed a lot on forums.gentoo.org.

Conclusion for myself.

Quote:

ASUS-G75VW roman # cat /proc/cpuinfo

It shows different names for what we might call CPU features than the gcc compiler does.

Also bear in mind that the gcc manual is quite huge.

The Gentoo forum has some commands that show what gcc does with the -march=native subset.

In the topics I have read, the result is quite similar or the same.

--

For further reading, see the Gentoo documentation (wiki, handbook) about
/etc/make.conf
It covers the gcc compiler settings.

Also read about Gentoo "ricer" flags.

--

Quite important for, let's say, a kernel compile: the job option of make

and the O feature of gcc => also Gentoo handbook / wiki => O2, O3, and so on ... => I think it is explained there much better than I could.

_roman_ · 12-12-2017, 11:54 AM

Quote:

Originally Posted by sundialsvcs

I quite-frankly agree. Maybe the man-page is what is out of date on this very-small point. If you ask gcc to "adapt itself to the architecture of the host upon which it now finds itself," I'll betcha that it will do so correctly – no matter what the man-page says.

Also – "now that even 'run-of-the-mill' microprocessors are running billions(!) of ops-per-second," how much does any of this really matter anymore?

Can anyone, today, actually "hear you scream?"

Sorry I disagree.

Gentoo ~amd64.

The most time consuming package you can compile on gentoo linux is libreoffice.

I changed a specific config for my kernel from generic amd64 to my ivybridge architecture.

I am not quite sure, but I think I calculated, two years ago, something around a 3-5 percent speed improvement, only by building the kernel for my Ivy Bridge CPU subset instead of the generic Intel subset.
I compared several runs before this change, using the Gentoo bash command splat,
with at least three runs after this optimization.
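For anyone who wants to repeat this kind of before/after comparison without Gentoo-specific tools, here is a minimal sketch that times several runs of the same command and averages them. The helper name time_runs is made up for illustration, and it assumes a date command that supports +%s:

```shell
# Hypothetical helper: run a command N times, print the wall-clock seconds
# of each run and the (integer, truncated) mean. One-second resolution is
# coarse, but adequate for builds that take minutes or hours.
time_runs() {
    n=$1; shift
    total=0; i=1
    while [ "$i" -le "$n" ]; do
        start=$(date +%s)
        "$@" >/dev/null 2>&1
        end=$(date +%s)
        echo "run $i: $((end - start))s"
        total=$((total + end - start))
        i=$((i + 1))
    done
    echo "mean: $((total / n))s"
}

# Demo with a stand-in command; replace it with the real build command.
time_runs 2 sleep 1
```

Running it once with the generic build and once with the CPU-specific build, on an otherwise idle machine, gives two means to compare.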

--

The binary distros are the worst in some regards, because they are not optimized for your architecture.

I think building a package is a proper benchmark. Building the biggest package as usual before and after certainly is.

--

My improvement is just with generic safe cflags. No ricer flags. So as much freedom as possible to gcc. No unrolling of loops or other stuff

--

Well, does it matter? The computer runs for less time at highest performance and therefore consumes less power. So on a box with 1200 packages to compile, it saves a lot of money over time.

_roman_ · 12-12-2017, 12:09 PM

Quote:

Originally Posted by pan64

Haswell is about 4 years old. If there had been an error like that, it would already have been detected/reported and also fixed.
I think using a non-existent instruction (set) will lead to SIGILL.

Lol nope.

Some bugs are discovered quite late. Look at the Intel Management Engine, for example. Dirty COW and others.

I assume you were talking about software and hardware as one piece. One cannot exist without the other.

Also, do not forget those Intel NAS CPUs, which were recently found to be flawed.

Some Intel CPUs do not age very well.

Also, do not forget the SATA bug on Intel platforms. In my view, Intel does not test its CPUs very well. They are just overpriced for their low quality of service. See AMD Ryzen and how Intel was suddenly able to lower its prices.

Emerson · 12-12-2017, 12:44 PM

Yes your CPU can do SSE3, all instructions of SSE3 are included in SSSE3. So applications looking for SSE3 instructions will find them, this is why -march=native includes it.

_roman_ · 12-12-2017, 01:43 PM

Quote:

Originally Posted by Emerson

Yes your CPU can do SSE3, all instructions of SSE3 are included in SSSE3. So applications looking for SSE3 instructions will find them, this is why -march=native includes it.

No and No and No

What -march=native does is explained in great detail on forums.gentoo.org. There are at least a hundred topics covering what -march=native does and how to check what gcc does with different settings. In the past, even the Gentoo wiki had an article about it.

https://unix.stackexchange.com/quest...port-from-bash

As you can see, pni is used in /proc/cpuinfo, but it also means SSE3. As I already said before, some flags are named differently.

Quote:

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz
stepping : 9
microcode : 0x1c
cpu MHz : 2294.877
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts

The gcc naming is different. I always check the docs.
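The naming mismatch can be made concrete with a small translation step. A minimal sketch, assuming flag names arrive on stdin separated by spaces; the helper name cpuinfo_to_gcc is hypothetical, and the list covers only a handful of renames I am sure of (pni is the one that matters for this thread), so it is not a complete table:

```shell
# Hypothetical helper: translate /proc/cpuinfo flag names (on stdin) into
# the names GCC uses in its -m switches. Only a few well-known renames are
# listed; every other name passes through unchanged.
cpuinfo_to_gcc() {
    tr -s ' \t' '\n\n' | sed \
        -e '/^$/d' \
        -e 's/^pni$/sse3/' \
        -e 's/^sse4_1$/sse4.1/' \
        -e 's/^sse4_2$/sse4.2/' \
        -e 's/^pclmulqdq$/pclmul/' \
        -e 's/^rdrand$/rdrnd/' \
        -e 's/^bmi1$/bmi/' \
        -e 's/^lahf_lm$/sahf/'
}

# Demo on a few of the names from the listings in this thread.
echo "pni ssse3 sse4_1 sse4_2 pclmulqdq rdrand bmi1 avx2" | cpuinfo_to_gcc
```

On a real system the input could come from: grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2 | cpuinfo_to_gcc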

Emerson · 12-12-2017, 03:06 PM

So you are saying OP's CPU cannot do SSE3, SSE3 instructions are not included in SSSE3 and -march=native does not include SSE3.

No comments.

ntubski · 12-12-2017, 06:51 PM

Quote:

Originally Posted by Emerson

So you are saying OP's CPU cannot do SSE3,

He is saying OP's CPU can do sse3, because pni includes (or is an alternate name for?) sse3.

Quote:

SSE3 instructions are not included in SSSE3

Yes, he seems to be saying that.

I don't see any CPUs listed in https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html which have SSSE3 but not SSE3, so the point seems rather philosophical.

_roman_ · 12-12-2017, 07:17 PM

quote: https://en.wikipedia.org/wiki/SSE3

SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions (PNI), ... bla bla bla

From a gentoo user perspective.

-march=native usually does a better job than a user-chosen architecture, which was wrongly chosen most of the time. The Gentoo wiki had pages and pages on this. In the end, before -march=native was introduced and widely used, the Gentoo wiki had a list of every common processor with the -march settings to use.

e.g. 3610 QM => ....

--

I can hardly remember any case on the common "amd64" platform, which of course also includes Intel processors these days, where -march=native was wrong.

As said, check how those flags are reported and check the gcc manual. It's a bit confusing because they are named differently.

Not sure when Gentoo introduced the -march=native thing. Also, in the past, every new gcc sometimes introduced something fresh to choose from.

The only place where I have chosen Ivy Bridge explicitly is

Quote:

ASUS-G75VW roman # zgrep IVY /proc/config.gz
CONFIG_MIVYBRIDGE=y

Anything else I use march native.

--

careful when you crosscompile

_roman_ · 12-12-2017, 07:31 PM

Quote:

Originally Posted by Emerson

So you are saying OP's CPU cannot do SSE3, SSE3 instructions are not included in SSSE3 and -march=native does not include SSE3.

No comments.

I said

Check the GCC manual

On the left is always the corresponding flag; I last checked two years ago.

Check alternative names.

A computer does not care for instructions which are a subset + addition of other instructions. That is a technical rant basically.

A computer just checks: is the instruction there? Yes, use it. No, do not use it. => /proc/cpuinfo

You may be right if those instructions are a subset or not, but it does not matter.

What matters is what I have now said explicitly with that Wikipedia link. It seems I failed to explain it clearly before.

I remember instantly knowing for what to look for with the gcc manual and the gentoo wiki regarding march settings.

The gcc output is a bit different and not so obvious, in my personal opinion. It helps to read up on how to interpret it and what it means, because in my view they chose the wording quite badly.

To rephrase: I expected that you would also say that PNI is the corresponding flag, instead of arguing about yet another "not checked yet" instruction subset.

I think the question was, or at least I understood it that way, "Why is there no xxxx flag?", which was obvious to me as a long-term Gentoo user. It is just named differently. Your technicalities are nice, but you miss the point: the flag is there, just named differently.

---

You may be right with your statement that this instruction is a subset of another instruction, but that's just knowledge which is nice to have.
basically generic speaking
check gcc manual
check how else it may be named
check what gcc output is, and check how to read that output.

Most of the time I look up Gentoo USE flags: what a flag really means, what it does. Same with kernel settings. They are vaguely described.

--

When you use gentoo you see a lot of those "funny text rolling" = compiler + linker output

at the beginning you always see

has x
has y
has z

the computer does not care for the is subset of. it checks for mmx, yes it has it, i use it, no it does not have it, do something else.

Like in real life

There are several words for the same thing, just labeled differently: two countries use a similar language but have two different words for the same stuff. One calls it pni, the other calls it SSE3.

towheedm · 12-13-2017, 07:07 PM

I'm not a programmer, so the nitty gritty of gcc is all Greek to me. But, while I may not understand the inner workings, I can certainly appreciate what certain options such as -funroll-loops and something-inline might do for optimizing.

Then it's possible (repeat possible) that pni is just another name for SSE3. I will do some more reading.

And @roman keeps pointing to the gentoo forums/wikis. The gcc command I posted was from the gentoo forums.

Anyone care to explain the last part of my initial post:

Quote:

GCC depresses SSEx instructions when -mavx is used. Instead, it generates new AVX instructions or AVX equivalence for all SSEx instructions when needed.

To me it says that since it is setting -mavx, it will replace all SSEx instructions with their AVX equivalents.
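That reading can be checked against the generated assembly: with -mavx, GCC emits the v-prefixed (VEX-encoded) forms such as vaddps where it would otherwise emit addps. A minimal sketch, assuming AT&T-syntax assembly (the default output of gcc -S) arrives on stdin; the helper name count_encodings is made up for illustration, and the demo input is two hand-written sample lines:

```shell
# Hypothetical helper: read AT&T-syntax assembly on stdin and count the
# instructions that touch xmm/ymm registers, split by whether the mnemonic
# has the "v" prefix used by the AVX forms.
count_encodings() {
    awk '
        /%[xy]mm[0-9]/ {
            if ($1 ~ /^v/) vex++; else legacy++
        }
        END { printf "legacy SSE: %d, AVX (v-prefixed): %d\n", legacy, vex }
    '
}

# Demo on two sample instructions, one of each kind.
printf '\taddps\t%%xmm1, %%xmm0\n\tvaddps\t%%xmm1, %%xmm0, %%xmm0\n' |
    count_encodings
```

With any C file that does floating-point arithmetic (file.c here is a placeholder), comparing gcc -O2 -msse2 -S -o - file.c | count_encodings against the same command with -mavx should show the legacy count drop and the v-prefixed count rise.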

Thanks for the replies.

sundialsvcs · 12-15-2017, 08:30 AM

Quote:

Originally Posted by _roman_

Sorry I disagree.

Gentoo ~amd64.

The most time consuming package you can compile on gentoo linux is libreoffice.

I changed a specific config for my kernel from generic amd64 to my ivybridge architecture.

I am not quite sure, but I think I calculated, two years ago, something around a 3-5 percent speed improvement, only by building the kernel for my Ivy Bridge CPU subset instead of the generic Intel subset.
I compared several runs before this change, using the Gentoo bash command splat,
with at least three runs after this optimization.

"Generic" is a least-common denominator setting which of course is commonly used by distros who don't want to find their software issuing any instructions that someone's chip does not have. It would be interesting to know (as if you want to re-compile LibreOffice again ... ...) whether "native" would have performed (nearly) as well as "ivybridge" in your case.

I've also used Gentoo for many years – including the old days – and I literally found that the mere fact that the software is being compiled-from-source at all(!) seemed to be the thing that made the most difference. I'd been running Red Hat on the box (for the "free" year which they allowed you, at that time, before they wanted you to start paying), and it was a really tiny box that had originally been sold with Windows 95. Simply by installing Gentoo and letting it do the compile-from-source thing, it was very easy to see that the software was smaller, and ran appreciably faster than before. (In fact, this little box, which I used for many years, was positively quick. "From power-up to ready-to-go in six seconds flat," for instance.)

"Standard Distros," for obvious reasons, purposely don't build for speed nor for small-size. They build for universality: they want to be sure that their binaries will run on everything.