LinuxQuestions.org
Old 02-23-2012, 03:32 AM   #1
maxreason
Member
 
Registered: Dec 2007
Location: phobos, mars
Distribution: 64-bit linux mint v20
Posts: 259

Rep: Reputation: 17
benchmark problems : why do CPU cores run slow?


I'm trying to run benchmarks on some SIMD assembly language routines I wrote recently, and I'm getting strange and inconsistent results.

I'm trying to compare the speed of the following routines:
#1: 4x4 matrix multiply in C compiled as 32-bit mode on winxp64.
#2: 4x4 matrix multiply in C compiled as 64-bit mode on ubuntu64.
#3: 4x4 matrix multiply in SIMD assembly as 32-bit mode on winxp64.
#4: 4x4 matrix multiply in SIMD assembly as 64-bit mode on ubuntu64.

The matrices contain double-precision elements.

Yes, there are 3 different sets of source code - the C version (#1, #2), the 32-bit assembly version (SSE3-level with xmm registers only), and the 64-bit assembly version (AVX+FMA4-level with ymm registers).

The 32-bit assembly is pretty good code, but is more than twice as long as the 64-bit assembly routine since the xmm registers only hold and operate on two f64 values at a time, while the ymm registers hold and operate on four f64 values at a time. Plus there are twice as many SIMD registers in 64-bit mode (16 versus 8). If I do say so myself, the 64-bit assembly code is superb --- a grand total of 50 instructions, no loops, and takes maximum advantage of the latest and greatest AVX and FMA4 instructions on 256-bit ymm registers.

I am executing the rdtsc or rdtscp assembly language instructions to capture 64-bit CPU clock timer values to perform my measurements. Thus my measurements are in CPU cycles. It is my understanding that both of my modern AMD CPUs (4-core on windoze and 8-core on ubuntu) have "constant speed rdtsc clocks" that don't change speed if and when the CPUs are throttled back (to save power or reduce heat). Just for the record, the 4-core windoze CPU is clocking at 3.4GHz and the 8-core ubuntu CPU is clocking at 3.6GHz.

Anyway, here are my problems.

#1: I'm not getting repeatable results on the linux side. For a while each 64-bit assembly language matrix multiply was taking about 180 timer units to execute, and the C version was just barely over 360 timer units. For a while both values dropped down to 100 and 200 timer units, but that didn't last very long. Later on in my testing these values increased to about 400 and 800. Then later on they were 2600 and 1800 (yes, now the C version was faster than the 64-bit assembly language version).

All the while, the windoze machine reliably generated consistent numbers: about 285 and 100 timer units for the C-code and the 32-bit assembly code.

#2: In trying to figure out why my ubuntu64 machine was generating such wildly changing results, I printed out /proc/cpuinfo now and then. When I first turned on the machine, /proc/cpuinfo reported all 8 cores were running at 3600MHz (as they should). Later on /proc/cpuinfo reported all 8 cores were running at 1400MHz. Note that my benchmark application is a single-core application. So even if ubuntu needed to slow down some CPUs to save power and heat generation, it shouldn't slow down the CPU running the code!

#3: When and how does ubuntu64 throttle CPU cores up and down? I mean, if ubuntu64 has the CPU cores throttled down slower because "not much is happening", surely it will press the pedal when some compute-intensive routine is run... right? Or not? Does setting the application priority change how these decisions are made? Why when I'm running a billion matrix-multiplies in a tight timing loop does ubuntu64 /proc/cpuinfo still report all 8 cores at 1400MHz?

#4: Why is my best time on the ubuntu64 machine no faster than the best (and consistent) time on the windoze64 machine, even though the windoze64 machine is executing a 32-bit version of the matrix multiply routine (that has over twice as many instructions in it)?

Anyway, I'm confused. Seems like I'm missing two or three crucial insights to make sense of this.

PS: I pretty much get the same timing per execution of the function whether I execute the function 1 time, 1 thousand times, 1 million times or 1 billion times. So at least that part makes sense.
 
Old 02-23-2012, 10:48 PM   #2
ukiuki
Senior Member
 
Registered: May 2010
Location: Planet Earth
Distribution: Debian
Posts: 1,030

Rep: Reputation: 385
I have seen that slowdown on modern processors with modern motherboards, but I didn't bother with it until now. Your problem makes me ask: why does it slow down? Well, the first thing that comes to mind is that modern motherboards have settings to operate in different states like Performance, Silent, and Normal, so it may be related if the motherboard is set to operate in a power-saving state. Also, new processors have not just the real cores but virtual cores; the CPU is so powerful that it may not even need full speed to process that code (unless your code forces it to full), which is another thing that may be related.
Now, are you 100% sure your code is running on only one core? How do you know it isn't spreading among the other cores?
Have you tried running a virtual machine with only one core assigned to it, and then running the code? At least there you are sure there is only one core.
About the kernel: there is a way to set how the processor behaves, like the motherboard settings.
As you can see here under CPU Frequency Scaling. This may be the case, since Ubuntu uses a generic kernel and probably has all these options enabled.
Quote:
Originally Posted by maxreason View Post
....Anyway, I'm confused. Seems like I'm missing two or three crucial insights to make sense of this.
......
Well, life is usually confusing!! I hope this brings some light to your problem.

Regards
 
Old 02-24-2012, 08:21 PM   #3
maxreason
Member
 
Registered: Dec 2007
Location: phobos, mars
Distribution: 64-bit linux mint v20
Posts: 259

Original Poster
Rep: Reputation: 17
Quote:
Originally Posted by ukiuki View Post
....the 1st thing that come in my mind is the modern motherboards have those settings to operate in different states like: Performance, Silent, Normal, so it may be related if the motherboard is set to operate in a state to save power....
....Now are you 100% sure your code is running only on one core?...
....As you can see here under CPU Frequency Scaling....
Thanks for the reply. I went through the BIOS again very carefully, looking for anything that might explain this. AFAIK, I don't have this system set up to be a notebook computer or anything similar. This is a "performance system" for developing my high-performance 3D simulation/game engine and other CPU-intensive applications.

Now I have a few applications installed on this system (64-bit ubuntu 10.04 LTS) to help me understand what's happening.

One is an applet that displays five separate temperature icons on the gnome panel at the top of my display:

Two icons are labeled GPU-core and GPU-ambient, and they stay close to 38C and 45C.

I assume the other three icons are "southbridge", "northbridge" and "CPU" temperatures.

The first two of those always display about 31C and 35C.

When I run my benchmark, the third icon rises from 32C to about 51C~54C just before the benchmark ends (which takes roughly 60 seconds when I run 1-billion loops instead of the usual 1-million). When the code ends, it drops back to 31C~35C within a few seconds.

While this is running, I also have a "system monitor" window on the desktop that displays "CPU history" individually for all 8 cores, "memory swap history" and "network history". This clearly shows one and only one CPU jump from near 0% to a flat-out 100% for the whole duration of the benchmark. This doesn't surprise me, because the program that's running is all my code, and I don't do any multi-threading in it... yet.

I also have another "CPU frequency monitor" application running, which displays a "current CPU speed" icon on the same gnome panel that displays the various temperatures. It does indeed let me set the CPU speed of each of the 8 cores individually, to:
o 3.60GHz
o 3.30GHz
o 2.70GHz
o 2.10GHz
o 1.40GHz
o conservative
o ondemand
o performance
o powersave


I did not have these applications installed or running when I wrote my original post, so now I can report a bit more information about what's happening, based on observing these monitoring icons and displays while the benchmark is running.

First conclusion: the "CPU speed" application works. When I force all 8 cores to run at 3.60GHz, they do (based upon how long the benchmark takes to execute). When I force all 8 cores to run at 1.40GHz, they do (based on the benchmark taking nearly 3 times longer to run). When I set all 8 cores to "ondemand" and check the various CPU speeds while the benchmark is running, 7 cores stay at 1.40GHz and the core running the application runs at 3.60GHz.

In my opinion, this should be the default operating condition of all non-battery powered PCs. Why not? The only possible proviso that I can imagine is a "quick response scenario" application that may only be triggered now and then, but must execute and respond in the minimum elapsed time possible. In that case the question becomes "how quickly are CPU speeds promoted from minimum-speed to maximum-speed when they start doing their work". If the answer is 1 second, or even 0.01 second, then probably "ondemand" isn't appropriate in those cases. But otherwise...

##########

Given everything learned above, the following is what I have to report.

1: I confirmed that "ondemand" CPU speed takes a while to increase the CPU speed, so it completely screws up benchmarks and any application that needs to respond at full speed (but might sit idle waiting for the condition that requires its quick attention).

----------

2: Here are the speeds of my f64mat4x4 matrix multiply routines on ubuntu64 and winxp64 (based upon a loop of 1-million executions each, then dividing the total number of cycles consumed by 1-million):

Code:
 
 winxp64 C-code:  280 clock cycles
 winxp64  asm32:  100 clock cycles
 winxp64  asm64:  to be determined (when I convert asm64 to masm format)
 
ubuntu64 C-code:  275 clock cycles
ubuntu64  asm32:  100 clock cycles
ubuntu64  asm64:   46 clock cycles
Anyway, THAT makes sense! There should be little difference between ubuntu64 and winxp64, and indeed there isn't. However...

----------

3: Before I run the benchmark with 1-million loops, I measure the cycles to execute each function ONE time only. To eliminate cache issues, I execute the function call 3 times, then capture the timer, then execute the function call 1 time, then capture the timer again, then subtract the two timer values to determine number of cycles to compute the function. Here are the results that do not make sense to me:

Code:
 
 winxp64 C-code:  450 clock cycles
 winxp64  asm32:  345 clock cycles
 winxp64  asm64:  to be determined (when I convert asm64 to masm format)
 
ubuntu64 C-code:  440 clock cycles
ubuntu64  asm32:  250 clock cycles
ubuntu64  asm64:  115 clock cycles
The question is, why does ONE execution of a function take longer than 1-million executions of that same function? I can't see how the cache is involved, because remember, before I measure the time of the single function call, I call that function 3 times. Here is the code, just so you understand what I'm doing.

Code:
 
  error = math_f64mat4x4_equal_f64mat4x4_times_f64mat4x4 (&d0, &s1, &s2); // get code and data into L1 cache
  error = math_f64mat4x4_equal_f64mat4x4_times_f64mat4x4 (&d0, &s1, &s2); // get code and data into L1 cache
  error = math_f64mat4x4_equal_f64mat4x4_times_f64mat4x4 (&d0, &s1, &s2); // get code and data into L1 cache
  timer_get_now (&tfirst);
  error = math_f64mat4x4_equal_f64mat4x4_times_f64mat4x4 (&d0, &s1, &s2);
  timer_get_now (&tfinal);
  tdelta = tfinal - tfirst;
versus...

Code:
 
  error = math_f64mat4x4_equal_f64mat4x4_times_f64mat4x4 (&d0, &s1, &s2); // get code and data into L1 cache
  error = math_f64mat4x4_equal_f64mat4x4_times_f64mat4x4 (&d0, &s1, &s2); // get code and data into L1 cache
  error = math_f64mat4x4_equal_f64mat4x4_times_f64mat4x4 (&d0, &s1, &s2); // get code and data into L1 cache
  timer_get_now (&tfirst);
  for (i = 0; i < 1024; i++) {
    for (j = 0; j < 1024; j++) {
      error = math_f64mat4x4_equal_f64mat4x4_times_f64mat4x4 (&d0, &s1, &s2);
    }
  }
  timer_get_now (&tfinal);
  tdelta = tfinal - tfirst;
  tdelta = tdelta >> 10;  // divide by 1024
  tdelta = tdelta >> 10;  // divide by 1024 again (1024*1024 = 1,048,576 ≈ 1-million iterations)
Why would it take twice as long to execute a function when it is executed just once versus a million times (after cache related issues are removed from the equation)? Note that I have all the CPU cores set to execute at the maximum speed (3.60GHz), so it shouldn't be related to CPU speed setting issues like "ondemand" delay. Does this make sense to anyone?

----------

PS: I'm rather pleased with SIMD performance! Good job, AMD! Pretty impressive to execute matrix multiply on two 4x4 double-precision matrices in 12 nanoseconds, huh? Both these AMD CPUs are executing just under 1 instruction per clock cycle, SIMD instructions included! My 64-bit assembly executes 40 SIMD instructions and 8 regular instructions. Only 8 regular instructions? Yes, since arguments are passed in registers and I don't need to create a stack frame (since this function doesn't need any stack variables or call any other functions), all I need to do is check the 3 input pointer arguments for zero to catch bad input arguments... that plus generate a return value of zero to return, and finally the return instruction itself.

----------

Okay, okay... you want the code... just for educational purposes, I'm sure. :-)

If you want to steal it, you must leave my copyright notice in the 64-bit assembly. If you want the 32-bit equivalent, contact me.

Code:
#
#
# ####################################################################
# #####  math_f64mat4x4_equal_f64mat4x4_times_f64mat4x4_asm64()  #####  this is the same as the above function, except the loop is unrolled --- test to see which is faster
# ####################################################################
#
# NOTE:  the following code formats correctly if TAB characters space to next 4 character boundary
#
# this source code is copyright 2012 by Max Reason
# the MASM format of this source code is also copyright 2012 by Max Reason
# you may insert [and modify] this code in your program BUT you must not alter or remove the above copyright notices
#
# NOTE:  this routine is written for 64-bit linux
# NOTE:  win64 function protocol passes arguments in DIFFERENT registers and specifies different registers be preserved across function calls, including some XMM/YMM registers.
# NOTE:  the gas/ATT opcode register/argument order below is reverse of MASM register/argument order
#        thus "vfmaddpd %ymm12, %ymm8, %ymm4, %ymm0" is equivalent to MASM "vfmaddpd ymm0, ymm4, ymm8, ymm12"
#
#
# #####  IMPORTANT  #####
#
# This function doesn't call any other functions or access any local variables.
# Therefore, we do not need to perform function prolog or epilog code.
#
# arg0 = %rdi : address of f64mat4x4 destination
# arg1 = %rsi : address of f64mat4x4 argument #1
# arg2 = %rdx : address of f64mat4x4 argument #2
#
.text
.align 64
math_f64mat4x4_equal_f64mat4x4_times_f64mat4x4_asm64:
	orq		%rdi, %rdi								# is 1st argument zero (an invalid address)?
	jz		vvm_argument_invalid64					# yup
	orq		%rsi, %rsi								# is 2nd argument zero (an invalid address)?
	jz		vvm_argument_invalid64					# yup
	orq		%rdx, %rdx								# is 3rd argument zero (an invalid address)?
	jz		vvm_argument_invalid64					# yup
#
# zero contents of all SIMD registers
#
	vzeroall										# zero all components of all ymm registers
#
# load entire f64mat4x4 matrix o[] into ymm4 to ymm7 (old matrix == arg1)
#
	vmovapd			0(%rsi),  %ymm4					# ymm4.0123 = o[00] : o[01] : o[02] : o[03]
	vmovapd			32(%rsi), %ymm5					# ymm5.0123 = o[10] : o[11] : o[12] : o[13]
	vmovapd			64(%rsi), %ymm6					# ymm6.0123 = o[20] : o[21] : o[22] : o[23]
	vmovapd			96(%rsi), %ymm7					# ymm7.0123 = o[30] : o[31] : o[32] : o[33]
#
# get 4 vectors with element #0 of each row in f64mat4x4 matrix n[] (new matrix == arg2)
#
	vbroadcastsd	0(%rdx),  %ymm12				# ymm12.0123 = n[00] : n[00] : n[00] : n[00]	# this brings n[00] to n[03] into L1 cache --- and probably also n[10] to n[13]
	vbroadcastsd	32(%rdx), %ymm13				# ymm13.0123 = n[10] : n[10] : n[10] : n[10]	# this brings n[10] to n[13] into L1 cache
	vbroadcastsd	64(%rdx), %ymm14				# ymm14.0123 = n[20] : n[20] : n[20] : n[20]	# this brings n[20] to n[23] into L1 cache --- and probably also n[30] to n[33]
	vbroadcastsd	96(%rdx), %ymm15				# ymm15.0123 = n[30] : n[30] : n[30] : n[30]	# this brings n[30] to n[33] into L1 cache
#
# compute 1st part of f64mat4x4 result (all 16 matrix locations in the following 4 instructions)
#
	vmulpd			%ymm12, %ymm4, %ymm0			# ymm0.0123 = (o[00] * n[00]) : (o[01] * n[00]) : (o[02] * n[00]) : (o[03] * n[00])
	vmulpd			%ymm13, %ymm4, %ymm1			# ymm1.0123 = (o[00] * n[10]) : (o[01] * n[10]) : (o[02] * n[10]) : (o[03] * n[10])
	vmulpd			%ymm14, %ymm4, %ymm2			# ymm2.0123 = (o[00] * n[20]) : (o[01] * n[20]) : (o[02] * n[20]) : (o[03] * n[20])
	vmulpd			%ymm15, %ymm4, %ymm3			# ymm3.0123 = (o[00] * n[30]) : (o[01] * n[30]) : (o[02] * n[30]) : (o[03] * n[30])
#
# get 4 vectors with element #1 of each row in f64mat4x4 matrix n[] (new matrix == arg2)
#
	vbroadcastsd	8(%rdx),  %ymm12				# ymm12.0123 = n[01] : n[01] : n[01] : n[01]
	vbroadcastsd	40(%rdx), %ymm13				# ymm13.0123 = n[11] : n[11] : n[11] : n[11]
	vbroadcastsd	72(%rdx), %ymm14				# ymm14.0123 = n[21] : n[21] : n[21] : n[21]
	vbroadcastsd	104(%rdx), %ymm15				# ymm15.0123 = n[31] : n[31] : n[31] : n[31]
#
# compute 2nd part of f64mat4x4 result AND add into 1st part already in ymm0 to ymm3
#
	vfmaddpd		%ymm0, %ymm12, %ymm5, %ymm0		# ymm0.0123 = (o[00] * n[00]) + (o[10] * n[01]) : (o[01] * n[00]) + (o[11] * n[01]) : (o[02] * n[00]) + (o[12] * n[01]) : (o[03] * n[00]) + (o[13] * n[01])
	vfmaddpd		%ymm1, %ymm13, %ymm5, %ymm1		# ymm1.0123 = (o[00] * n[10]) + (o[10] * n[11]) : (o[01] * n[10]) + (o[11] * n[11]) : (o[02] * n[10]) + (o[12] * n[11]) : (o[03] * n[10]) + (o[13] * n[11])
	vfmaddpd		%ymm2, %ymm14, %ymm5, %ymm2		# ymm2.0123 = (o[00] * n[20]) + (o[10] * n[21]) : (o[01] * n[20]) + (o[11] * n[21]) : (o[02] * n[20]) + (o[12] * n[21]) : (o[03] * n[20]) + (o[13] * n[21])
	vfmaddpd		%ymm3, %ymm15, %ymm5, %ymm3		# ymm3.0123 = (o[00] * n[30]) + (o[10] * n[31]) : (o[01] * n[30]) + (o[11] * n[31]) : (o[02] * n[30]) + (o[12] * n[31]) : (o[03] * n[30]) + (o[13] * n[31])
#
# get 4 vectors with element #2 of each row in f64mat4x4 matrix n[] (new matrix == arg2)
#
	vbroadcastsd	16(%rdx), %ymm12				# ymm12.0123 = n[02] : n[02] : n[02] : n[02]
	vbroadcastsd	48(%rdx), %ymm13				# ymm13.0123 = n[12] : n[12] : n[12] : n[12]
	vbroadcastsd	80(%rdx), %ymm14				# ymm14.0123 = n[22] : n[22] : n[22] : n[22]
	vbroadcastsd	112(%rdx), %ymm15				# ymm15.0123 = n[32] : n[32] : n[32] : n[32]
#
# compute 3rd part of f64mat4x4 result AND add into 1st and 2nd parts already in ymm0 to ymm3
#
	vfmaddpd		%ymm0, %ymm12, %ymm6, %ymm0		# ymm0.0123 = (o[00] * n[00]) + (o[10] * n[01]) + (o[20] * n[02]) : (o[01] * n[00]) + (o[11] * n[01]) + (o[21] * n[02]) : (o[02] * n[00]) + (o[12] * n[01]) + (o[22] * n[02]) : (o[03] * n[00]) + (o[13] * n[01]) + (o[23] * n[02])
	vfmaddpd		%ymm1, %ymm13, %ymm6, %ymm1		# ymm1.0123 = (o[00] * n[10]) + (o[10] * n[11]) + (o[20] * n[12]) : (o[01] * n[10]) + (o[11] * n[11]) + (o[21] * n[12]) : (o[02] * n[10]) + (o[12] * n[11]) + (o[22] * n[12]) : (o[03] * n[10]) + (o[13] * n[11]) + (o[23] * n[12])
	vfmaddpd		%ymm2, %ymm14, %ymm6, %ymm2		# ymm2.0123 = (o[00] * n[20]) + (o[10] * n[21]) + (o[20] * n[22]) : (o[01] * n[20]) + (o[11] * n[21]) + (o[21] * n[22]) : (o[02] * n[20]) + (o[12] * n[21]) + (o[22] * n[22]) : (o[03] * n[20]) + (o[13] * n[21]) + (o[23] * n[22])
	vfmaddpd		%ymm3, %ymm15, %ymm6, %ymm3		# ymm3.0123 = (o[00] * n[30]) + (o[10] * n[31]) + (o[20] * n[32]) : (o[01] * n[30]) + (o[11] * n[31]) + (o[21] * n[32]) : (o[02] * n[30]) + (o[12] * n[31]) + (o[22] * n[32]) : (o[03] * n[30]) + (o[13] * n[31]) + (o[23] * n[32])
#
# get 4 vectors with element #3 of each row in f64mat4x4 matrix n[] (new matrix == arg2)
#
	vbroadcastsd	24(%rdx), %ymm12				# ymm12.0123 = n[03] : n[03] : n[03] : n[03]
	vbroadcastsd	56(%rdx), %ymm13				# ymm13.0123 = n[13] : n[13] : n[13] : n[13]
	vbroadcastsd	88(%rdx), %ymm14				# ymm14.0123 = n[23] : n[23] : n[23] : n[23]
	vbroadcastsd	120(%rdx), %ymm15				# ymm15.0123 = n[33] : n[33] : n[33] : n[33]
#
# compute 4th part of f64mat4x4 result AND add into 1st, 2nd and 3rd parts already in ymm0 to ymm3
#
	vfmaddpd		%ymm0, %ymm12, %ymm7, %ymm0		# ymm0.0123 = (o[00] * n[00]) + (o[10] * n[01]) + (o[20] * n[02]) + (o[30] * n[03]) : (o[01] * n[00]) + (o[11] * n[01]) + (o[21] * n[02]) + (o[31] * n[03]) : (o[02] * n[00]) + (o[12] * n[01]) + (o[22] * n[02]) + (o[32] * n[03]) : (o[03] * n[00]) + (o[13] * n[01]) + (o[23] * n[02]) + (o[33] * n[03])
	vfmaddpd		%ymm1, %ymm13, %ymm7, %ymm1		# ymm1.0123 = (o[00] * n[10]) + (o[10] * n[11]) + (o[20] * n[12]) + (o[30] * n[13]) : (o[01] * n[10]) + (o[11] * n[11]) + (o[21] * n[12]) + (o[31] * n[13]) : (o[02] * n[10]) + (o[12] * n[11]) + (o[22] * n[12]) + (o[32] * n[13]) : (o[03] * n[10]) + (o[13] * n[11]) + (o[23] * n[12]) + (o[33] * n[13])
	vfmaddpd		%ymm2, %ymm14, %ymm7, %ymm2		# ymm2.0123 = (o[00] * n[20]) + (o[10] * n[21]) + (o[20] * n[22]) + (o[30] * n[23]) : (o[01] * n[20]) + (o[11] * n[21]) + (o[21] * n[22]) + (o[31] * n[23]) : (o[02] * n[20]) + (o[12] * n[21]) + (o[22] * n[22]) + (o[32] * n[23]) : (o[03] * n[20]) + (o[13] * n[21]) + (o[23] * n[22]) + (o[33] * n[23])
	vfmaddpd		%ymm3, %ymm15, %ymm7, %ymm3		# ymm3.0123 = (o[00] * n[30]) + (o[10] * n[31]) + (o[20] * n[32]) + (o[30] * n[33]) : (o[01] * n[30]) + (o[11] * n[31]) + (o[21] * n[32]) + (o[31] * n[33]) : (o[02] * n[30]) + (o[12] * n[31]) + (o[22] * n[32]) + (o[32] * n[33]) : (o[03] * n[30]) + (o[13] * n[31]) + (o[23] * n[32]) + (o[33] * n[33])
#
# save 4x4 result matrix to destination
#
	vmovapd			%ymm0, 0(%rdi)					# save ymm0.0123 = r[00] to r[03]
	vmovapd			%ymm1, 32(%rdi)					# save ymm1.0123 = r[10] to r[13]
	vmovapd			%ymm2, 64(%rdi)					# save ymm2.0123 = r[20] to r[23]
	vmovapd			%ymm3, 96(%rdi)					# save ymm3.0123 = r[30] to r[33]

	xorq	%rax, %rax								# rax = 0 == no error

vvm_return64:
	ret
#
# error routines
#
vvm_argument_invalid64:
	xorq	%rax, %rax								# rax = 0
	not		%rax									# rax = -1 == unknown error (argument invalid)
	ret

Last edited by maxreason; 02-26-2012 at 10:52 PM. Reason: add 64-bit assembly language
 
Old 02-26-2012, 03:08 AM   #4
cascade9
Senior Member
 
Registered: Mar 2011
Location: Brisneyland
Distribution: Debian, aptosid
Posts: 3,753

Rep: Reputation: 935
Quote:
Originally Posted by maxreason View Post
It is my understanding that both of my modern AMD CPUs (4-core on windoze and 8-core on ubuntu) have "constant speed rdtsc clocks" that don't change speed if and when the CPUs are throttled back (to save power or reduce heat). Just for the record, the 4-core windoze CPU is clocking at 3.4GHz and the 8-core ubuntu CPU is clocking at 3.6GHz.
You can't compare a windows machine running an AMD Phenom II CPU (probably a X4 965) and a linux machine running a 'Bulldozer' CPU (probably an FX-8150). You couldn't even compare them both running linux, or even both running windows...

The Bulldozer CPUs are very new, less than 6 months old at this point, and are a new architecture. I'd doubt it even has proper support from 10.04; its kernel is too old.

Quote:
Originally Posted by ukiuki View Post
Yet new processors also have not just the real cores but the virtual cores, it is so powerful that doesn't even need all the speed to process that code(unless your code force it to full), it is another thing that may be related.
'Virtual cores'? I take it you mean 'hyperthreading', which has been around since the P4s (2002). AMD CPUs don't have hyperthreading at all.

Quote:
Originally Posted by maxreason View Post
In my opinion, this should be the default operating condition of all non-battery powered PCs. Why not? The only possible proviso that I can imagine is a "quick response scenario" application that may only be triggered now and then, but must execute and respond in the minimum elapsed time possible. In that case the question becomes "how quickly are CPU speeds promoted from minimum-speed to maximum-speed when they start doing their work". If the answer is 1 second, or even 0.01 second, then probably "ondemand" isn't appropriate in those cases. But otherwise...
AFAIK, you need the actual code to get from HDD (and/or RAM) to the CPU cores before the CPU frequency rises. Even though you aren't looking at much time between 'give me this code' and it moving from memory to the CPU cores, that delay is still there.

BTW, IIRC forcing the CPU frequency will stop 'turbo boost/turbo core' from working. Turbo boost is found on most current Intel CPUs; turbo core is found on all AMD Bulldozer CPUs, some Fusion CPUs, and Phenom II CPUs (all X6 and X4 XXX'T' models).

I think that turbo core will not be working with 10.04 and a Bulldozer CPU, unless you've got a much newer kernel than stock (and I don't even recall off hand if you need anything else for turbo core to work). Just to make life difficult, turbo boost and turbo core still report the highest 'non-turbo' speed even with turbo working (eg, a Phenom II X4 960T will still report the speed as 3.0GHz, even if turbo core is boosting the speed on one core to 3.4GHz). There is some program you can get to report the real turbo speed.

Quote:
Originally Posted by maxreason View Post
1: I confirmed that "ondemand" CPU speed takes a while to increase the CPU speed, so it completely screws up benchmarks and any application that needs to respond at full speed (but might sit idle waiting for the condition that requires its quick attention).
You might be looking at a delay in reporting the speed increase.

Sorry I can't comment on your code, I'm not a programmer.... I'm just another hardware monkey.
 
Old 02-26-2012, 10:26 PM   #5
maxreason
Member
 
Registered: Dec 2007
Location: phobos, mars
Distribution: 64-bit linux mint v20
Posts: 259

Original Poster
Rep: Reputation: 17
Quote:
Originally Posted by cascade9 View Post
You can't compare a windows machine running an AMD Phenom II CPU (probably a X4 965) and a linux machine running a 'Bulldozer' CPU (probably an FX-8150). You couldn't even compare them both running linux or even both running windows...
I can't compare them? Why not? That's exactly what a benchmark does: compare them. I do agree with you in one way. To do a proper benchmark (to publish in a real article somewhere) I should be a hardware monkey like you, I should have exactly the same motherboard under both CPUs, and BIOS settings should be as natural to me as breathing. Though I have similar gigabyte motherboards under both CPUs, they're not identical.

You are correct. My two CPUs are 965 and 8150.

Quote:
The Bullodzer CPUs are very new, less than 6 months old at this point, and are a new architecture. I'd doubt it even has proper support from 10.04, its kernel is too old.
I don't see how that matters in this case. But just in case, I'll run the benchmark again as soon as ubuntu64 12.04 is released in a couple months.

Quote:
'Virtual cores'? I'd take it you mean 'hyperthreading'. Which has been around since P4s (2002). AMD CPUs dont have hyperthreading at all.
I hope I didn't mention "virtual cores" in my post, because I don't even know what that term means. If it means "an extra set of registers in each CPU to support hyperthreading", then I agree with you --- I know that AMD doesn't do that like intel does. Also, I don't see how that's involved since I am only running 1 process on the system, and my process is not multithreaded. Though it will be multithreaded soon, in spades, to spread these 3D matrix multiplies and vertex transformations and other work across all 8 cores (or as many cores as the CPU has).

Yes, linux does run a bunch of background daemons, but they don't suck up much time.

Quote:
AFAIK, you need the actual code to get from HDD (and/or RAM) to the CPU cores before CPU frequency rises. Even though you arent looking at much time between 'give me this code', and it moving from memory to the CPU cores, that delay is still there.
Absolutely true. That's why my benchmark code calls my assembly language function 3 times just before it starts the timer and then calls that same assembly language function 1-million times, then recaptures the timer. That forces the code into the CPU cache immediately before the benchmark begins. However, I suspect the speed of CPUs is not upgraded to faster speed very often... perhaps once every second, or at best once every 0.001 second. Unfortunately, 0.001 second is enough to totally destroy the benchmark. Maybe a hard core hardware monkey like you would be interested in finding out how "ondemand" CPU speed is handled. I suppose it COULD be in hardware in the CPU chip, and in that case it is possible they have a way to speed up the CPU in a matter of a few machine instructions. If they do, that's perfecto!

Quote:
BTW, IIRC forcing the CPU frequency will stop 'turbo boost/turbo core' from working. Turbo boost is found on most current Intel CPUs, turbo core is found on all AMD Bulldozer CPUs, some Fusion CPUs and Phenom II CPUs (all X6 and X4 XXX'T' models).
Please explain. How exactly does this "turbo core" work? What does it do? Does it overspeed a CPU temporarily? That seems unlikely.

Quote:
I think that turbo core will not be working with 10.04 and a Bulldozer CPU, unless you've got a much newer kernel than stock (and I don't even recall off hand if you need anything else for turbo core to work). Just to make life difficult, turbo boost and turbo core still report the highest 'non-turbo' speed even with turbo working (eg, a Phenom II X4 960T will still report the speed as 3.0GHz, even if turbo core is boosting the speed on one core to 3.4GHz). There is some program you can get to report the real turbo speed.
Well, the BIOS does list a "turbo core" option that I can turn OFF or ON, so at the BIOS level it is supported. I rather doubt this requires any support at the operating system level, but I could be wrong. Often features like this can be handled either "entirely by the BIOS" (just turn it off or on in the CPU), or "the only support required can be performed by a regular old application like I have running - the one that lets me continuously display and select CPU speeds or speed-profiles like those I showed below".

Quote:
You might be looking at a delay in reporting the speed increase.
That much seems certain, since that CPU speed application has an option for how often to update its CPU speed display. 1 second is the default.

Quote:
Sorry I cant comment on your code, I'm not a programmer....I'm just another hardware monkey.
No problem --- I understand hardware (I used to design CPUs), but I don't have time to keep up. Too much other work. As for the assembly language, I'm pretty sure it's about as compact and efficient as that algorithm can be made in 64-bit assembly language.

Last edited by maxreason; 02-26-2012 at 10:47 PM. Reason: grammar
 
Old 02-28-2012, 12:29 AM   #6
cascade9
Senior Member
 
Registered: Mar 2011
Location: Brisneyland
Distribution: Debian, aptosid
Posts: 3,753

Rep: Reputation: 935
Quote:
Originally Posted by maxreason View Post
I can't compare them? Why not? That's exactly what a benchmark does, compare them. I do agree with you in one way. To do a proper benchmark (to publish in a real article somewhere) I should be a hardware monkey like you, I should have exactly the same motherboard under both CPUs, and BIOS settings should be as natural to me as breathing. Though I have similar Gigabyte motherboards under both CPUs, they're not identical.

You are correct. My two CPUs are 965 and 8150.
Forgive my bad wording. Sure, you can compare a X4 965 and a FX-8150. Doing that and wondering why code runs faster or slower on one CPU compared to the other is a bit of a mistake, though... the FX-XXXX CPUs are a totally new architecture.

Have a look here (chosen virtually at random) to see some detailed stuff on how the 'bulldozer' FX-XXXX CPU architecture is different-

http://www.anandtech.com/show/4955/t...-fx8150-tested

Looking at benchmarks (mostly for windows, but that's what most benchmarks are done with) you'll see that the Phenom II CPUs and FX-XXXX CPUs have some strange results if you don't remember the 'new architecture' bit. Some benchmarks run faster on FX-XXXX, as you would expect, since it's a 'faster' chip on paper (Phenom II: 512KB L2 per core, 6MB L3. FX-XXXX: 2MB L2 per module (1 module counts as 2 cores, even though there aren't 2 'full' cores), plus 8MB L3). But some run slower.

IMO that is mostly due to architecture changes, which is exactly what has happened with architecture changes in the past (e.g., when changing from an AMD K6-3 to an AMD Athlon). The BIOS/UEFI, etc. will also play a part.

Quote:
Originally Posted by maxreason View Post
I don't see how that matters in this case. But just in case, I'll run the benchmark again as soon as ubuntu64 12.04 is released in a couple months.
*Shudders* Ubuntu, not my thing. *buntus never seem to benchmark as well as some other distros.

Quote:
Originally Posted by maxreason View Post
I hope I didn't mention "virtual cores" in my post, because I don't even know what that term means. If it means "an extra set of registers in each CPU to support hyperthreading", then I agree with you --- I know that AMD doesn't do that like intel does. Also, I don't see how that's involved since I am only running 1 process on the system, and my process is not multithreaded. Though it will be multithreaded soon, in spades, to spread these 3D matrix multiplies and vertex transformations and other work across all 8 cores (or as many cores as the CPU has).
Nah, you didn't mention 'virtual cores' at all, it was ukiuki.

Quote:
Originally Posted by maxreason View Post
Maybe a hard core hardware monkey like you would be interested in finding out how "ondemand" CPU speed is handled. I suppose it COULD be in hardware in the CPU chip, and in that case it is possible they have a way to speed up the CPU in a matter of a few machine instructions. If they do, that's perfecto!
To be honest, I've never really dug into how it deals with frequency changes. The brief bit of digging I did hasn't helped much; I'll try to get some better information. Maybe I'll find it, maybe I won't... if I was any good at coding/programming I would check the code. My coding skills, and my ability to read code, are about as good as the average idiot's, though...

Quote:
Originally Posted by maxreason View Post
Please explain. How exactly does this "turbo core" work? What does it do? Does it overspeed a CPU temporarily? That seems unlikely.
That's exactly what it does.

Quote:
Originally Posted by maxreason View Post
Well, the BIOS does list a "turbo core" option that I can turn OFF or ON, so at the BIOS level it is supported. I rather doubt this requires any support at the operating system level, but I could be wrong. Often features like this can be handled either "entirely by the BIOS" (just turn it off or on in the CPU), or "the only support required can be performed by a regular old application like I have running - the one that lets me continuously display and select CPU speeds or speed-profiles like those I showed below".
This is what I get when I don't double-check stuff. In my defence, I haven't built an FX-XXXX system for anybody, and it's when I build a system that I tend to do the deeper digging needed to know this stuff properly. The Phenom II X6/X4 'T' CPUs I've worked on have always seemed to run properly, and they were windows machines anyway, so I didn't check the linux support.

Intel turbo boost did have some issues with linux on release, due to the software, but there seem to be more software controls on turbo boost (e.g., it will actually turn cores off, unlike turbo core). I'd stupidly assumed that turbo core would be the same... turbo core seems to be controlled totally by the chip-

Quote:
Turbo Core kicks in when 3 or more cores (on a 6-core part) are idle. When this happens, the frequency of those three cores is reduced to 800MHz, the voltage to the entire chip is increased, and the remaining three cores are turboed up by as much as 500MHz. It doesn’t get any more granular than this. If you have 3 or more cores idle, then the remaining turbo up. In any other situation the CPU runs at its normal clocks.

The CPU handles all monitoring and does the clock/voltage management itself. The switch to turbo up cores apparently happens fast enough to deal with Windows moving threads around from core to core.
http://www.anandtech.com/show/3641/a...o-core-enabled

With FX-XXXX, it's got 'turbo core 2', not turbo core 1 like the Phenom IIs. A bit different-

Quote:
The Bulldozer CPUs feature Turbo Core 2.0 (Turbo Core was originally introduced with the Phenom II family). Turbo Core 2.0 allows boosting active cores on the Bulldozer processor. If less than half the cores are being utilized, a "max turbo mode" is used on all stressed cores. However, if all cores are being pushed to their limits, Turbo Core is activated but at a lower frequency than the maximum. This auto-overclocking is done automatically when needed.
http://www.phoronix.com/scan.php?pag...features&num=1

Running a FX-8150 with up to 4 cores loaded, you would be getting 'max turbo mode', so it would be running at 4.2GHz, not 3.6GHz. If you are running 5+ cores, you would only get 3.9GHz. Either way, that's faster than the listed frequency of 3.6GHz...
 
Old 02-28-2012, 02:23 AM   #7
maxreason
Member
 
Registered: Dec 2007
Location: phobos, mars
Distribution: 64-bit linux mint v20
Posts: 259

Original Poster
Rep: Reputation: 17
Thanks for the links. I read some bulldozer articles a few months ago, but now I understand them a little better.

One nice thing about the way I'm measuring performance is that I'm measuring in "full-speed clock cycles", not slower cycles when the core is somehow slowed down. At least that's what I've seen stated for the instruction I execute to capture this "clock count". Therefore, when I compare a CPU with 3.0GHz cores to a CPU with 3.6GHz cores... or any other speeds... I should get an accurate comparison of how they would perform IF they were both running at the same clock rate.

When I add code to spread work across multiple cores, I'm going to make a special effort to run these math-intensive processes in parallel on cores 0,2,4,6 or cores 1,3,5,7 to avoid overloading the shared floating-point units. Hopefully I'll have non-math-intensive work I can perform in parallel on the rest of the cores; otherwise the bulldozer architecture will not be optimal.
 
  

