[SOLVED] ACPI related kernel (2.6.34) freeze on Nehalem/Westmere CPU, on a SuperMicro MB

junping2000 · 10-12-2010, 03:26 PM

I have seen some really weird kernel freezes with 2.6.32 through 2.6.34 on Westmere
CPUs. The exact cause (CPU, kernel, or hardware) is not known, so I am going to
list the facts:

** When the crash happens, there is __nothing__ on the console. Not on
video console and not on serial console.

** The machine in question has two Westmere CPUs, each with 4 cores. With
hyperthreading, "cat /proc/cpuinfo" will show 16 logical processors.

** I can usually reproduce the crash by copying a couple of terabytes of
data to this machine.

** Adding "acpi=off" to the boot parameter list "fixed" the problem, but
that also turns off HT, so I lose half of the logical cores.

** I tried "acpi=ht" (which is due to be out in a later kernel); it doesn't
work, the machine still crashes. My understanding is that "acpi=ht"
turns on only the minimum ACPI support needed to get hyperthreading
working. But that doesn't seem to be good enough.

** On this particular board, lm_sensors support is broken. So I used
"supero doctor" from SuperMicro to monitor the CPU temperature, and
it has been normal right up to the crash, so I ruled out the
overheating theory.

** Forcing "performance" scaling governor over the default "ondemand"
governor effectively disabled the CPU freq scaling, but it doesn't
solve the problem. The machine still crashes.

** Just to verify the ACPI DSDT is not corrupt (from Linux's
perspective), I installed Intel's ASL compiler (iasl) and did
    cat /proc/acpi/dsdt > /tmp/dsdt.dat
    iasl -d /tmp/dsdt.dat
    iasl -tc /tmp/dsdt.dsl
(iasl -d writes /tmp/dsdt.dsl itself, so its output should not be
redirected onto that file.) The recompilation didn't generate any
errors, so I doubt the kernel parsed it wrong either.

** We also have some machines running E5430 (Harpertown) instead
of Westmere (E5630). 2.6.34 runs fine on those without any
issues. But since 2.6.34 doesn't load the same acpi modules,
the difference might be moot.

** We have three Westmere machines here and I crashed them all with
the same tests. Our supplier is known to use quality parts, so
I would rule out spotty power supplies.
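
For the governor check in the list above, here is a minimal sketch of forcing one governor on every CPU through sysfs. The helper name set_governor is hypothetical, and the path is the usual cpufreq sysfs location, assumed (not verified) to apply to this kernel and board.

```shell
# set_governor GOVERNOR [BASE]
# Write GOVERNOR into every cpuN/cpufreq/scaling_governor file under BASE.
# BASE defaults to the usual sysfs location (an assumption for this setup).
set_governor() {
    gov=$1
    base=${2:-/sys/devices/system/cpu}
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs without a writable cpufreq entry (or an unmatched glob).
        [ -w "$f" ] || continue
        echo "$gov" > "$f"
    done
}
```

Run as root, "set_governor performance" would apply it to all CPUs; reading the same files back afterwards shows whether the setting took.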

Here is some output from the system in case it helps to explain:

[root@localhost ~]# uname -a
Linux localhost.localdomain 2.6.34.6-54.fc13.x86_64 #1 SMP Sat Sep 11 15:28:03 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]#

[root@localhost ~]# cat /proc/cpuinfo
...
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5630 @ 2.53GHz
stepping : 2
cpu MHz : 2533.798
cache size : 12288 KB
physical id : 1
siblings : 4
core id : 10
cpu cores : 4
apicid : 52
initial apicid : 52
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ida arat tpr_shadow vnmi flexpriority ept vpid
bogomips : 5065.93
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

[root@localhost ~]#
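
In the output above, "siblings" and "cpu cores" are both 4, which is what one would expect to see with hyperthreading off (for example after booting with acpi=off). A small sketch for checking this from /proc/cpuinfo-style text follows; ht_status is a hypothetical helper name, not an existing tool.

```shell
# ht_status: read /proc/cpuinfo-style text on stdin and compare the
# "siblings" and "cpu cores" fields (the last values seen are used).
# More siblings than cores suggests hyperthreading is active.
ht_status() {
    awk -F: '
        /^siblings/  { gsub(/[ \t]/, "", $2); siblings = $2 }
        /^cpu cores/ { gsub(/[ \t]/, "", $2); cores = $2 }
        END {
            if (siblings == "" || cores == "") { print "unknown"; exit }
            if (siblings + 0 > cores + 0)
                print "ht-on siblings=" siblings " cores=" cores
            else
                print "ht-off siblings=" siblings " cores=" cores
        }'
}
```

Usage would be "ht_status < /proc/cpuinfo".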

Keep in mind that most of the people who had ACPI problems had
problems booting up. In my case, the boot went just fine, and the
crash is very random. It feels like it happens when the system
is down-scaling, and I suspect it is the voltage down-scaling that
shuts down the whole system.

This is the first time I have seen a linux crash without a trace
(panic). To be honest, I am still 50-50 on whether it's a hardware
issue (Westmere, SuperMicro MB) or kernel issue.

Next I am going to turn on the debugging logic in the ACPI sub-system in
the kernel. I will update you all on what I find. In the meantime,
I would appreciate any help or ideas to try out.

Thanks,

- Junping

H_TeXMeX_H · 10-13-2010, 12:49 PM

As for the issue, it's really hard to diagnose these things. I was getting some similar instabilities and they suddenly disappeared with a newer kernel.

For now I would just try some other boot options, if nothing else. They may not be directly related, but you should try them anyway:

Code:

nolapic
noioapic
noapic
pci=nomsi

ULA99 · 10-22-2010, 02:01 PM

I am also observing a similar problem on an HP Z800. If I run some heavy numerical calculation, the system reboots within an hour or so.
I am trying acpi=off and it seems to work for now.

Please let me know if you figure out the source of the problem.

Thanks

ULA99

$uname -a
Linux localhost 2.6.32-25-generic #44-Ubuntu SMP Fri Sep 17 20:05:27 UTC 2010 x86_64 GNU/Linux

$ cat /proc/cpuinfo

...

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
stepping : 2
cpu MHz : 2393.971
cache size : 12288 KB
physical id : 1
siblings : 4
core id : 10
cpu cores : 4
apicid : 52
initial apicid : 52
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm arat tpr_shadow vnmi flexpriority ept vpid
bogomips : 4788.13
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

junping2000 · 10-25-2010, 07:58 AM

I tried a lot of approaches lately. Since my box usually dies during data transfer over the network, I used
acpi.power_nocheck as a boot param and even modified the kernel code to skip that check, to make sure the network driver
is not put into the wrong power mode.

I managed to crash all 2.6.34.* kernels with ACPI on with
this. I am really starting to think my problem is somewhat
hardware related, and turning ACPI off just makes it much
less likely to happen. I was not able to crash 2.6.34.7
with all ACPI debugging & tracing on, and
"echo 0xffffffff > /proc/acpi/debug_level"
but that might just be because the timing is different.

Lately, I have been testing the latest stable (2.6.36) with
KDB built in, hoping to catch the crash that way.
It has been going for 2 days now. If the crash doesn't
happen, I will try to test 2.6.36 without KDB.

Again, the hardest part here is that the console (be it virtual
or serial) is blank when this happens, so there is very
little to go on.

AGeek · 10-26-2010, 09:48 AM

Maybe this would help you: for disabling the blanking of console, I use this:

setterm -powersave off -blank 0

Found on http://www.cyberciti.biz/tips/linux-...ing-blank.html

H_TeXMeX_H · 10-26-2010, 12:18 PM

If absolutely nothing works, it could be a BIOS bug. Check for a newer BIOS with something about ACPI in the changelog.

junping2000 · 10-27-2010, 03:55 PM

Thanks for all the replies.

Since the crash happens in kernel (not user space, not X), setterm won't help. Thanks for the suggestion on BIOS.

I spent some time researching all the BIOS options. There are a lot! My
box (X8DTN+) is running AMI v02.68, and the latest one I can get from
SuperMicro is 2.0b, I guess that means the series. I couldn't find
any ChangeLog, so no go there.

While researching the mapping between Intel C-states and ACPI C-states (mostly for the BIOS option
C3 State ===> set to "ACPI C2"),
I came across this link:
http://www-947.ibm.com/support/entry...andind=5000008

So I just turned off Intel C6 from the BIOS and set "C State Package
Limit Setting" to C3 (from "Auto"). So now the deepest sleep it can
do is "C3" (ACPI C2). I am re-running my tests now.

codedr · 10-29-2010, 10:57 AM

junping2000: You say you get the crash 'copying terabytes' over
the network. Can you be more specific?
I am looking for information like the following:
How fast is your network ? 1Gb ? 100Mb ?
Are you using rsync over ssh, nfs, or something else ?
Is it all a bunch of small files, large files, or one big file ?
How many files, avg size and std deviation ?
If you can, tell me how I could reproduce it.

ULA99: Can you describe your 'heavy numerical calculation' ?
Is it single process, multi-process, or multiple threads ?
Can you share the source code or describe the procedure ?

I too am facing a problem, but I discovered it during boot up.
If I had a runtime test that created a similar failure, I think
I could diagnose it easier.

codedr · 10-29-2010, 05:01 PM

I found an issue with the module 'preloadtrace' that causes Westmere to
crash frequently. If you have this module loaded, I recommend that you remove
the package that installs it.

ULA99 · 10-30-2010, 12:59 AM

codedr
The heavy numerical calculation consisted of a continuous evaluation of a multi-threaded FFT. The size of the data array doesn't really matter, and the average load over all 8 cores was ~90%.

I also ran the same calculation on Windows 7, and it did not produce any errors.

The BIOS I am using (the latest version available) doesn't have a setting for "C State Package Limit Setting"; however, according to HP it should already contain a fix for the Westmere C6 state transition bug.

Possibly that BIOS fix doesn't work with the Linux 2.6.32 kernel. I also tried 2.6.35 with the same results.

Finally I tried to verify the limit settings in

/proc/acpi/processor/CPU0/power

but the file did not exist. Also powertop was indicating that the processors were still using C3, even after trying the boot option processor.max_cstate=1
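
Where /proc/acpi/processor/CPU0/power is missing, the cpuidle sysfs directory is another place to look at which idle states are registered and how often each has been entered. A minimal sketch, assuming the usual cpuidle layout (stateN/name and stateN/usage); list_cstates is a hypothetical helper name.

```shell
# list_cstates [BASE]
# Print "stateN name usage" for each cpuidle state directory under BASE.
# BASE defaults to cpu0's cpuidle directory (assumed to exist on this kernel).
list_cstates() {
    base=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for d in "$base"/state*; do
        # Skip the literal pattern when nothing matches.
        [ -d "$d" ] || continue
        printf '%s %s %s\n' "${d##*/}" "$(cat "$d/name")" "$(cat "$d/usage")"
    done
}
```

Running it twice a few seconds apart and comparing the usage counters gives a rough view of which states are actually being entered.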

junping2000 · 10-31-2010, 08:29 AM

In my test, I can usually crash the box within a day by using
"rsync -e ssh" to copy data from the other box to this Westmere box.
It's over a Gigabit network.

I tried the heavy load approach early on by running 20-30
"openssl speed" to load up all 16 cores (across 2 CPUs) all
the time, but the problem didn't happen. I suspect the
problem will not appear under heavy load, but rather in C-state transitions.

Right, the "processor.max_cstate=0" alone won't do it. I
looked at the kernel source and used the above with
"intel_idle.max_cstate=0", which totally disabled the
intel idle driver. After that, your powertop will look
crippled.

Intel's own errata document is here
www.intel.com/Assets/PDF/specupdate/323372.pdf

So far I can still crash the box even with minimum ACPI
setting.

AGeek · 11-02-2010, 08:12 AM

I suggested using setterm to prevent blanking of the text console with the ACPI debug trace options turned fully on ('echo 0xffffffff > /proc/acpi/debug_level'), but obviously you will have to redirect all kernel messages to the console (using 'echo 8 > /sys/kernel/printk') and not use X11. That way you would be able to see whether your hang involves something in the ACPI 'stack' (like a power management transition)...

Acpi.debug_layer should have at least ACPI_HARDWARE (see drivers/acpi/debug.c in a kernel tree), so 0x00800002 to see power management transitions. Maybe ACPI_PROCESSOR_COMPONENT would be useful too.

Just an idea...

junping2000 · 11-03-2010, 10:18 AM

This is a server, and I never ran X on it, so when I say the console is frozen, it's really frozen. In an earlier post, I
mentioned that I turned on all ACPI debug options in the kernel and used "echo 0xffffffff > /proc/acpi/debug_level". It didn't
crash in one run, but I suspect that is the change of timing and it's merely masking the problem.

I got the box not to crash by doing a cocktail with 2.6.34.7 (our production kernel):

1. Modified the kernel to fix one possible uninitialized variable in acpi_pad.c
2. Configured the kernel to turn off ACPI Sleep (# CONFIG_ACPI_SLEEP is not set)
3. After looking at the ACPI modules under drivers/acpi/, I am using the following boot options:
processor.max_cstate=1 processor.nocst=1 intel_idle.max_cstate=0 acpi.power_nocheck=1 pci=noacpi thermal.off=1
Basically, I want to turn almost all of ACPI off without turning off hyper-threading. In case you didn't read the
previous posts in the thread, I tried "acpi=ht" with no luck. If I use "acpi=off", the machine doesn't crash
but I lose half the cores.
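
With this many boot options in play, it helps to confirm that each one really made it onto the running kernel's command line before drawing conclusions from a test run. A minimal sketch; has_param and report_params are hypothetical helper names.

```shell
# has_param CMDLINE PARAM
# Succeed if PARAM appears as a whole space-separated word in CMDLINE.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# report_params CMDLINE PARAM...
# Print "ok" or "MISSING" for each PARAM.
report_params() {
    cmdline=$1
    shift
    for p in "$@"; do
        if has_param "$cmdline" "$p"; then
            echo "ok      $p"
        else
            echo "MISSING $p"
        fi
    done
}
```

For example: report_params "$(cat /proc/cmdline)" processor.max_cstate=1 processor.nocst=1 intel_idle.max_cstate=0. This only checks what was passed at boot, not whether the driver honoured it.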

I am re-doing the testing to make sure it doesn't crash. If it doesn't, I will unwind the options and see which one
really matters. My manager also got me a contact at the MB manufacturer; I will send him my findings and see if his
other customers have experienced the same problem.

AGeek · 11-05-2010, 04:09 AM

Sorry if I did not explain it right: my idea was to set up the text console so that when a freeze occurs, you will know if and which ACPI function was called just before the freeze...

At least, you have a screen to watch the boot process and enter kernel options? Or do you mean you do not have a screen at all? In that case, you will have to boot using a serial console (console=ttySn,xxx or console=ttyUSB0,xxx) and set up a PC to receive your kernel log messages to make my test work. Of course, if you can plug in a screen during the test, this will be much easier to set up:

echo 8 > /proc/sys/kernel/printk # this will make all kernel msg go to the active text console or serial console (sorry for the mistake in my previous post...)
setterm -poweroff 0 ....

If
echo 0xffffffff > /proc/acpi/debug_level
makes the problem disappear because of a timing issue, as you say (for example ACPI_LV_INTERRUPTS flooding dmesg), I would try with

echo 0x00200004 > /proc/acpi/debug_level

for just ACPI_LV_FUNCTIONS & ACPI_LV_INFO with

echo 0x00800002 > /proc/acpi/debug_layer

for just ACPI_HARDWARE and ACPI_POWER_COMPONENT; ACPI_PROCESSOR_COMPONENT is a separate bit that would have to be ORed in as well (see cat /sys/module/acpi/parameters/debug_layer)
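
The two masks above are bitwise ORs of individual flags. A minimal sketch of composing them follows; the bit values are quoted from memory of the kernel's ACPI debug documentation and should be checked against your own tree, and or_mask is a hypothetical helper name.

```shell
# Flag values as recalled from the kernel's ACPI debug documentation
# (an assumption here; verify against your tree before relying on them).
ACPI_LV_INFO=0x00000004
ACPI_LV_FUNCTIONS=0x00200000
ACPI_HARDWARE=0x00000002
ACPI_POWER_COMPONENT=0x00800000
ACPI_PROCESSOR_COMPONENT=0x20000000

# or_mask VALUE...
# OR all VALUEs together and print the result as an 8-digit hex mask.
or_mask() {
    m=0
    for v in "$@"; do
        m=$(( $m | $v ))
    done
    printf '0x%08x\n' "$m"
}
```

With these values, "or_mask $ACPI_LV_FUNCTIONS $ACPI_LV_INFO" gives the debug_level above, "or_mask $ACPI_HARDWARE $ACPI_POWER_COMPONENT" gives the debug_layer above, and adding $ACPI_PROCESSOR_COMPONENT to the latter yields 0x20800002.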

You should see some periodic messages like:

Code:

  
  hwregs-0186 [02] hw_register_read      : ----Entry
 hwvalid-0132 [03] hw_validate_io_request: ----Entry
 hwvalid-0161 [03] hw_validate_io_request: ----Exit- AE_OK
  hwregs-0245 [02] hw_register_read      : ----Exit- AE_OK
  hwregs-0186 [02] hw_register_read      : ----Entry
 hwvalid-0132 [03] hw_validate_io_request: ----Entry
 hwvalid-0161 [03] hw_validate_io_request: ----Exit- AE_OK
  hwregs-0245 [02] hw_register_read      : ----Exit- AE_OK
 hwvalid-0132 [02] hw_validate_io_request: ----Entry
 hwvalid-0161 [02] hw_validate_io_request: ----Exit- AE_OK
 hwvalid-0132 [02] hw_validate_io_request: ----Entry
 hwvalid-0161 [02] hw_validate_io_request: ----Exit- AE_OK

Then let the computer run for a while without using the text console until the freeze happens, in which case it may display which ACPI driver function was responsible...

Another question: do you have X86_BIGSMP set in your kernel? I could not even boot correctly on my configuration (which has the same dual Xeon processors but an Intel motherboard, and runs fine now with ACPI fully on...) until I recompiled with this option. BIGSMP is needed when there are more than 8 processors... And with HyperThreading, you have 16... but then your last message suggests you got it running?

junping2000 · 11-22-2010, 08:53 AM

Turns out our hardware supplier packages an older version
of SuperMicro BIOS (v2.0) in their version of BIOS. The
latest SuperMicro v2.0b solved the problem.

(See my comment #7 for my hardware, different bios versions).

Our supplier also browsed the ChangeLog for 2.0b and didn't
see anything glaring. A couple of ACPI changes mentioned
don't look consequential. I was told that the appliance
with the v2.0 BIOS was certified to run FC9 and FC10, so
it's likely the later kernel might have explored more ACPI
features and in turn exposed the problem.