Hi everybody,
First, some background. We have a blade setup with 10 Dell PowerEdge 1955 blades, all configured identically. They all run a fresh, unmodified Debian Squeeze install. Each blade has two quad-core Xeon E5345 CPUs, and all servers run the same kernel: Linux 2.6.32-5-amd64 #1 SMP Sun May 6 04:00:17 UTC 2012 x86_64 GNU/Linux.
We've written clustered index-building software that runs on these machines. The software is already live in a production environment on another set of blades (also all the same model), where it works correctly.
On the new blades, we noticed a severe performance problem caused by a single blade. On further analysis, I saw that the blade causing the problems had a very high number of interrupts. I wrote a script to analyze the interrupt counts, recording the number of interrupts in the last 5 seconds. Here is an extract from the output I got while our software was running:
Normal blade, local timer interrupts:
6987 7219 7168 7031 6884 7166 6846 6699 7416 7018
Problem blade, local timer interrupts:
20292 15861 17124 15181 18748 25346 15790 15386 14714 16959
Normal blade, rescheduling interrupts:
27 30 37 50 5 23 26 35 18 21
Problem blade, rescheduling interrupts:
4281 5139 8334 4908 5115 4492 4920 5972 5268 5596
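For anyone wanting to reproduce these measurements, here is a minimal sketch of how such per-interval counts can be collected. It is not the original script. It assumes Python 3 and the usual /proc/interrupts layout (a header row of CPU columns, then one labelled row per interrupt source, with LOC and RES being the local timer and rescheduling rows). The function names are only illustrative.

```python
import time


def parse_interrupts(text):
    """Return {row label: count summed over all CPUs} from /proc/interrupts text."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "label: counts" row
        # Only the first ncpus fields can be per-CPU counters; the rest is description.
        fields = rest.split()[:ncpus]
        totals[label.strip()] = sum(int(f) for f in fields if f.isdigit())
    return totals


def delta(before, after, labels):
    """Interrupts that occurred between two snapshots, for the given row labels."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in labels}


def sample(labels=("LOC", "RES"), interval=5, rounds=10, path="/proc/interrupts"):
    """Print the number of interrupts per `interval` seconds, `rounds` times."""
    with open(path) as f:
        prev = parse_interrupts(f.read())
    for _ in range(rounds):
        time.sleep(interval)
        with open(path) as f:
            cur = parse_interrupts(f.read())
        print(delta(prev, cur, labels))
        prev = cur
```

Calling `sample()` on each blade while the software is running prints ten 5-second deltas for the LOC and RES rows, which can then be compared between a normal blade and the problem blade.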
Here I'm stuck, however: I'm not a kernel hacker and don't know how to debug, analyse, or test this further. How can I see what's really going on here, and what are my options for testing? Are there kernel parameters I could tweak? Could it be faulty hardware, and if so, how can I determine what is broken?
Any help would be very welcome; I can post more details if you need to know anything else.
[EDIT]
I followed the directions in this document I found:
https://help.ubuntu.com/community/Re...lingInterrupts
I tried all the kernel parameters listed there (acpi=noirq, acpi=off, noapic and nolapic), but this didn't change anything. Unfortunately, I don't have access to the BIOS now... Are there other options to try?
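One thing that may be worth ruling out when a boot parameter seems to have no effect is that it never reached the kernel. A rough sketch, assuming Python 3 and that /proc/cmdline holds the command line the running kernel was booted with; the helper names are made up for illustration:

```python
def read_cmdline(path="/proc/cmdline"):
    """Return the kernel command line of the running system as a string."""
    with open(path) as f:
        return f.read()


def active_params(cmdline, wanted=("acpi=noirq", "acpi=off", "noapic", "nolapic")):
    """Map each parameter of interest to True if it appears on the command line."""
    tokens = cmdline.split()
    return {p: p in tokens for p in wanted}
```

Running `print(active_params(read_cmdline()))` after each reboot shows which of the four parameters were actually passed for that boot.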