localtimer and rescheduling interrupts going through the roof

iluvatar · 08-09-2012, 04:47 AM

Hi everybody,

First, some background information. We have a blade setup with 10 Dell PowerEdge 1955 blades, all configured exactly the same. They all run an untouched, fresh Debian Squeeze install. Each blade has two quad-core CPUs (Xeon E5345), and all servers run the same kernel: Linux 2.6.32-5-amd64 #1 SMP Sun May 6 04:00:17 UTC 2012 x86_64 GNU/Linux.

We've written clustered index-building software which runs on these machines. This software is already live in a production environment on another set of blades (also all the same model), and it works correctly there.

On the new blades, we noticed a very big performance problem, caused by a single blade. Upon further analysis, I saw that the blade causing the problems had a huge number of interrupts. I wrote a script to analyze the interrupt counts, recording the number of interrupts in each 5-second interval. Here is an extract from the output I got while our software was running:

Quote:

normal blade, localtimer interrupts:
6987
7219
7168
7031
6884
7166
6846
6699
7416
7018

Quote:

problem blade, localtimer interrupts:
20292
15861
17124
15181
18748
25346
15790
15386
14714
16959

Quote:

normal blade, rescheduling interrupts:
27
30
37
50
5
23
26
35
18
21

Quote:

problem blade, rescheduling interrupts:
4281
5139
8334
4908
5115
4492
4920
5972
5268
5596
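
For anyone who wants to collect comparable numbers: below is a minimal sketch of how per-interval counts like these can be gathered. It is not the original script. It assumes the usual /proc/interrupts layout (a header row of CPU names, then one row per source with one count per CPU) and that the local timer and rescheduling rows are labelled LOC and RES, as they are on x86 kernels. The function names are made up for illustration.

```python
import time


def parse_interrupts(text):
    """Return {label: count summed over all CPUs} from /proc/interrupts text."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "label: counts" row
        total = 0
        for token in rest.split()[:ncpu]:
            if not token.isdigit():
                break  # reached the description text
            total += int(token)
        totals[label.strip()] = total
    return totals


def interval_delta(before, after, labels=("LOC", "RES")):
    """Difference between two snapshots for the chosen labels."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in labels}


def read_interrupts(path="/proc/interrupts"):
    with open(path) as f:
        return parse_interrupts(f.read())


def watch(interval=5, samples=10):
    """Print the LOC/RES increase for each interval (Linux only)."""
    previous = read_interrupts()
    for _ in range(samples):
        time.sleep(interval)
        current = read_interrupts()
        print(interval_delta(previous, current))
        previous = current
```

Calling watch(5, 10) on a normal blade and on the problem blade under the same load would give ten 5-second samples from each to compare side by side.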

This is where I'm stuck, however: I'm not a kernel hacker and don't know how to debug, analyse or test this further. How can I see what's really going on here, and what are my options for testing? Are there certain kernel parameters I could tweak? Could it be faulty hardware, and if so, how do I determine what is broken?

Any help would be very welcome. I can post more details if you need to know anything else.

[EDIT]
I followed the directions in this document I found: https://help.ubuntu.com/community/Re...lingInterrupts, and tried all the kernel parameters listed there (acpi=noirq, acpi=off, noapic and nolapic), but this didn't change anything. Unfortunately, I don't have access to the BIOS right now... Are there other options to try?

sundialsvcs · 08-09-2012, 01:38 PM

Take the blade offline and replace it with another one. If the problem goes away, consider the problem solved. (Because, in fact, it is.)

My best guess is that something is preventing timer interrupts from being serviced in a timely manner, or maybe the real-time clock is screwed. Or maybe the system is being otherwise flooded with interrupts.

Doesn't matter, because your goal is to get production done. There's gonna be zero return-on-investment in you futzing around with it.

If replacing this blade without figuring out why solves the problem, don't bother to figure out why. Send it back to the manufacturer and ask for another one. If you're a good customer with a good service rep, you'll get it.

rew · 09-28-2012, 05:54 AM

I'm having a similar problem. My workstation sees lots of interrupts: about 120 thousand to 150 thousand per second, i.e. a lot more than the thread starter here...

It's just that I expected my system to be more or less idle (with a few hundred interrupts per second, max) when I'm not touching it.
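
One quick way to put a number on the overall rate is the "intr" line in /proc/stat, whose first field is the total number of interrupts serviced since boot. Below is a rough sketch of turning two readings of that line into a per-second figure; the function names are hypothetical and the code only relies on that one documented field.

```python
def total_interrupts(stat_text):
    """Return the first field of the 'intr' line from /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "intr":
            return int(fields[1])
    raise ValueError("no 'intr' line found")


def rate_per_second(before, after, seconds):
    """Average interrupts per second between two readings."""
    return (after - before) / seconds


def read_total(path="/proc/stat"):
    with open(path) as f:
        return total_interrupts(f.read())
```

Taking read_total() twice, a few seconds apart on an otherwise idle machine, and passing both values to rate_per_second() gives the figure to compare against the "few hundred per second" expectation.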

sundialsvcs · 09-28-2012, 07:33 AM

Same recommendation. Call the vendor and tell them to bring you another one. They can go home and figure out why it's busted on their own time. If you've got ten supposedly "identical" computers, all running the same software, and one of them is the odd man out, it's hardware. Gotta be. And therefore, it's someone else's job to figure out why. They can bring you a rental car for the interim.