Troubleshooting random reboots, Debian 6 kernel 3.2, Xeon E5-1650
I've built a new server (specs below) to ship off to my datacenter for co-lo, to replace a rock-solid but older piece of hardware.
The new server is randomly rebooting. I could use some advice on narrowing it down.
I've built hundreds of servers in my former line of work, so this is not my first dance, but these days I work for myself (trading) and don't have piles of hardware lying around like I used to at the office job. So I can't just swap in a new set of hardware and see if the problem goes away.
SuperMicro SuperServer 5027R-WRF
Intel Xeon E5-1650
64GB Kingston DDR3 1600 ECC REG CL11 1.5V (2 x KVR1600D3D4R11SK4/32G)
4 x WD 500GB RE4 (WD5003ABYX)
Kernel 3.2.0 (installed from backports)
I have hammered on this system hard, and it passed everything with flying colors. I was going to ship it off to the datacenter within a few hours, and then out of the blue it rebooted. There was no load and nothing going on at all; I heard it reboot from the other room.
This is the second time the system has randomly rebooted. The first was the very first time I ran sysbench on it as a burn-in test. I chalked it up to a sysbench problem because it rebooted within seconds of my pressing Enter on the test. I then hammered on it with sysbench for days under huge load and it was completely stable.
Then the other day it was just idle with nothing going on, and bam - reboot.
I checked all the logs and found nothing. There is no kernel panic. There is absolutely nothing to go on. This would seem to point to hardware, and yet I have pounded on this system for the last couple of weeks without a single other problem.
I can't ship this to the datacenter with a random reboot problem.
It has completed more than 5 passes of memtest without errors. I will let it go to ten, though that alone will take another couple of days.
Things I have thought of:
1) Reduce to 1 stick of memory and see if the problem goes away. The trouble is that with the problem showing up only randomly, and only every couple of weeks, narrowing it down to one of the 8 sticks could take months. Plus, they all pass memtest.
2) Back up the current system (Clonezilla), install a different kernel or distro, and see if the problem shows up again. This is really not ideal because I want to run Debian 6. I could revert to the 2.6.32 kernel, but there were some nice speed improvements with kernel 3.2.0. And how long do I test on 2.6.32 before I feel assured the problem was with 3.2.0? Weeks? Months? All the while, this server is sitting in my house instead of at the co-lo, which is costing me money.
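To put a rough number on "weeks? months?", here is a back-of-the-envelope sketch in Python. It rests on assumptions that are guesses, not measurements: reboots arrive as a Poisson process averaging one every two weeks, and a faulty stick would be isolated by halving the set of 8 each round. The function names are made up for illustration.

```python
import math


def weeks_of_uptime_needed(mean_weeks_between_reboots, confidence):
    """Weeks of reboot-free uptime needed before concluding a change fixed it.

    Assumes reboots follow a Poisson process, so the chance of seeing no
    reboot in t weeks on a still-faulty system is exp(-t / mean).
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -mean_weeks_between_reboots * math.log(1.0 - confidence)


def rounds_to_isolate(n_sticks):
    """Number of halving rounds needed to single out one stick of n."""
    return math.ceil(math.log2(n_sticks))


if __name__ == "__main__":
    # Assumed inputs: one reboot every 2 weeks on average, 95% confidence.
    per_test = weeks_of_uptime_needed(2, 0.95)
    rounds = rounds_to_isolate(8)
    print(f"clean weeks needed per test: {per_test:.1f}")
    print(f"halving rounds for 8 sticks: {rounds}")
    print(f"worst-case total weeks: {per_test * rounds:.1f}")
```

Under those assumptions a single test (a kernel swap, or one memory configuration) needs about 6 clean weeks, and three halving rounds could take about 18 weeks in the worst case. A round ends early whenever the reboot does recur.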
Other suggestions? I checked /proc/sys/kernel/panic and it was 0, which means it should not automatically reboot during a panic (plus there is no log anywhere indicating a panic).
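For reference, the panic check above can be scripted so it is easy to repeat after a kernel change. This is a minimal sketch in Python, assuming the usual /proc/sys/kernel layout. It also reads panic_on_oops, which is not mentioned above and is included only as a related setting. The function names are hypothetical.

```python
from pathlib import Path


def read_panic_settings(sysctl_dir="/proc/sys/kernel"):
    """Return panic-related sysctls as {name: int}, or None where unreadable."""
    settings = {}
    for name in ("panic", "panic_on_oops"):
        try:
            settings[name] = int((Path(sysctl_dir) / name).read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not an integer.
            settings[name] = None
    return settings


def reboots_on_panic(settings):
    """True if kernel.panic is set to a nonzero value.

    0 means the kernel does not reboot itself after a panic; a nonzero
    value enables an automatic reboot.
    """
    value = settings.get("panic")
    return value is not None and value != 0


if __name__ == "__main__":
    current = read_panic_settings()
    print(current)
    print("auto-reboot on panic:", reboots_on_panic(current))
```

With kernel.panic at 0, as on this system, reboots_on_panic returns False, which is consistent with the reboots not being panic-triggered restarts.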