LinuxQuestions.org


rylan76 11-20-2012 05:03 AM

Strange (but fatal) recurring rackmount problems
 
Hi Guys

I'd like some input... here's the situation.

I had a normal desktop box in a normal tower case, on which I set up CentOS 6 as a DHCP server, DNS server and Samba PDC.

We have a rack mount setup at work which already contains 12 other servers. All are built into rack mount trays.

I had the board, hdd and power supply built into a standard rack mount tray by an external provider, and they installed the tray into the rack for me. The tray is the exact same as the trays that contain our 12 other servers.

So I started the machine, and all was fine - it was doing DHCP, DNS and PDC duties. I tested it for several hours, then went home at end of business.

Came in the next morning and it was dead. Pulled the tray, had it taken apart, and it had melted - you can actually see where the traces on the motherboard melted and flowed together. The power supply is fine, had it measured and it is outputting as it should. So...

Replaced the motherboard, put in another power supply, CPU, and an HDD of a different model. Reinstalled CentOS and set up the DHCP, DNS and PDC servers again. Had it all installed in the same physical tray.

Went home.

Came back, melted a SECOND time. Same parameters, motherboard totally destroyed, CPU gone, and HDD dead.

Other 12 machines are fine and running 100%. All the network switches and routers mounted in the same rack also fine.

The only factor left is the tray itself, and the rack - I went over it with a fine-tooth comb, and there are no projections or irregularities - it is properly spaced, so it appears not to be a short-to-case or something similar.

The other thing is, it WORKS fine for about 12 or 14 hours, but leave it any longer than 24 hours and the hardware in that tray is promptly destroyed.

It is getting quite expensive... any ideas what I can try / do? All that is left to change is the tray itself, but if the tray is the culprit, why fail after an indeterminate amount of time - not immediately, if it is a short or something similar?

It is a properly stabilized server room with an ambient temp of about 16 deg C and a stabilized, protected power supply with auto-start generator backup. There have been no electrical events anywhere nearby, no need to fall back to generator, or any other salient events. All the other machines (even the one in the adjacent tray, about 20 centimeters lower, vertically) are fine and running 100%.

Any ideas or comments? What the flaming h...l could be going on that keeps smoking the hardware I try to add to the rack?

NyteOwl 11-20-2012 02:23 PM

This sounds like a simple problem of rather massive overheating.

Have you checked all the heatsink and case fans, as well as the ventilation path in the rack itself?

If this is the top unit in the rack, you need to ensure it's not being unduly heated by the units below it. Improper air circulation can turn a rack cabinet into a small blast furnace and the top unit gets the brunt of the hot air flow.

rylan76 11-21-2012 03:57 AM

Hi!

Thanks for the reply. Yes, this unit is the top unit in the rack.

I've not put a max / min thermometer into the server room. Ambient perceived temp is quite cool (I've stood in there and it feels about 21 deg C on the skin).

Hmm - top unit - I'll mention that to my manager, though it doesn't seem to get too hot.

Anyway, thanks for the pointers and taking the time to respond!

Kind regards

Pastychomper 11-21-2012 07:24 AM

Whatever's going on, it's certainly impressive. Could you replace the motherboard and other components with a frozen pizza? If it cooks you'll have ruled out any electrical short in the case, as well as finding a good use for the heat.

H_TeXMeX_H 11-21-2012 11:37 AM

Overheating sounds unlikely. What kind of CPUs were installed? Most are able to throttle down or at least cut power before melting down. Can you identify the source of the meltdown? Maybe it wasn't the CPU...

NyteOwl 11-21-2012 04:52 PM

The ambient in the server room may be irrelevant if the case is in a rack enclosure. In this case the heat gets contained within the rack.

I admit, even in the hottest rack cases, I have never seen one get hot enough to reflow the solder on a circuit board.
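
One way to see what the box itself experiences, rather than the room, is to log the board's own temperature sensors overnight. Below is a minimal sketch in Python, assuming a Linux kernel that exposes thermal zones under /sys/class/thermal with readings in millidegrees Celsius (not every board or kernel does). The function names are made up for illustration.

```python
import glob
import os
import time


def read_zone_temps(base="/sys/class/thermal"):
    """Return {zone_name: degrees_C} for every readable thermal zone under base."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(base, "thermal_zone*", "temp"))):
        zone = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                # The kernel reports millidegrees Celsius as a plain integer.
                temps[zone] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            # Skip zones that cannot be read or parsed instead of aborting.
            continue
    return temps


def update_min_max(stats, temps):
    """Fold one set of readings into a running {zone: (min, max)} record."""
    for zone, t in temps.items():
        lo, hi = stats.get(zone, (t, t))
        stats[zone] = (min(lo, t), max(hi, t))
    return stats


def watch(samples, interval_s=60, base="/sys/class/thermal"):
    """Take `samples` readings, `interval_s` seconds apart; return min/max per zone."""
    stats = {}
    for i in range(samples):
        update_min_max(stats, read_zone_temps(base))
        if i < samples - 1:
            time.sleep(interval_s)
    return stats
```

Calling watch(samples=1440, interval_s=60) would cover roughly 24 hours, which is the window in which the hardware has been failing. Writing each reading to disk as it is taken would be safer still, since the machine may not survive to report the final result.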

