Good afternoon.
I have a Beowulf cluster (CentOS 6.9) with 15 racks/nodes (which I will call n0-n14, with the head node being n-1). For some reason, n3, n4, n6, n9, n10, n11, n13, and n14 shut down (not gracefully) and drop into a power-saving state in the middle of normal use. I can reboot all of these nodes with no issues, but within 10 minutes they go down again, one by one, in no particular order. I cannot find a reason in the head node's log file, other than the loss of connection to the node at the time of shutdown. The BIOS of each node records a power event for an unexpected shutdown. I have tried powering each node up on its own, but the nodes listed above still shut down every time.
The nodes are split across three UPSes: n0-n4, n5-n9, and n10-n14. To check whether this was a UPS-related issue, I moved one node, n3, to another room on a different breaker, with no UPS and disconnected from the cluster, and powered it up. I left it sitting at the BIOS screen and watched it with a VGA monitor and USB keyboard. After about 1 minute and 20 seconds, the node shut down and went into standby mode again.
I feel confident that this is a hardware issue and not related to the cluster.
I am going to try pulling the RAM and reseating it.
I think it would be strange for this to be a power supply issue, since the failing nodes are scattered across 3 different UPSes. I have another cluster of the exact same age, with the same software config, hardware config, chassis, UPS, etc. (except that it uses Intel Xeon processors instead of AMD Opteron), and it has no issues at all. I also have a third cluster, half of it with even older hardware and half with newer hardware.
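To make the UPS point concrete, here is a quick sketch that groups the failing nodes by UPS. It assumes the three UPS groups are exactly n0-n4, n5-n9, and n10-n14 as described above, and that the "11" in my list is n11. The UPS numbers 0-2 and the function names are just labels I made up for illustration.

```python
# Failing nodes as listed in the post (assumption: "11" means n11).
FAILING = [3, 4, 6, 9, 10, 11, 13, 14]

# Assumption: five consecutive nodes per UPS (n0-n4, n5-n9, n10-n14).
NODES_PER_UPS = 5


def ups_for(node):
    """Return the (hypothetical) UPS index 0-2 that feeds a given node number."""
    return node // NODES_PER_UPS


def failing_by_ups(failing, n_ups=3):
    """Group failing node names by the UPS they are plugged into."""
    groups = {ups: [] for ups in range(n_ups)}
    for node in failing:
        groups[ups_for(node)].append(f"n{node}")
    return groups


if __name__ == "__main__":
    for ups, nodes in failing_by_ups(FAILING).items():
        print(f"UPS {ups}: {len(nodes)}/{NODES_PER_UPS} failing -> {', '.join(nodes)}")
```

Every UPS has both failing and healthy nodes on it (2 of 5, 2 of 5, and 4 of 5 failing), which is why I doubt the UPSes themselves are the cause.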
Has anyone ever run into a problem like this, and could anyone advise me on the proper steps to diagnose it?
Right now I am leaning towards purchasing a new power supply anyway and trying it out on one of the failing nodes.
This has been going on since before I took over these clusters (about a year ago). I asked the previous sysadmin multiple times for details on the troubleshooting he had done, but I could never get a straight answer/story from him.

Oh, and documentation? pfft.

I was told to boot the nodes up and keep using them until they died, and then keep rebooting them.