LinuxQuestions.org
Linux - Hardware This forum is for Hardware issues.
01-17-2019, 03:20 PM   #1
riped01
LQ Newbie
 
Registered: Aug 2018
Location: US
Distribution: CentOS 6.9
Posts: 15

Rep: Reputation: Disabled
Altus 1650 1U Server Chassis(s) shutdown randomly


Good afternoon.
I have a Beowulf cluster (CentOS 6.9) with 15 racks/nodes (which I will call n0-n14, with the head node being n-1). For some reason, n3, n4, n6, n9, n10, n11, n13, and n14 shut down (not gracefully) and end up in a power-saving state in the middle of normal use. I can reboot all of these nodes with no issues, but within 10 minutes these particular nodes always go down again, one by one, in no particular order. I cannot find a reason in the head node's log file, only a connection loss to the node at the time of shutdown. In the BIOS of each node, a power event for an unexpected shutdown is noted. I have tried powering each node up on its own; however, the nodes listed above still always shut down.
Nodes n0-n4, n5-n9, and n10-n14 are each on their own UPS. To check whether this was a UPS-related issue, I moved one node, n3, into another room that is on a different breaker and powered it up there, without a UPS and separated from the cluster. I used a VGA monitor and a USB keyboard to watch the node while it sat in the BIOS screen. After about 1 minute and 20 seconds, the node shut down and went into standby mode again.

I feel confident that this is a hardware issue and not related to the cluster.
I am going to try pulling the RAM and re-seating it.

I think it would be strange for this to be a power supply issue, since the failing nodes are scattered across 3 different UPSs. I have another cluster of the exact same age, software config, hardware config, chassis, UPS, etc. (except that it uses Intel Xeon processors instead of AMD Opteron), and it has no issues at all. I also have a third cluster, half of it with even older hardware and half with newer hardware.
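
As a quick sanity check on that reasoning, the failing nodes can be tallied per UPS using only the node list and UPS grouping given above. This is just an illustrative sketch: the names `UPS_GROUPS` and `FAILING` and the "UPS 1/2/3" labels are made up here, and the bare "11" in the node list is assumed to mean n11.

```python
# Tally the failing nodes per UPS, using the grouping described in this post.
# UPS_GROUPS, FAILING, and the "UPS 1/2/3" labels are illustrative names only.
UPS_GROUPS = {
    "UPS 1 (n0-n4)": range(0, 5),
    "UPS 2 (n5-n9)": range(5, 10),
    "UPS 3 (n10-n14)": range(10, 15),
}
FAILING = {3, 4, 6, 9, 10, 11, 13, 14}


def failures_per_ups(groups, failing):
    """Return {ups_label: list of failing node names on that UPS}."""
    return {
        label: [f"n{i}" for i in nodes if i in failing]
        for label, nodes in groups.items()
    }


if __name__ == "__main__":
    for label, nodes in failures_per_ups(UPS_GROUPS, FAILING).items():
        print(f"{label}: {len(nodes)} failing -> {', '.join(nodes)}")
```

Every UPS has failing nodes on it (2, 2, and 4 respectively), and every UPS also has healthy ones, so no single UPS accounts for the pattern.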

Has anyone ever run into a problem like this and could anyone advise me on the proper steps I should take to diagnose this?
Right now I am leaning towards purchasing a new power supply anyway and trying it out on one of the failing nodes.

This has been going on since before I took over these clusters (about a year ago). I asked the previous sysadmin multiple times for details and for the troubleshooting he had done, and I could never get a straight answer/story from him. Oh, and documentation? pfft. I was told to boot the nodes up and keep using them until they died, and then keep rebooting them.
 
01-21-2019, 09:43 AM   #2
riped01
LQ Newbie
 
Registered: Aug 2018
Location: US
Distribution: CentOS 6.9
Posts: 15

Original Poster
Rep: Reputation: Disabled
I've swapped around RAM and tested it. I've removed all but 1 stick of RAM and swapped out that RAM for brand new RAM. I've taken a power supply from a working node and swapped it with n3. The problem still persists. Could this be a thermal issue?
I've attached some images (as a PDF) of the BIOS logs. n3 shut down and went into standby between events 4 and 5. The only clue is that, before n3 shut down, sensor number 49 triggered the event. But I do not know how to correlate a sensor number from the BIOS with a sensor on the board.

Here is a PDF that has the images in it: https://docdro.id/dY4hm93
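
One possible way to map a sensor number to a name, assuming the node has a BMC that speaks IPMI and that `ipmitool` is installed (neither is confirmed in this thread): `ipmitool sdr elist` prints one sensor per line, with the sensor number in hex in the second column. The sketch below parses that kind of output and looks up a sensor number. Because it is not clear whether the BIOS shows 49 in decimal or in hex, it checks both readings. The sample text and sensor names are made up for illustration and are not output from n3; `parse_sdr_elist` and `candidates` are hypothetical helper names.

```python
def parse_sdr_elist(text):
    """Map sensor number (int) -> (name, status, reading) from `ipmitool sdr elist` text."""
    sensors = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Expected columns: name | number in hex (e.g. "31h") | status | entity | reading
        if len(fields) < 5 or not fields[1].lower().endswith("h"):
            continue
        try:
            number = int(fields[1][:-1], 16)
        except ValueError:
            continue
        sensors[number] = (fields[0], fields[2], fields[4])
    return sensors


def candidates(sensors, bios_number):
    """Look the BIOS number up twice: read as decimal and read as hex."""
    as_decimal = int(str(bios_number), 10)
    as_hex = int(str(bios_number), 16)
    return {
        f"decimal {as_decimal} ({as_decimal:02x}h)": sensors.get(as_decimal),
        f"hex {bios_number}h": sensors.get(as_hex),
    }


# Made-up sample in the `sdr elist` layout; real names and numbers will differ.
SAMPLE = """\
CPU0 Temp        | 30h | ok  |  3.1 | 45 degrees C
Example Sensor A | 31h | ok  |  7.1 | 38 degrees C
Example Sensor B | 49h | ok  | 10.1 | 12.10 Volts
"""

if __name__ == "__main__":
    for label, match in candidates(parse_sdr_elist(SAMPLE), 49).items():
        print(label, "->", match)
```

On a real node, the text would come from saving the output of `ipmitool sdr elist` to a file and passing its contents to `parse_sdr_elist` in place of `SAMPLE`.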
 
04-22-2019, 02:47 PM   #3
riped01
LQ Newbie
 
Registered: Aug 2018
Location: US
Distribution: CentOS 6.9
Posts: 15

Original Poster
Rep: Reputation: Disabled
The current 1U server PS are the "Delta 600w Switching Power Supply TDPS-600AB-B". I eventually caved in and purchased 2 power supplies from eBay to try. Unfortunately, I could only find the TDPS-600AB-A model on eBay. I contacted Delta Electronics to ask for a replacement and whether there is a difference between the ABB and the ABA. My request was too small for them to manufacture new PS, and the rep could not tell me what the difference between ABB and ABA is.

When the PS came in, I popped open the new ABA and my old ABB power supply (the one assumed to be the cause of this overall issue) to compare them. As far as I could tell, they are identical, except that on the two boards someone (I am assuming the manufacturer) wrote ABB (on the ABB unit) and ABA (on the ABA unit) with a marker. (I can post pictures if requested.)

I replaced the PS in n3 with the ABA PS and ran an expensive calculation on it for 48 hours. The node, which should have gone down within 10 minutes, survived those two days and gave me the correct answers from the calculation. I have since purchased more ABA PS and replaced the PS in all the nodes that were giving me trouble. The replacements took place between 03/01/2019 and 03/28/2019. To this day, I have not had a node go down for any reason. Perhaps some kind of arc or something killed <50% of the PS in the cluster?

Now, going from ABB to ABA did introduce a few oddities. Two of them are:
1) The HDD indicator LEDs on the Altus 1650 chassis are all blinking red, which means that the HDD/RAID is rebuilding. However, the RAID is already built and calculations are running on it with no problem!

2) Every once in a while, a job hits a specific node and I get the following stream of messages on the terminal (here is an example from n0):

Quote:
Message from syslogd@n0 at Mar 18 15:38:20 ...
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

Message from syslogd@n0 at Mar 18 16:53:20 ...
[Hardware Error]: CPU:4 MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc5a4000fd080813

Message from syslogd@n0 at Mar 18 16:53:20 ...
[Hardware Error]: #011MC4_ADDR: 0x00000008dd2b99d0

Message from syslogd@n0 at Mar 18 16:53:20 ...
[Hardware Error]: Northbridge Error (node 1): DRAM ECC error detected on the NB.

I think this might be due to the PS, because this error message originally came from n3. I swapped the PS from n0 (ABB, working, perfect, no issues) with the one in n3 (ABA, from eBay). Now n3 doesn't give me these error messages, but n0 does. n6 is the other node that gives me this message. It looks like an ECC error, but I am not sure.
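
The `MC4_STATUS` value in the quoted log can be decoded by hand to check that reading. The sketch below tests the standard x86 machine-check status flag bits (63 down to 57) and, as an assumption based on how the Linux kernel prints this line for AMD processors of this era, treats bits 46 and 45 as the corrected and uncorrected ECC flags. The names `STATUS_BITS` and `decode_mc_status` are made up for this example.

```python
# Flag bits in an x86 machine-check status register (MCi_STATUS).
# Treating bits 46/45 as CECC/UECC is an assumption for AMD parts of this era.
STATUS_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # another error arrived before this one was cleared
    "UC": 61,     # uncorrected error (clear means corrected, "CE")
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
    "CECC": 46,   # error was corrected by ECC
    "UECC": 45,   # ECC error that could not be corrected
}


def decode_mc_status(status):
    """Return {flag_name: bool} for the flag bits listed in STATUS_BITS."""
    return {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}


if __name__ == "__main__":
    flags = decode_mc_status(0xDC5A4000FD080813)  # value from the quoted log
    for name, is_set in flags.items():
        print(f"{name:6s} {'set' if is_set else 'clear'}")
```

For this value, VAL, OVER, EN, MISCV, ADDRV, and CECC come out set, while UC, PCC, and UECC are clear. That agrees with the `[Over|CE|MiscV|-|AddrV|CECC]` summary in the log: a corrected ECC error rather than an uncorrected one, with the overflow flag suggesting more than one error was logged before the register was read.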


Technically this thread is solved. ^_^

Last edited by riped01; 04-22-2019 at 02:54 PM.
 
  



Tags
node, power, server, shutdown


All times are GMT -5.
