LinuxQuestions.org

ccj4467 07-15-2021 03:29 PM

Memory issues with Dell PowerEdge servers
 
Hi all, need a little advice

The project I am working on has a bunch of Dell PowerEdge servers. We mostly administer these servers remotely. One of the servers, an R330, consistently fails to reboot. Someone has to go to the server room and either power the system down manually or re-seat the DIMMs to get the server back up and running. Today another server, an R720, exhibited the same problem.

My question is, would taking the DIMMs out and cleaning all the contacts (DIMM and slot) correct this problem?

Has anyone else ever come across problems like this?

If it matters, we are running Ubuntu Server 20.04.

michaelk 07-15-2021 07:30 PM

As someone who has worked with electronics and avionics for a long time, I've seen some strange things happen and also illogical fixes...

Typically memory modules and their slots are gold plated so oxidation should not build up if manufactured properly. If the server room is air conditioned and kept at the proper humidity level something like this should not repeatedly happen if at all.

I have purchased many Dell computers over the years and have never had to clean the contacts. At least for me, the PSU dies or, typically, the capacitors on the motherboard go bad. By that time they have outlived their usefulness and it is time to buy the next used/refurbished unit.

computersavvy 07-15-2021 07:44 PM

Power supplies that are marginal can certainly show weird symptoms. Depending upon age it may be worth simply installing a new replacement power supply in one of the systems that is having the issue. If that fixes the issue then you can run for a lot longer on that system. If it does not fix it then you are not out a large investment and can try something else.

It is also worth inspecting the caps the next time the system is down just in case. Certain brands and styles are known to have failures that take down the system in strange ways. I don't remember the details, but in the late 90s and early 2000s there were some brands of motherboards known for cap failures that would take out the system.

jefro 07-15-2021 07:54 PM

The company I used to work for had a two fail rule. Fail once, get going. Fail twice, replace.

Handling the memory risks ESD damage, and the fault could be anything. I assume that the system is not going into a full power down. Instead of subjecting the system to abuse by playing with the RAM, try this next time: unplug the power supply(s), then press the power button a few times. Restore power and see if it powers up correctly.
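
Another check that avoids touching the RAM, sketched here under a few assumptions: the Linux kernel's EDAC driver is loaded for the platform and exposes per-controller error counters under /sys/devices/system/edac/mc. On some servers the firmware handles memory errors itself and these counters stay at zero or are absent, so an empty result does not prove the DIMMs are healthy. The function names are only illustrative.

```python
from pathlib import Path


def read_count(path):
    """Return the integer stored in a sysfs counter file, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} for each memory controller.

    Assumes the EDAC sysfs layout (mc0, mc1, ... each with ce_count and
    ue_count). Controllers missing either file are skipped.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        ce = read_count(mc / "ce_count")
        ue = read_count(mc / "ue_count")
        if ce is not None and ue is not None:
            counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    found = edac_counts()
    if not found:
        print("No EDAC counters found (driver not loaded or not supported).")
    for name, (ce, ue) in found.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected-error count that keeps climbing on one controller would point at a DIMM or slot rather than the power supply, but treat that as a hint, not a diagnosis.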

obobskivich 07-15-2021 10:57 PM

Quote:

Originally Posted by computersavvy (Post 6266912)
It is also worth inspecting the caps the next time the system is down just in case. Certain brands and styles are known to have failures that take down the system in strange ways. I don't remember the details, but in the late 90s and early 2000s there were some brands of motherboards known for cap failures that would take out the system.

This was/is known as the 'cap plague' and was an upstream manufacturing issue at some large Taiwanese suppliers. I forget exactly what went wrong, but it had to do with bad batches of electrolyte (it only affects electrolytic capacitors); I don't remember whether that was a chemistry mix-up or something environmental (e.g. being produced somewhere with different weather/humidity/etc). It affected a massive range of hardware more or less indiscriminately, as long as the hardware used caps from any of these suppliers (I forget which suppliers were on the list, though I believe Teapo was one of them). This is also why 'Japanese capacitors' have become a marketing point (those factories were largely unaffected), despite the issue being largely resolved in the mid-2000s. Wikipedia has an article about it: https://en.wikipedia.org/wiki/Capacitor_plague

Depending on the age of these servers, this is absolutely something to consider, but newer systems tend to have much less to worry about in terms of 'the plague.'

As far as cleaning the contacts on the RAM - I've seen PCIe graphics cards that refuse to engage/negotiate at the full x16 (and instead settle for x2 or x4 in an x16 slot), and after cleaning the card's contacts and blowing dust out of the motherboard slot, everything worked again. FWIW, I'd give it a try if it isn't too tedious to do (I know some servers can have silly numbers of individual DIMMs to deal with). Something else to consider if these are really 'big' servers: if the RAM and/or CPUs are on risers, those can come unseated or (presumably) need dust/debris blown out of their contacts too - I've seen a handful of Compaq ProLiant machines brought down just by risers being slightly unseated due to being moved.
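
For the PCIe case, a rough way to spot a card that has negotiated fewer lanes than it supports is to compare the current_link_width and max_link_width attributes that Linux exposes per device in sysfs. This is only a sketch with illustrative function names. Keep in mind that max_link_width is what the device itself supports, so an x16 card sitting in an x8 slot will also show up here without anything being dirty or unseated.

```python
from pathlib import Path


def read_width(path):
    """Return the link width stored in a sysfs attribute, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def degraded_links(root="/sys/bus/pci/devices"):
    """List (device, current, maximum) for devices linked below their maximum width.

    Devices without link-width attributes, or reporting a width of 0,
    are skipped.
    """
    results = []
    base = Path(root)
    if not base.is_dir():
        return results
    for dev in sorted(base.iterdir()):
        cur = read_width(dev / "current_link_width")
        mx = read_width(dev / "max_link_width")
        if cur is None or mx is None or cur == 0:
            continue
        if cur < mx:
            results.append((dev.name, cur, mx))
    return results


if __name__ == "__main__":
    for name, cur, mx in degraded_links():
        print(f"{name}: running x{cur}, capable of x{mx}")
```

Running it before and after cleaning or re-seating a card gives a quick before/after comparison.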

I also agree with jefro's suggestion and would add that I don't envy having to troubleshoot this.

ccj4467 07-16-2021 05:18 AM

Thank you all. It's been a while since I have really dealt with a lot of electronic stuff. All of your suggestions are very helpful and give me at least a path for troubleshooting the problems. The capacitors are a good one. I remember I had some Dell PowerEdge 1950 servers where a cap on the HBA for the internal drives would periodically go bad. I had a great electronics supplier close by, so replacing the caps was a breeze.

Thanks again. I am marking this solved.


All times are GMT -5.