LinuxQuestions.org


bytez 08-18-2007 08:39 PM

Cause of server crash
 
Hi folks,

Here is the scenario. I got a new dedicated server from softlayer.com 3 months ago, and it had been running perfectly ever since. Last week I decided to upgrade the memory from 2GB to 3GB. A couple of days later it crashed, and 3 days after that it crashed again. The first time, the server came up after a remote reboot. The second time, however, it wouldn't come up. The tech said the memory was not seated correctly, and it booted fine after he re-seated it. Then today, it crashed again. :confused:

The system specs:

Intel Xeon 3060
3GB ECC DDR memory
2x250GB SATAII HDD
CentOS 4.5 32-bit with Cpanel 10
Kernel 2.6.9-55.ELsmp #1 SMP Fri Apr 20 17:03:35 EDT 2007 i686 i686 i386 GNU/Linux

I suspect faulty memory. I noticed that swap usage is always greater than 0.5% before the crash. The swap partition is 2GB. Could the crashes have anything to do with swap usage? What about the kernel version: could anything in it cause a crash or hang? I thought that having 3GB of RAM was odd, since most servers have an even-numbered amount. Maybe I should upgrade to the latest stable CentOS kernel.
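
To put a number on the swap question, swap usage can be read from /proc/meminfo. What follows is only a minimal sketch, assuming the standard SwapTotal and SwapFree fields; the function names are made up for illustration and are not part of any existing tool. A small amount of swap in use is common on Linux and does not by itself point to a fault, so the figure is mainly useful for comparing readings over time.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_percent(meminfo):
    """Return swap in use as a percentage of total swap (0.0 if no swap)."""
    total = meminfo.get("SwapTotal", 0)
    free = meminfo.get("SwapFree", 0)
    if total == 0:
        return 0.0
    return 100.0 * (total - free) / total


def read_swap_used_percent(path="/proc/meminfo"):
    """Hypothetical convenience wrapper: read the live value on a Linux box."""
    with open(path) as f:
        return swap_used_percent(parse_meminfo(f.read()))
```

Calling read_swap_used_percent() from cron every few minutes and logging the result would show whether swap use actually climbs before a crash or just sits at the same low level.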

I asked the tech to swap out the RAM and test the memory. If you guys have any ideas, please post them here. Thanks in advance.

ak_random 08-19-2007 12:47 AM

I'd try the simplest thing first: take out the new memory and see if that restores system stability.

macemoneta 08-19-2007 01:03 AM

You have ECC memory, but it sounds like you don't have ECC enabled in your BIOS (otherwise bit errors would be corrected). Does your motherboard have chipkill functionality (to remove an entire RAM chip in the event of multiple failures)? If so, is the memory compatible and the function enabled? Have you confirmed that the memory is in fact ECC capable?

Boot memtest86 to check your RAM.
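
If the running kernel exposes EDAC counters, they give a second way to see whether ECC is active and catching errors while the server is up. The snippet below is a rough sketch, assuming the EDAC sysfs layout under /sys/devices/system/edac/mc with ce_count and ue_count files per memory controller; an older kernel such as 2.6.9 may not ship an EDAC driver at all, in which case the directory will be missing and the function returns an empty result. The function name is hypothetical.

```python
import glob
import os


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs files.

    Controllers whose counter files are missing or unreadable are skipped,
    so an empty dict means no usable EDAC data was found under `base`.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(base, "mc*"))):
        try:
            with open(os.path.join(mc_dir, "ce_count")) as f:
                corrected = int(f.read().strip())
            with open(os.path.join(mc_dir, "ue_count")) as f:
                uncorrected = int(f.read().strip())
        except (IOError, OSError, ValueError):
            continue
        counts[os.path.basename(mc_dir)] = (corrected, uncorrected)
    return counts
```

A corrected count that keeps rising suggests ECC is working and a stick is going bad; an empty result says nothing either way, so memtest86 is still the more reliable check.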

bytez 08-19-2007 01:30 AM

Quote:

Originally Posted by macemoneta (Post 2863481)
You have ECC memory, but it sounds like you don't have ECC enabled in your BIOS (otherwise bit errors would be corrected). Does your motherboard have chipkill functionality (to remove an entire RAM chip in the event of multiple failures)? If so, is the memory compatible and the function enabled? Have you confirmed that the memory is in fact ECC capable?

Boot memtest86 to check your RAM.

Well, the server was running flawlessly for 3 months prior to the addition of the 2x512MB sticks. Do you need to configure the BIOS when you add or remove memory from the server? Maybe the tech disabled ECC by accident, hmmm. I'm not sure about chipkill; it's a SuperMicro server.

The tech replaced all 4 sticks of memory. I hope it won't crash again. :(


All times are GMT -5.