LinuxQuestions.org
Old 04-27-2009, 04:11 PM   #1
carlovee
LQ Newbie
 
Registered: Apr 2009
Posts: 5

Rep: Reputation: 0
Spontaneous server reboots every 2 weeks - Solved


Hi all,

I have a Red Hat 4.0 server with dual AMD Opterons that started spontaneously rebooting at exact two-week intervals a few months ago.

I've actually sat in front of it and watched it reboot. I had top running and didn't see any process activity; it just rebooted instantly without going through any shutdown. For this reason I think it's hardware-related, not software.

I don't have any cron jobs running that would explain this.

At first I thought it was related to reaching exactly 14 days of uptime, but that is not the case. It reboots at 6:30pm every other Thursday, regardless of uptime.
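
For anyone wanting to check the same thing on their own box, here is a rough Python sketch of the reasoning above. If the trigger were 14 days of uptime, any manual restart in between would shift the later reboots; a wall-clock trigger keeps them on a fixed two-week grid. The timestamps are made up for illustration (they could be collected from something like `last reboot`), and the function names are hypothetical.

```python
from datetime import datetime, timedelta

TWO_WEEKS = timedelta(days=14)
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def on_fixed_schedule(reboots, period=TWO_WEEKS):
    """True if every reboot is a whole number of periods after the earliest one."""
    first = min(reboots)
    return all((t - first) % period == timedelta(0) for t in reboots)


def weekday_and_time(t):
    """Return the weekday name and HH:MM of a datetime, e.g. ('Thursday', '18:30')."""
    return DAY_NAMES[t.weekday()], t.strftime("%H:%M")


# Made-up example timestamps, NOT read from the server in this thread.
reboots = [
    datetime(2009, 3, 12, 18, 30),
    datetime(2009, 3, 26, 18, 30),
    datetime(2009, 4, 9, 18, 30),
    datetime(2009, 4, 23, 18, 30),
]

if __name__ == "__main__":
    print("fixed two-week grid:", on_fixed_schedule(reboots))
    print("slots seen:", {weekday_and_time(t) for t in reboots})
```

If the unexplained reboots all land on the same weekday and time and stay on the grid even after a manual restart, the cause is more likely something with its own clock than anything tied to the server's uptime.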

It's on an APC battery backup; could that be it? It also has a Silicon Image SATA RAID card.

I'm really going crazy over this one. I've searched everything I can and have come up with a big zero.

Thanks in advance for your help.

Tom


Here's some hardware info....
[root@bigbob ~]# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 2590.363
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 pni syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni
bogomips : 5183.34

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 2590.363
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 pni syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni
bogomips : 5177.81

[root@bigbob ~]# lspci
00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
00:07.2 SMBus: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 (rev 02)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:01.0 RAID bus controller: 3ware Inc 7xxx/8xxx-series PATA/SATA-RAID (rev 01)
02:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
02:09.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
03:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
03:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
03:05.0 Mass storage controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
03:06.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
03:08.0 Ethernet controller: Intel Corporation 82557/8/9/0/1 Ethernet Pro 100 (rev 10)

Last edited by carlovee; 05-07-2009 at 05:33 PM.
 
Old 04-27-2009, 04:30 PM   #2
carlovee
LQ Newbie
 
Registered: Apr 2009
Posts: 5

Original Poster
Rep: Reputation: 0
maybe solved...

OK, I just posted that message and got to thinking about that APC Back-UPS 1000 unit, did a little searching, and came across this in the manual:

"Self-test
The UPS performs a self-test automatically when
turned “On”, and every two weeks thereafter (by
default). Automatic self-test eliminates the need
for periodic manual self-tests.
During the self-test, the UPS briefly operates the
loads on-battery. If the UPS passes the self-test,
it returns to on-line operation."

It's supposed to come up with a 'bad battery' indicator if it fails this test, and that isn't happening, but this would still explain a lot.

I'll take it off the APC this weekend, and then next Thursday evening I'll find out if that was it. My guess is that it is. Grrrrr, I swear this UPS has caused more downtime than it's prevented.
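
Since the manual says the self-test runs at turn-on and every two weeks after that, the schedule follows the UPS's own power-on time rather than the server's uptime. A small Python sketch for projecting the next test windows, assuming the default 14-day interval; the "last test" timestamp below is a hypothetical example, not a logged value, and `upcoming_self_tests` is just an illustrative name.

```python
from datetime import datetime, timedelta


def upcoming_self_tests(last_test, now, count=3, interval=timedelta(days=14)):
    """Return the next `count` self-test times that fall strictly after `now`."""
    elapsed_intervals = (now - last_test) // interval  # whole intervals already passed
    next_test = last_test + (elapsed_intervals + 1) * interval
    return [next_test + i * interval for i in range(count)]


if __name__ == "__main__":
    # Hypothetical: assume the last suspicious reboot was Thu 23 Apr 2009, 18:30.
    last_test = datetime(2009, 4, 23, 18, 30)
    now = datetime(2009, 4, 27, 16, 30)
    for t in upcoming_self_tests(last_test, now):
        print(t.strftime("%a %d %b %Y %H:%M"))
```

Being in front of the machine (or having it off the UPS) at the first projected time is a cheap way to confirm or rule out the self-test.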

I'll update this once I know for sure.

cheers,
Tom
 
Old 04-27-2009, 04:33 PM   #3
paulsm4
LQ Guru
 
Registered: Mar 2004
Distribution: SusE 8.2
Posts: 5,863
Blog Entries: 1

Rep: Reputation: Disabled
Hi -

The first thing I'd do is open the case, blow it out thoroughly, and make sure all the fans are functioning correctly. Overheating is the single most common cause for "spontaneous reboots".

Outside of that, however, there are countless reasons it *could* be failing: including CPU, motherboard, power supply and/or one or more peripherals (especially on-board peripherals, like an on-board graphics controller, NIC, sound or SATA controller).

You should also find and run comprehensive diagnostics on the entire system: you'd be surprised what might turn up! I did this yesterday on my Mom's Dell ... which happens to ship with a built-in diagnostics partition.

Here are a couple of links:
http://ask.slashdot.org/article.pl?sid=05/08/02/165231
http://www.linux.com/feature/55366

'Hope that helps .. PSM
 
Old 04-28-2009, 01:41 PM   #4
salasi
Senior Member
 
Registered: Jul 2007
Location: Directly above centre of the earth, UK
Distribution: SuSE, plus some hopping
Posts: 4,070

Rep: Reputation: 897
Quote:
Originally Posted by carlovee

"Self-test
The UPS performs a self-test automatically when
turned “On”, and every two weeks thereafter (by
default). Automatic self-test eliminates the need
for periodic manual self-tests...
...and eliminates the need for a large uptime counter, by the sounds of things

It is possible that the output of the UPS just suffers a glitch at switchover, but that would hardly be confidence-inspiring (it's going to do the same when it's used in anger as when it's just practising), and I'm not sure how you quantify whether the glitch is within limits or not. Certainly, if your server power supply is on the edge, it is going to be more susceptible to this than if it's comfortably rated.
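
One rough way to put numbers on "within limits" is to compare the UPS's transfer time (from its data sheet) with the power supply's hold-up time, i.e. how long its bulk capacitor can carry the load with no input. The Python sketch below uses the stored-energy formula E = ½·C·(V_start² − V_min²). Every figure in it is a placeholder assumption, not a measurement of the APC unit or of this server's supply, and the function names are hypothetical.

```python
def holdup_seconds(capacitance_f, v_start, v_min, load_w, efficiency=1.0):
    """Estimate how long a bulk capacitor carries `load_w` while sagging from v_start to v_min."""
    usable_joules = 0.5 * capacitance_f * (v_start ** 2 - v_min ** 2)
    # The converter draws load_w / efficiency from the capacitor.
    return usable_joules * efficiency / load_w


def rides_through(holdup_s, transfer_s):
    """True if the supply should survive a gap of `transfer_s` seconds."""
    return holdup_s > transfer_s


if __name__ == "__main__":
    # Placeholder values only: 470 uF bulk cap, 380 V sagging to 300 V, 80% efficient.
    for load in (150.0, 400.0):
        t = holdup_seconds(470e-6, 380.0, 300.0, load, efficiency=0.8)
        print(f"{load:.0f} W load: about {t * 1000:.0f} ms of hold-up")
```

The estimate shrinks as the load goes up, which is the "on the edge" effect: the same switchover glitch that a lightly loaded supply shrugs off can drop a heavily loaded one.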

Quote:
Grrrrr, I swear this UPS has caused more downtime than it's prevented.
It is a disturbingly common experience that UPSs (and RAID arrays) are bought to make one type of failure go away and end up causing more of another. Some years ago I used to have a generously rated APC on my desktop, and it worked well once it was configured, but I had to go through half a dozen cycles of testing the setup before I reached that point. Prior to that, it would always shut off prematurely due to one parameter or another being set non-optimally in the driver, and it was all too easy to think, every time you found something badly configured, that it was the only thing badly configured.

You could argue, in that case, for going to 'run until you've got no more battery capacity' mode; you probably wouldn't want to do that on a server, but at least that didn't do the irritating thing of shutting down with lots of power left! Of course, if the default config had been close... but that's another matter.
 
Old 04-28-2009, 03:01 PM   #5
Electro
LQ Guru
 
Registered: Jan 2002
Posts: 6,042

Rep: Reputation: Disabled
A UPS for servers should be an in-line type. You should never use a stand-by type; a stand-by unit is not a true UPS. A UPS with AVR can in some cases do more harm than good. Probably a better way to go, with more control, is staying off the grid. Google has their own UPS that they constructed.

A UPS really does not protect a computer because it is another device that can go wrong. A surge suppressor should be used after a UPS to protect the computer from the UPS when the UPS sends a spike out to its outputs.

In your case, you probably have a defective UPS or the UPS is having problems being stable with your mains.

FYI, Silicon Image controllers are not RAID. They are just storage controllers. Your 3ware controller is a RAID controller.
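
The lspci output in the first post shows the same split: the 3ware card reports itself as a "RAID bus controller", the SiI 3114 as a "Mass storage controller". A small Python sketch that pulls those class labels out of lspci-style lines (the sample lines are copied from that post; the function names are made up). Keep in mind the label only reflects the PCI class the device advertises, so treat it as a hint rather than proof of hardware RAID.

```python
def parse_lspci_line(line):
    """Split 'SLOT CLASS: DEVICE' into its three parts."""
    slot, rest = line.strip().split(" ", 1)
    dev_class, device = rest.split(": ", 1)
    return slot, dev_class, device


def storage_controllers(lines, keywords=("raid", "storage")):
    """Return (slot, class) for devices whose class label mentions a keyword."""
    found = []
    for line in lines:
        slot, dev_class, _device = parse_lspci_line(line)
        if any(k in dev_class.lower() for k in keywords):
            found.append((slot, dev_class))
    return found


# Lines copied from the lspci output earlier in this thread.
sample = [
    "01:01.0 RAID bus controller: 3ware Inc 7xxx/8xxx-series PATA/SATA-RAID (rev 01)",
    "03:05.0 Mass storage controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)",
    "03:06.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)",
]

if __name__ == "__main__":
    for slot, dev_class in storage_controllers(sample):
        print(slot, "->", dev_class)
```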
 
Old 05-07-2009, 05:45 PM   #6
carlovee
LQ Newbie
 
Registered: Apr 2009
Posts: 5

Original Poster
Rep: Reputation: 0
Yep, it was the battery backup. It was doing a self-test every 2 weeks at exactly 6:30pm and interrupting the power to the server.

Thanks for the ideas and suggestions.

I'm going UPS-less for now; they are more trouble than they're worth. Besides, if I was worried about 'brown-outs' or other power interruptions horking the data on my disks, I've just had 9 or 10 hard reboots and the disks/data have come back fine each time.

thanks again,
Tom
 
  

