I have a Red Hat 4.0 AMD dual-Opteron server that started spontaneously rebooting at exact two-week intervals a few months ago.
I've actually sat in front of it and watched it reboot. I had top running and saw no unusual process activity; it just rebooted instantly without going through any shutdown. For this reason I think it's hardware-related, not software.
I don't have any crons running that would explain this.
At first I thought it was related to reaching exactly 14 days of uptime, but that is not the case. It reboots at 6:30pm every other Thursday regardless of uptime.
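One way to separate the two hypotheses above (a reboot at a fixed uptime versus a reboot on a fixed calendar schedule) is to compare the boot times themselves. A minimal sketch, assuming the boot timestamps have already been collected by hand, for example from `last reboot` output; the function name and the sample dates are made up for illustration:

```python
from datetime import datetime, timedelta

def on_fixed_schedule(boots, period=timedelta(days=14),
                      tolerance=timedelta(minutes=10)):
    """Return True if every boot lies near a whole multiple of `period`
    counted from the earliest boot.

    A crash tied to uptime would drift whenever the machine is rebooted
    by hand in between; a calendar-driven event would not.
    """
    if not boots:
        return True
    boots = sorted(boots)
    first = boots[0]
    for boot in boots[1:]:
        offset = (boot - first) % period       # remainder after whole periods
        drift = min(offset, period - offset)   # distance to nearest multiple
        if drift > tolerance:
            return False
    return True

# Hypothetical boot times: Thursdays around 6:30pm, one cycle missing.
sample = [
    datetime(2006, 1, 5, 18, 30),
    datetime(2006, 1, 19, 18, 31),
    datetime(2006, 2, 16, 18, 30),
]
print(on_fixed_schedule(sample))  # True
```

If a manual reboot in the middle of a cycle does not shift the next crash, the cause is more likely something on its own clock than anything counting uptime.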
It's on an APC battery backup; could that be it? It also has a Silicon Image SATA RAID card.
I'm really going crazy over this one. I've searched everything I can and have come up with a big zero.
Thanks in advance for your help.
Tom
Here's some hardware info....
[root@bigbob ~]# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 2590.363
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 pni syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni
bogomips : 5183.34
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 2590.363
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 pni syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni
bogomips : 5177.81
OK, I just posted that message and got to thinking about that APC Back-UPS 1000 unit, did a little searching, and came across this in their manual:
"Self-test
The UPS performs a self-test automatically when
turned “On”, and every two weeks thereafter (by
default). Automatic self-test eliminates the need
for periodic manual self-tests.
During the self-test, the UPS briefly operates the
loads on-battery. If the UPS passes the self-test,
it returns to on-line operation."
It's supposed to come up with a 'bad battery' indicator if it fails this test, and that isn't happening, but still this would explain a lot.
I'll take it off the APC this weekend and then next Thursday evening I'll find out if that was it. My guess is that it is. Grrrrr, I swear this UPS has caused more downtime than it's prevented.
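To know which evenings to watch, the expected self-test times can be projected forward from the last observed reboot. This is only a sketch: it assumes the interval is exactly 14 days (the manual says "every two weeks" by default) and that the last reboot coincided with a self-test. The function name and the starting date are hypothetical.

```python
from datetime import datetime, timedelta

def upcoming_self_tests(last_event, count=3, interval=timedelta(days=14)):
    """Return the next `count` expected self-test times after `last_event`."""
    return [last_event + interval * n for n in range(1, count + 1)]

# Hypothetical last reboot: a Thursday at 6:30pm.
for when in upcoming_self_tests(datetime(2006, 1, 5, 18, 30)):
    print(when.strftime("%A %Y-%m-%d %H:%M"))
```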
The first thing I'd do is open the case, blow it out thoroughly, and make sure all the fans are functioning correctly. Overheating is the single most common cause for "spontaneous reboots".
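Alongside the physical check, temperatures can be read in software where the kernel exposes them. The sketch below assumes the sysfs thermal interface under `/sys/class/thermal`, which exists on reasonably modern kernels but may well be missing on a system as old as the one in this thread; in that case a tool such as lm_sensors would be the usual alternative. The function name is made up for illustration.

```python
from pathlib import Path

def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone.

    Each thermal_zone*/temp file holds the temperature in millidegrees C.
    """
    root = Path(base)
    if not root.is_dir():
        return {}  # interface not available on this kernel
    readings = {}
    for temp_file in sorted(root.glob("thermal_zone*/temp")):
        try:
            millideg = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            continue  # zone exists but cannot be read; skip it
        readings[temp_file.parent.name] = millideg / 1000.0
    return readings

if __name__ == "__main__":
    for zone, temp in read_thermal_zones().items():
        print(f"{zone}: {temp:.1f} C")
```

Logging these readings every minute or so and looking at the last entries before a crash would show whether heat is involved at all.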
Outside of that, however, there are countless reasons it *could* be failing, including the CPU, motherboard, power supply and/or one or more peripherals (especially on-board peripherals, like an on-board graphics controller, NIC, sound or SATA controller).
You should also find and run comprehensive diagnostics on the entire system: you'd be surprised what might turn up! I did this yesterday on my Mom's Dell ... which happens to ship with a built-in diagnostics partition.
"Self-test
The UPS performs a self-test automatically when
turned “On”, and every two weeks thereafter (by
default). Automatic self-test eliminates the need
for periodic manual self-tests...
...and eliminates the need for a large uptime counter, by the sounds of things.
It is possible that the output of the UPS just suffers a glitch at switch-over, but that would hardly be confidence-inspiring (it's going to do the same when it's used in anger as when it's just practising), and I'm not sure how you quantify whether the glitch is within limits or not. Certainly, if your server power supply is on the edge, it is going to be more susceptible to this than if it's comfortably rated.
Quote:
Grrrrr, I swear this UPS has caused more downtime than it's prevented.
It is a disturbingly common experience that UPSs (and RAID arrays) are bought to make one type of failure go away and end up causing more of another. Some years ago I used to have a generously rated APC on my desktop, and it worked well once it was configured, but I had to go through half a dozen cycles of testing the setup before I reached that point. Prior to that, it would always shut off prematurely because one parameter or another was set non-optimally in the driver, and it was all too easy to think, every time you found something badly configured, that it was the only thing badly configured.
You could argue, in that case, for going to 'run until you've got no more battery capacity' mode; you probably wouldn't want to do that on a server, but at least that didn't do the irritating thing of shutting down with lots of power left! Of course, if the default config had been close... but that's another matter.
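To illustrate how a single mis-set parameter can cause a shutdown with plenty of battery left: some UPS monitoring daemons shut the host down as soon as any one of several thresholds is crossed. The sketch below is a made-up model of that kind of logic, not the code or configuration of any particular driver; every name and default value in it is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ShutdownPolicy:
    min_battery_pct: float = 5.0    # shut down at or below this charge level
    min_runtime_min: float = 3.0    # ...or at or below this estimated runtime
    max_on_battery_s: float = 0.0   # ...or after this long on battery (0 disables)

def shutdown_reasons(policy, battery_pct, runtime_min, on_battery_s):
    """Return the thresholds that were crossed; an empty list means keep running."""
    reasons = []
    if battery_pct <= policy.min_battery_pct:
        reasons.append("battery level")
    if runtime_min <= policy.min_runtime_min:
        reasons.append("remaining runtime")
    if policy.max_on_battery_s > 0 and on_battery_s >= policy.max_on_battery_s:
        reasons.append("time on battery")
    return reasons

# A too-short time limit forces a shutdown at 95% charge.
policy = ShutdownPolicy(max_on_battery_s=30)
print(shutdown_reasons(policy, battery_pct=95.0, runtime_min=40.0, on_battery_s=45))
```

Because any one condition is enough, each threshold has to be checked against a real discharge test; fixing one does nothing about the others.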
A UPS for servers should be an in-line type. You should never use stand-by. A stand-by type is not a UPS; it is a false UPS. A UPS with AVR can in some cases cause more harm than good. Probably a better way to go, and to be in more control, is to stay off the grid. Google has their own UPS that they constructed.
A UPS really does not protect a computer because it is another device that can go wrong. A surge suppressor should be used after a UPS to protect the computer from the UPS when the UPS sends a spike out to its outputs.
In your case, you probably have a defective UPS or the UPS is having problems being stable with your mains.
FYI, Silicon Image controllers are not RAID. They are just storage controllers. Your 3ware controller is a RAID controller.
Yep, it was the battery backup. It was doing a self-test every 2 weeks at exactly 6:30pm and interrupting the power to the server.
Thanks for the ideas and suggestions.
I'm going UPS-less for now; they are more trouble than they're worth. Besides, if I were worried about 'brown-outs' or other power interruptions horking the data on my disks, I've just had 9 or 10 hard reboots and the disks and data have come back fine each time.