LinuxQuestions.org
Old 04-05-2013, 01:37 PM   #1
rnturn
Member
 
Registered: Jan 2003
Location: Illinois (Chicago area)
Distribution: Red Hat (8.0), SuSE (10.x, 11.x, 12.2, 13.2), Solaris (8-10), Tru64
Posts: 982

Rep: Reputation: 53
System powering itself off? PS problem? MB problem?


I have a system that seems to like to power itself off at seemingly random times. It often will go for a month (or longer) without any problems but then it may power itself off daily for several days or even multiple times in a day. It's now been over 10 days since the last spontaneous power down. (Oh do I hate intermittent hardware problems.)

The system acts as though it had acted on a "shutdown -h now" command but there are no log file entries that indicate that it's a case where the kernel has detected something serious and decided to shut itself down. A while back I wound up creating a cron job that writes a message into /var/log/messages so I can narrow down the time that the system went down to a five minute window. Based on what I've seen in the log file, there's no rhyme or reason to the times when the system decides to shut off.
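A minimal sketch of the kind of breadcrumb job described above, assuming Python is available on the machine. The post does not show the actual cron job, so the "HEARTBEAT" marker, the "heartbeat" program tag, and the script path in the crontab line are all hypothetical. Which file the line lands in depends on how the syslog daemon is configured.

```python
import syslog
import time


def heartbeat_message(now=None):
    """Build the breadcrumb text. HEARTBEAT is a made-up marker for grepping."""
    if now is None:
        now = time.time()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "HEARTBEAT system alive at %s" % stamp


def main():
    # The "heartbeat" tag is hypothetical; the destination file is decided
    # by the syslog daemon's configuration, not by this script.
    syslog.openlog("heartbeat")
    syslog.syslog(syslog.LOG_INFO, heartbeat_message())
    syslog.closelog()


if __name__ == "__main__":
    main()
```

A crontab entry along the lines of "*/5 * * * * /usr/bin/python /usr/local/bin/heartbeat.py" (path hypothetical) would give the five-minute window mentioned above: the last breadcrumb before a gap marks the earliest the system could have gone down.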

Background on the system: It's running a Gigabyte GA-965P-S3 motherboard and the power supply is from Antec (it's one of the "green" power supplies). The OS is a fairly old version of SuSE and is definitely due for an upgrade. (I'm holding off on an OS upgrade until I get a better handle on what's causing the system to power down randomly. I'm not keen on changing too many things at a time.)

I have several theories as to what might be happening:
  • the system's power supply might be sensitive to humidity levels and shuts itself off. The area where the system is located is not hot. (It is comfortable for humans.) All the system fans have been vacuumed and there shouldn't be anything blocking air flow that would cause the PS to overheat. It is still heating season and the humidity can vary a lot which makes me wonder if there's any correlation to low humidity and the outages.
  • the motherboard might be in the initial stages of failing. I've done a brief check of the motherboard and I didn't see anything obvious like bulging capacitors.
  • the system is running our mail server (Courier+Postfix) and I had at one time thought that a mischievous emailer might have exploited some bug and was shutting the system down. (Mail logs don't seem to show anything arriving just before the shutdowns so I now figure this is a highly unlikely cause.)
Feel free to offer your own theories. I'm listening.

Any suggestions are most welcome...

TIA...

--
Rick
 
Old 04-05-2013, 02:00 PM   #2
pan64
Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 5,146

Rep: Reputation: 1364
Have you checked the temp sensors? Or the fan RPM limits? You should not say something like "there shouldn't be anything blocking air flow that would cause the PS to overheat" but check if it was true! If it looks like a normal shutdown, you can probably insert some script into /etc/rcX.d to log processes, resources, sensors or whatever...
 
Old 04-11-2013, 11:32 AM   #3
rnturn
Member
 
Registered: Jan 2003
Location: Illinois (Chicago area)
Distribution: Red Hat (8.0), SuSE (10.x, 11.x, 12.2, 13.2), Solaris (8-10), Tru64
Posts: 982

Original Poster
Rep: Reputation: 53
Quote:
Originally Posted by pan64
You should not say something like "there shouldn't be anything blocking air flow that would cause the PS to overheat" but check if it was true!
Uh... I did that. Vacuuming all of the air intakes and the fan is about all one can do. Well, short of removing the PS, opening it up (violating any warranty that might still be in effect), and doing even more vacuuming.

Quote:
if it looks like a normal shutdown you can probably insert some script into /etc/rcX.d to log processes, resources, sensors or whatever...
If a normal shutdown had been followed there would have been messages in /var/log/messages telling me that the PostgreSQL database had been successfully shut down. There weren't so I eliminated normal (but unauthorized) shutdowns as a possibility.

Plus -- and I didn't mention this in the earlier post -- once the system is restarted, all disk partitions end up going through a journal recovery, which would not happen had the system been shut down normally.

I haven't enabled any temperature or fan sensors on the system. The Linux on there is old enough that sensor setup required a lot of manual configuration (unlike more recent releases), and when the system was built I wasn't in a position to spend a lot of time fiddling with the arcane sensors.conf configuration syntax. When the man page for "sensors.conf" told me to hunt for information under "/proc" that didn't even exist (for example, there is no "/proc/sys/dev/sensors"), I figured that the sensors package wasn't really ready for prime time yet and was going to be more trouble than I needed at the time.

I know that more recent releases handle that setup more easily (pretty much automatic for most hardware, at least as far as I've seen for motherboards) and, once I can decide whether the motherboard is healthy, I expect to get all that temperature and fan sensor goodness when I do the OS upgrade. Bottom line: I'm still looking for some indication as to what hardware may be at fault. (I can't afford to make a wholesale replacement of the system at this time.)
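On the missing "/proc/sys/dev/sensors": that path was the interface used with 2.4-series kernels, while 2.6 and later kernels expose hardware-monitoring readings through sysfs instead. The sketch below reads temperatures straight from sysfs without going through sensors.conf. It assumes a hwmon driver for the board's monitoring chip is already loaded (nothing in this thread confirms which driver that would be), and it tries two directory layouts because the location of the files has varied between kernel versions.

```python
import glob
import os


def read_millidegrees(path):
    """Return the reading in degrees Celsius, or None if it cannot be read."""
    try:
        with open(path) as handle:
            # hwmon temp*_input files hold millidegrees Celsius
            return int(handle.read().strip()) / 1000.0
    except (IOError, OSError, ValueError):
        return None


def collect_temperatures(root="/sys/class/hwmon"):
    """Map each readable temp*_input file under root to its value in Celsius."""
    patterns = [
        os.path.join(root, "hwmon*", "temp*_input"),
        # older kernels put the attribute files one level down
        os.path.join(root, "hwmon*", "device", "temp*_input"),
    ]
    readings = {}
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            value = read_millidegrees(path)
            if value is not None:
                readings[path] = value
    return readings


if __name__ == "__main__":
    for path, celsius in sorted(collect_temperatures().items()):
        print("%s: %.1f C" % (path, celsius))
```

If the script prints nothing, no temperature files were found, which usually means no monitoring driver is loaded; that is a separate problem from sensors.conf syntax.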

There have been six more days of operation without a hiccup. (Figures that the system would quit misbehaving once I got fed up and started asking LQers for tips.)

--
Rick
 
Old 04-11-2013, 12:29 PM   #4
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
It's probably the PSU. Do you have another one to test with?
 
Old 04-11-2013, 02:35 PM   #5
haertig
Senior Member
 
Registered: Nov 2004
Distribution: Debian, Ubuntu, LinuxMint, Slackware, SysrescueCD
Posts: 2,032

Rep: Reputation: 309
I would recommend learning how to configure sensors.conf. You may think it is arcane, but it may tell you what is going on. Is your system under heavier load, meaning higher temps, right before the shutdown? You may be able to determine something about this using a modification of your data timestamping script that writes to /var/log/messages. Change that so it runs "uptime" instead, since part of the uptime output is the CPU load. Running this every five minutes may not catch a load spike that causes overheating. Running it more frequently may clog your logfile. I might try running it every minute, but configure the script to only write to the logfile if the load is above some XXX point you set.
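The "only write if the load is above some point" idea could be sketched as follows, assuming Python is on the box. The threshold value, the "LOADWATCH" marker, and the "loadwatch" tag are placeholders, not anything from the thread. The script reads the same load averages that "uptime" prints, via os.getloadavg().

```python
import os
import syslog

LOAD_THRESHOLD = 2.0  # assumed value; pick one that suits the machine


def should_log(load_1min, threshold=LOAD_THRESHOLD):
    """True when the 1-minute load average is at or above the threshold."""
    return load_1min >= threshold


def format_load(loads):
    """Render the 1-, 5- and 15-minute load averages as one log line."""
    return "LOADWATCH load averages: %.2f %.2f %.2f" % tuple(loads)


def main():
    loads = os.getloadavg()
    if should_log(loads[0]):
        # Tag and priority are illustrative; the syslog daemon decides
        # which file the message is written to.
        syslog.openlog("loadwatch")
        syslog.syslog(syslog.LOG_WARNING, format_load(loads))
        syslog.closelog()


if __name__ == "__main__":
    main()
```

Run from cron once a minute, this stays silent while the machine is idle. Note that the 1-minute load average is itself a smoothed figure, so a spike lasting only a few seconds may still slip under the threshold.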

I have a system that recently started shutting down unexpectedly after having been fine for years. It was a CPU heat problem. The fans were clean and running fine, but I believe it was a problem with the thermal-paste-to-heatsink interface. It probably had a heat issue since day one, or maybe the thermal paste just deteriorated over time, but because it had been used only for light duty stuff for years, the symptoms of high heat never appeared. When I started doing more heavy duty stuff on it (transcoding video to be specific), it started crashing frequently during times of high load. Looking at the temp sensors told me what was going on. It only took about 30 seconds of high load to cause it to spike to really high temps and shut down (not a graceful shutdown, it was a "crash").
 
Old 04-11-2013, 04:08 PM   #6
rnturn
Member
 
Registered: Jan 2003
Location: Illinois (Chicago area)
Distribution: Red Hat (8.0), SuSE (10.x, 11.x, 12.2, 13.2), Solaris (8-10), Tru64
Posts: 982

Original Poster
Rep: Reputation: 53
Quote:
Originally Posted by H_TeXMeX_H
It's probably the PSU. Do you have another one to test with?
I have a spare PS sitting on the shelf. It was bought at roughly the same time as the MB so it should have the correct connector types. There's really nothing internal to the case that needs power besides the DVD drive. All the main storage is external so I don't anticipate that PS wattage will be an issue. All I need to do is find a window when nobody needs to use the system.

--
Rick
 
Old 04-11-2013, 05:01 PM   #7
rnturn
Member
 
Registered: Jan 2003
Location: Illinois (Chicago area)
Distribution: Red Hat (8.0), SuSE (10.x, 11.x, 12.2, 13.2), Solaris (8-10), Tru64
Posts: 982

Original Poster
Rep: Reputation: 53
Quote:
Originally Posted by haertig
I would recommend learning how to configure sensors.conf. You may think it is arcane, but it may tell you what is going on.
Well... I went through the process of running "sensors-config" -- accepted defaults for all the questions that it presented -- and now "lm_sensors" will not start. (Great... :^( ) I'm leaning more and more toward this being a PS-related problem. When it goes down the next time, I'll take the extra time ("You users will just have to wait another ten minutes, OK?") to swap in a different PS and we'll see if that clears up the spontaneous shutdowns. Then I can upgrade the OS to a current release; my systems that are running OpenSUSE 12.x seem to have sensor information available following the initial installation. Perhaps I could be getting more out of lm_sensors but at least the basics are there on those systems.

Quote:
Is your system under heavier load, meaning higher temps, right before the shutdown? You may be able to determine something about this using a modification of your data timestamping script that writes to /var/log/messages. Change that so it runs "uptime" instead, since part of the uptime output is the CPU load.
I have something that is currently writing breadcrumb-type messages into /var/log/messages. I can modify that to use the output of uptime as the message.

Quote:
I have a system that recently started shutting down unexpectedly after having been fine for years. It was a CPU heat problem. The fans were clean and running fine, but I believe it was a problem with the thermal-paste-to-heatsink interface. It probably had a heat issue since day one, or maybe the thermal paste just deteriorated over time, but because it had been used only for light duty stuff for years, the symptoms of high heat never appeared. When I started doing more heavy duty stuff on it (transcoding video to be specific), it started crashing frequently during times of high load. Looking at the temp sensors told me what was going on. It only took about 30 seconds of high load to cause it to spike to really high temps and shut down (not a graceful shutdown, it was a "crash").
Interesting. This doesn't look like your typical crash as the system is fully powered down when I walk up to it. It's not hung or sitting there with a panic message on the console. It just looks like a little gremlin pushed the power button. This system should NOT be heavily loaded at the times it is going down. (Not that it isn't from time to time.) One of its functions is as a mail server and that function normally doesn't tax the system very heavily, except after one of these power-down incidents, when our ISP sends along all the email that couldn't be delivered between the time the system went down and when I brought it back up. I've looked at logs before to see what activities might have been going on at the time the system powers off and haven't seen anything suspicious.

I've modified my cron/logger utility to include the system load and I'll see what it tells me. Now I have to wait until it fails. (It's a little weird cheering for it to up and die on me so I can look at the logs.)
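Once a failure does happen, the outage window can be pulled out of the log mechanically rather than by eye. The sketch below is one way to do that, under several assumptions not confirmed by the thread: the breadcrumb lines contain a fixed marker string (the hypothetical "HEARTBEAT"), the log uses the traditional syslog timestamp ("Apr  5 13:35:01", which carries no year), and breadcrumbs are written every five minutes. The year has to be supplied by hand, and a log that spans New Year would need extra handling.

```python
import sys
from datetime import datetime, timedelta

MARKER = "HEARTBEAT"  # hypothetical marker string in the breadcrumb lines


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a traditional syslog line."""
    return datetime.strptime("%d %s" % (year, line[:15]), "%Y %b %d %H:%M:%S")


def find_gaps(lines, year, max_gap=timedelta(minutes=6)):
    """Return (last_seen, first_back) pairs where breadcrumbs went missing.

    max_gap is slightly more than the assumed five-minute cron interval.
    """
    gaps = []
    previous = None
    for line in lines:
        if MARKER not in line:
            continue
        try:
            current = parse_stamp(line, year)
        except ValueError:
            continue  # skip lines whose timestamp is in another format
        if previous is not None and current - previous > max_gap:
            gaps.append((previous, current))
        previous = current
    return gaps


# Usage (arguments are the log path and the year the log covers):
#   python find_gaps.py /var/log/messages 2013
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as log:
        for last_seen, first_back in find_gaps(log, int(sys.argv[2])):
            print("last breadcrumb %s, next breadcrumb %s" % (last_seen, first_back))
```

The system went down at some point within one cron interval after each "last breadcrumb" time, so any load figures logged just before that time are the ones worth looking at.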

--
Rick
 
  


All times are GMT -5.
