LinuxQuestions.org
Old 10-27-2006, 08:13 AM   #1
elthox
Member
 
Registered: Oct 2006
Posts: 41

Rep: Reputation: 15
Hardware failure checks


Hi,

I'd be grateful if anyone could help me with hardware failure checks. Until now I have worked with HP-UX, and for the last few days I have been working on a new SUSE Linux platform. My problem is that I'm not clear where I can catch all the useful events that may indicate failures and errors.

To be more specific:

For example, in HP-UX I used to monitor /opt/resmon/event.log for any suspicious event that could give me important data about failures.

These kinds of events were captured in my script like this:

FLAG=$(cat /opt/resmon/event.log|egrep -i "power|Hardware|overtemp|temperature|disk|enclosure|fan|adapter"|wc -l)
if [ $FLAG -gt 0 ]
then
    # bla bla bla........
    :
fi

So if any record in the log matches one of these patterns (for example "power" may indicate a power supply problem, and "disk" may indicate a disk failure), it raises a flag and lets us take precautions in time, because we work here on live platforms related to GSM. The idea is to automate sending alarms by SMS: if I notice something wrong in the log, a script I created catches the pattern and sends it by SMS to the support team in real time.
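One caveat with counting matches over the whole log, as the script above does, is that an old entry keeps matching on every run and would trigger the same SMS again. A minimal sketch of reading only the lines added since the previous run follows. The function name new_log_lines and the state file are illustrative assumptions, not part of the original script, and the sketch assumes the state file only ever holds a number written by this function.

```shell
#!/bin/sh
# Print only the lines appended to a log file since the previous call.
# $1: log file to read
# $2: state file that remembers how many lines were already seen
#     (hypothetical; pick any writable path)
new_log_lines() {
    logfile=$1
    statefile=$2
    seen=0
    [ -r "$statefile" ] && seen=$(cat "$statefile")
    # tr strips the padding some wc implementations add around the count.
    total=$(wc -l < "$logfile" | tr -d ' ')
    # If the log shrank (for example after rotation), start again from the top.
    [ "$total" -lt "$seen" ] && seen=0
    if [ "$total" -gt "$seen" ]; then
        sed -n "$((seen + 1)),${total}p" "$logfile"
    fi
    echo "$total" > "$statefile"
}
```

The output could then be piped into the same egrep pattern as before, so that only new matches raise an alarm.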

All I want to know is how this kind of check can be applied on SUSE Linux. Does the log show the same kinds of problems, or are there other kinds of critical errors? As we cannot simulate any failure in our platform, I don't know how these errors are represented in the log.

I hope that I have been clear in my explanations.

Thank you
 
Old 10-27-2006, 10:38 AM   #2
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
Hello and welcome to LQ. Hope you like it here.

My problem is that I'm not clear where I can catch all the useful events that may indicate failures and errors. (..) / All I want to know is how this kind of check can be applied on SUSE Linux. Does the log show the same kinds of problems, or are there other kinds of critical errors?
If properly configured (sources, log levels), syslog (/etc/syslog.conf) is where the kernel dumps its logs.
Then there are your other daemons' logs, if they don't log to syslog.
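As a rough illustration of checking what syslog has collected, here is a minimal sketch that counts suspicious lines in a log file. It assumes syslog writes to /var/log/messages, which depends on the local syslog configuration, and it reuses the pattern list from the HP-UX script in the question. The function name count_hw_events is made up for this example.

```shell
#!/bin/sh
# Pattern list borrowed from the HP-UX script in the question.
HW_PATTERN='power|hardware|overtemp|temperature|disk|enclosure|fan|adapter'

# Print the number of lines in a log file that match HW_PATTERN,
# ignoring case.
# $1: log file; the default /var/log/messages is an assumption and
#     depends on how syslog is configured on the machine.
count_hw_events() {
    logfile=${1:-/var/log/messages}
    [ -r "$logfile" ] || return 2
    # grep -c prints the count itself; its exit status 1 only means
    # "no match", so it is not treated as an error here.
    grep -Eic "$HW_PATTERN" "$logfile" || true
}
```

A call such as `count_hw_events /var/log/messages` prints a single number. What the kernel and drivers actually write for a given fault varies by hardware, so the pattern list is only a starting point.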


For example, in HP-UX I used to monitor /opt/resmon/event.log for any suspicious event that could give me important data about failures. These kinds of events were captured in my script like this:
FLAG=$(cat /opt/resmon/event.log|egrep

That's horrible. If you don't want to deploy a full-scale network IT Service Management framework, at least use something local like Monit: it will restart services on error, perform custom tasks, keep tabs on SAR-like specs and alert you. Top it off with something like Logwatch. That saves you time configuring grep rules and is easily extendable.


The idea is to automate sending alarms by SMS.
Then you'd better know the dependencies of wanting that. If, for instance, the machine loses all network connectivity, then alerting goes to hell unless you have alternatives.
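One way to picture "having alternatives" is a small wrapper that tries several alert channels in order and stops at the first one that works. This is only a sketch: send_alert and the channel names in the usage note are hypothetical, and each channel is assumed to be a command or shell function that takes the message as its only argument and returns non-zero on failure.

```shell
#!/bin/sh
# Try each alert channel in turn and stop at the first one that succeeds.
# $1: the alert message
# remaining arguments: names of commands or shell functions, each of
# which takes the message as its only argument (all hypothetical here).
send_alert() {
    msg=$1
    shift
    for channel in "$@"; do
        if "$channel" "$msg"; then
            return 0
        fi
    done
    # Every channel failed; the caller decides what to do next.
    return 1
}
```

It could be called as `send_alert "disk error on host1" via_network_sms via_serial_modem via_local_log`, where the three channel functions are placeholders you would write for your own environment.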


As we cannot simulate any failure in our platform
I'm sorry to sound negative, but that's plain irresponsible. If you're working with critical machines you must also have a workbench consisting of some testing servers. How else are you going to test and make sure any reconfiguration, SW or HW upgrade, restore or whatever else can be performed flawlessly? If it's a matter of money then someone just hasn't got his priorities straight (which he'll find out in no time). Just my thoughts.
 
  


All times are GMT -5. The time now is 08:00 AM.
