LinuxQuestions.org
Suse/Novell This Forum is for the discussion of Suse Linux.

10-27-2006, 09:13 AM   #1
elthox
Hardware failure checks


Hi,

I'd be grateful if anyone could help me with hardware failure checks. Until now I have worked with HP-UX, and for the last few days I have been working on a new SUSE Linux platform. My problem is that I'm not sure where to find all the useful events that may indicate failures and errors.

To be more specific:

For example, on HP-UX I used to monitor /opt/resmon/event.log for any suspicious event that could give me important data about failures.

I captured these kinds of events in my script like this:

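# Count the log lines that match any of the hardware-related keywords
FLAG=$(grep -Eic "power|hardware|overtemp|temperature|disk|enclosure|fan|adapter" /opt/resmon/event.log)
if [ "$FLAG" -gt 0 ]
then
    : # bla bla bla........ (alarm action, elided in the original post)
fi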

So if any record in the log matches one of these patterns (for example, "power" may indicate a power supply problem, and "disk" may indicate a disk failure), it makes me suspicious and lets me take precautions in time, because we work here on live platforms related to GSM. The idea is to automate sending alarms by SMS: if something looks wrong in the log, a script I have written catches the pattern and sends it by SMS to the support team in real time.
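
A minimal sketch of that idea, one alarm per matching log line. The keyword list comes from the script above; `send_alarm` is a hypothetical placeholder that only prints the message, and would have to be replaced by whatever command actually talks to the SMS gateway.

```shell
#!/bin/sh
# Keywords taken from the HP-UX check above.
PATTERN='power|hardware|overtemp|temperature|disk|enclosure|fan|adapter'

# Print every line of the given log file that matches PATTERN (case-insensitive).
suspicious_events() {
    grep -Ei "$PATTERN" "$1"
}

# Hypothetical alert hook: replace the body with the real SMS command.
send_alarm() {
    printf 'ALARM: %s\n' "$1"
}

# Send one alarm per matching line.
# Returns non-zero when nothing matched (grep exits non-zero in that case).
check_log() {
    events=$(suspicious_events "$1") || return 1
    printf '%s\n' "$events" | while IFS= read -r line; do
        send_alarm "$line"
    done
}
```

Calling `check_log /opt/resmon/event.log` from cron would rescan the whole file on every run, so the same event would be reported repeatedly; a real script would also need to remember how far it has already read.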

All I want to know is how this approach can be applied on SUSE Linux. Does the log show the same kinds of problems, or are there other kinds of critical errors? Since we cannot simulate any failure on our platform, I don't know how these errors are represented in the log.

I hope my explanation has been clear.

Thank you
 
10-27-2006, 11:38 AM   #2
unSpawn
Hello and welcome to LQ. Hope you like it here.

My problem is that I'm not very clear from where can I catch all the useful events that may contain failures and errors. (..) / All I want to know is how these kinds of hints can be applied in SUSE Linux. Are there the same problems that the log show or maybe other kind of critical errors?
If properly configured (sources, log levels), syslog (/etc/syslog.conf) is where the kernel's log messages end up.
Then there are the logs of any other daemons that don't log to syslog.
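
As a rough sketch of what looking at syslog for hardware trouble could mean in practice: the snippet below filters a syslog-style file for kernel lines that look hardware-related. The file path is an assumption (check /etc/syslog.conf for where kernel messages really go on your box), and the pattern list is illustrative only, not a complete catalogue of kernel error messages.

```shell
#!/bin/sh
# Assumption: kernel messages are written to this file by syslog.
LOGFILE=${LOGFILE:-/var/log/messages}

# Illustrative, non-exhaustive patterns for hardware-related kernel messages.
HW_PATTERN='I/O error|machine check|temperature|ext3-fs error|scsi error'

# Print the kernel lines of the given log file that match HW_PATTERN.
# Syslog tags kernel messages as "<date> <host> kernel: <message>".
kernel_hw_events() {
    grep ' kernel: ' "$1" | grep -Ei "$HW_PATTERN"
}
```

Run it as `kernel_hw_events "$LOGFILE"`. Lines from other daemons are ignored on purpose, so an application that merely mentions "I/O error" does not trigger a hardware alarm.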


For example in HP-UX I used to supervise the /opt/resmon/event.log for any suspicious event that could give me important data about failures. These kinds of events were captured in my script like this:
FLAG=$(cat /opt/resmon/event.log|egrep

That's horrible. If you don't want to deploy a full-scale network IT Service Management framework, at least use something local like Monit: it will restart services on error, perform custom tasks, keep tabs on SAR-like system statistics and alert you. Top it off with something like Logwatch. That saves you the time of configuring grep rules, and it is easily extendable.


The idea here is the automation of sending alarms through sms-s.
Then you'd better know the dependencies that come with wanting that. If, for instance, the machine loses all network connectivity, then alerting goes to hell unless you have alternatives.
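
One way to picture "having alternatives" is a small wrapper that tries several alert channels in order and stops at the first one that works. This is only a sketch: `spool_alert` and its default spool path are hypothetical, and the real channels (SMS gateway, serial modem, mail) would be your own commands that exit non-zero on failure.

```shell
#!/bin/sh
# Try each alert channel in turn until one succeeds.
# Usage: alert_with_fallback "message" channel1 channel2 ...
# Each channel is a command or function that takes the message as its
# only argument and returns 0 on success.
alert_with_fallback() {
    msg=$1
    shift
    for channel in "$@"; do
        if "$channel" "$msg"; then
            return 0
        fi
    done
    return 1    # every channel failed
}

# Hypothetical last-resort channel: append the alert to a local spool
# file so it is not lost while the network is down.
spool_alert() {
    printf '%s\n' "$1" >> "${SPOOL_FILE:-/var/tmp/alerts.spool}"
}
```

For example, `alert_with_fallback "fan failure on node1" send_sms spool_alert` would only write to the spool file if the (hypothetical) `send_sms` command failed.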


As we cannot simulate any failure in our platform
I'm sorry to sound negative, but that's plain irresponsible. If you're working with critical machines you must also have a workbench comprising some test servers. How else are you gonna test and make sure any reconfiguration, SW or HW upgrade, restore or whatever else can be performed flawlessly? If it's a matter of money, then someone just hasn't got his priorities straight (which he'll find out in no time). Just my thoughts.
 
  

