Hardware failure checks

elthox · 10-27-2006, 08:13 AM

Hi,

I'd be grateful if anyone could help me with hardware failure checks. Up to now I have worked with HP-UX, and for the last few days I have been working on a new SUSE Linux platform. My problem is that I'm not clear on where I can catch all the useful events that may indicate failures and errors.

To be more specific:

For example, in HP-UX I used to monitor /opt/resmon/event.log for any suspicious event that could give me important data about failures.

These kinds of events were captured in my script like this:

# Count log lines that mention any hardware-related keyword (case-insensitive)
FLAG=$(cat /opt/resmon/event.log|egrep -i "power|Hardware|overtemp|temperature|disk|enclosure|fan|adapter"|wc -l)
if [ "$FLAG" -gt 0 ]
then
    # bla bla bla........
    :
fi

So if any record in the log matches one of these patterns (for example, "power" may mean a power supply problem, and "disk" may mean a disk failure), it makes me suspicious and lets me take precautions in time, because we work here on live platforms related to GSM. The idea is to automate sending alarms by SMS: if something is wrong in the log, a script I have written catches the pattern and sends it by SMS to the support team in real time.

All I want to know is how this kind of check can be applied on SUSE Linux. Do the logs show the same kinds of problems, or are there other kinds of critical errors? As we cannot simulate any failure on our platform, I don't know how these kinds of errors are represented in the log.

I hope that I have been clear in my explanations.

Thank you

unSpawn · 10-27-2006, 10:38 AM

Hello and welcome to LQ. Hope you like it here.

My problem is that I'm not clear on where I can catch all the useful events that may indicate failures and errors. (..) / All I want to know is how this kind of check can be applied on SUSE Linux. Do the logs show the same kinds of problems, or are there other kinds of critical errors?
If properly configured (sources, log levels), syslog (/etc/syslog.conf) is where the kernel's messages end up.
Then there are your other daemons' logs, if they don't log to syslog.
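
To make that concrete, here is a minimal sketch that reuses the keyword list from the HP-UX script against a syslog file. The default path /var/log/messages is an assumption (check /etc/syslog.conf for where kernel messages really go on your box), and count_hw_events is a name made up for illustration. Linux kernel messages will not necessarily use the same words as the HP-UX event log, so treat the pattern as a starting point only.

```shell
#!/bin/sh
# Assumption: kernel messages land in /var/log/messages; override LOGFILE if not.
LOGFILE="${LOGFILE:-/var/log/messages}"

# Keyword list carried over from the HP-UX script; adjust it for Linux messages.
PATTERN='power|hardware|overtemp|temperature|disk|enclosure|fan|adapter'

# count_hw_events FILE
# Prints the number of lines in FILE that match PATTERN (case-insensitive).
# Prints 0 if FILE is missing or unreadable.
count_hw_events() {
    if [ -r "$1" ]; then
        # grep -c prints 0 but exits non-zero when nothing matches; ignore that status.
        grep -Eic "$PATTERN" "$1" || true
    else
        echo 0
    fi
}
```

It would be called as `FLAG=$(count_hw_events "$LOGFILE")`, followed by the same `-gt 0` test as before. Because the function takes the file as an argument, it can also be pointed at a hand-written sample log to check the pattern without touching a live machine.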

For example, in HP-UX I used to monitor /opt/resmon/event.log for any suspicious event that could give me important data about failures. These kinds of events were captured in my script like this:
FLAG=$(cat /opt/resmon/event.log|egrep
That's horrible. If you don't want to deploy a full-scale network IT Service Management framework, at least use something local like Monit: it will restart services on error, perform custom tasks, keep tabs on SAR-like statistics and alert you. Top it off with something like Logwatch. That saves you time configuring grep rules and is easily extensible.

The idea here is the automation of sending alarms through SMS.
Then you had better know what that depends on. If, for instance, the machine loses all network connectivity, alerting goes to hell unless you have alternatives.
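
A rough sketch of what "alternatives" could look like: try each alert channel in turn and fall back to a local log if none of the remote ones work. Everything here is hypothetical. send_sms and send_mail are placeholders that always fail, standing in for whatever gateway you really use, and the ALERT_LOG path is an assumption.

```shell
#!/bin/sh
# Placeholders: both always fail here, simulating an unreachable network.
# Replace the bodies with your real SMS gateway / mail commands.
send_sms()  { return 1; }
send_mail() { return 1; }

# Last resort: append the alert to a local file (path is an assumption).
log_local() {
    echo "ALERT: $1" >> "${ALERT_LOG:-/tmp/alerts.log}"
}

# raise_alert MESSAGE
# Tries each channel in order, prints the name of the first one that
# succeeds, and returns non-zero if every channel failed.
raise_alert() {
    msg=$1
    for channel in send_sms send_mail log_local; do
        if "$channel" "$msg"; then
            echo "$channel"
            return 0
        fi
    done
    return 1
}
```

A local log alone does not wake anyone up, of course; the point is only that the alert path should not have a single dependency.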

As we cannot simulate any failure in our platform
I'm sorry to sound negative, but that's plain irresponsible. If you're working with critical machines you must also have a workbench consisting of some test servers. How else are you going to test and make sure that any reconfiguration, SW or HW upgrade, restore or whatever else can be performed flawlessly? If it's a matter of money, then someone just hasn't got his priorities straight (which he'll find out in no time). Just my thoughts.