Hardware failure checks
Hi,
Id be grateful if anyone could help me in hardware failures checks. Up to now I have been familiar with HP-UX and im working these last days on a new suse linux platform. My problem is that Im not very clear from where can I catch all the useful events that may contain failures and errors.
To be more specific;
For example in HP-UX I used to supervize the /opt/resmon/event.log for any suspicious event that could give me important data about failures.
These kinds of events were captured in my script like this:
FLAG=$(cat /opt/resmon/event.log|egrep -i "power|Hardware|overtemp|temperature|disk|enclosure|fan|adapter"|wc -l)
if [ $FLAG -gt 0 ]
then
bla bla bla........
So if any record in the log has a pattern like this (power...it may be a power supply problem, or If I catch the word disk in the log it maybe the disk failure) it makes me doubt and take the precautions in time without being late because we work here on live platforms related to GSM. THe idea here is the automation of sending alarms through sms-s. So if i notice something wrong in the log, i have created a script to catch this pattern and send it by sms to the support team in real time.
All I want to know is how these kinds of hints can be applied in suse linux. Are there the same problems that the log show or may be other kind of critical errors? As we cannot simulate any failure in our platform I dont know how these kind of errors are represented in the log?
I hope that I have been clear in my explanations
THank you
|