Suse/Novell: This forum is for the discussion of Suse Linux.
I'd be grateful if anyone could help me with checking for hardware failures. Up to now I have been working with HP-UX, and for the last few days I have been working on a new Suse Linux platform. My problem is that I am not clear on where I can catch all the useful events that may indicate failures and errors.
To be more specific:
For example, in HP-UX I used to monitor /opt/resmon/event.log for any suspicious event that could give me important data about failures.
These kinds of events were captured in my script like this:
# Count the log lines that match any of the failure keywords (case-insensitive)
FLAG=$(grep -Eic "power|hardware|overtemp|temperature|disk|enclosure|fan|adapter" /opt/resmon/event.log)
if [ "$FLAG" -gt 0 ]; then
bla bla bla........
So if any record in the log matches a pattern like this (for "power" it may be a power supply problem, and if I catch the word "disk" it may be a disk failure), it makes me suspicious and lets me take precautions in time, because we work here on live platforms related to GSM. The idea is to automate sending alarms by SMS: if something looks wrong in the log, a script I have written catches the pattern and sends it by SMS to the support team in real time.
All I want to know is how this kind of check can be applied in Suse Linux. Does the log show the same kinds of problems, or might there be other kinds of critical errors? As we cannot simulate any failure on our platform, I don't know how these errors are represented in the log.
My problem is that I am not clear on where I can catch all the useful events that may indicate failures and errors. (..) / All I want to know is how this kind of check can be applied in Suse Linux. Does the log show the same kinds of problems, or might there be other kinds of critical errors?
If it is properly configured (sources, log levels), syslog (/etc/syslog.conf) is where the kernel's messages end up.
Then there are the logs of your other daemons, if those don't log to syslog.
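As a rough illustration of carrying the HP-UX approach over to a syslog file, here is a minimal sketch. The function name count_hw_events and the default path /var/log/messages are assumptions for the example; check your syslog configuration for where messages are actually written. The keyword list is copied from the script quoted in this thread, and broad words like "disk" or "fan" will probably match harmless lines too, so expect to tune it.

```shell
#!/bin/sh
# Keywords copied from the HP-UX script quoted in this thread.
PATTERN='power|hardware|overtemp|temperature|disk|enclosure|fan|adapter'

# Print the number of lines in the given log file that match PATTERN
# (case-insensitive). Prints 0 if the file is missing or unreadable.
# count_hw_events is a hypothetical helper name, not a standard tool.
count_hw_events() {
    logfile=$1
    if [ -r "$logfile" ]; then
        # grep -c prints 0 but exits non-zero when nothing matches,
        # so ignore its exit status.
        grep -Eic "$PATTERN" "$logfile" || true
    else
        echo 0
    fi
}

# Assumed log location; override with LOG=/path/to/file.
LOG=${LOG:-/var/log/messages}
FLAG=$(count_hw_events "$LOG")
if [ "$FLAG" -gt 0 ]; then
    echo "$FLAG suspicious line(s) in $LOG"
fi
```

Run from cron, this rescans the whole file every time, so the same line would be reported repeatedly until the log rotates; a real setup would need to remember how far it has already read.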
For example, in HP-UX I used to monitor /opt/resmon/event.log for any suspicious event that could give me important data about failures. These kinds of events were captured in my script like this:
That's horrible. If you don't want to deploy a full-scale network IT Service Management framework, at least use something local like Monit: it will restart services on error, perform custom tasks, keep tabs on SAR-like stats and alert you. Top it off with something like Logwatch. That saves you the time of configuring grep rules, and it is easily extendable.
The idea is to automate sending alarms by SMS.
Then you had better know the dependencies that come with wanting that. If, for instance, the machine loses all network connectivity, then alerting goes to hell unless you have alternatives.
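To make the point about alternatives concrete, here is a minimal sketch of trying several alert channels in order. The function send_alert and the channel commands are hypothetical; each channel is assumed to be a command or script that takes the message as its only argument and exits non-zero on failure.

```shell
#!/bin/sh
# Try each alert channel in order until one succeeds.
# Usage: send_alert "message" channel1 channel2 ...
# send_alert and the channel names are hypothetical, for illustration only.
send_alert() {
    msg=$1
    shift
    for channel in "$@"; do
        if "$channel" "$msg"; then
            echo "sent via $channel"
            return 0
        fi
    done
    echo "all channels failed" >&2
    return 1
}
```

For example, `send_alert "disk error on host X" sms_via_network sms_via_modem` would fall back to a locally attached modem when the network path fails, assuming you have written both of those wrapper scripts yourself.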
As we cannot simulate any failure in our platform
I'm sorry to sound negative, but that's plain irresponsible. If you're working with critical machines you must also have a workbench consisting of some test servers. How else are you gonna test and make sure that any reconfiguration, SW or HW upgrade, restore or whatever else can be performed flawlessly? If it's a matter of money, then someone just hasn't got his priorities straight (which he'll find out in no time). Just my thoughts.