LinuxQuestions.org - Linux Hardware Monitoring Tool

Hello Experts,

Recently we have been looking for a good monitoring tool that will alert us in advance when a particular hard disk or CPU fan is about to fail.
I have searched around and tried some of the existing tools such as Cacti and Monitorix, but they seem to work at the operating system level: they report information you can easily get by running simple UNIX commands, but they fail to report things like hard disk bad blocks or the system temperature rising because a fan has stopped or is faulty.
If there are any such tools, please point me in the right direction and I shall explore.
For the record, we are using SLES Linux in our environment, but we can set up any operating system for the monitoring server.
If you have any questions, please let me know. Thanks.

Regards,
Admin

A lot of this depends on your hardware and what your hardware provider chooses to expose.

I can only comment on my own situation. Here we only use HP servers (DL320/DL360/DL380) and install the relevant HP system monitoring tools, which expose a LOT of data that can be read with SNMP calls.

We use Nagios to make the SNMP calls and to read and alert on the various items. We usually just monitor the "overall" levels, for example the "Thermal Condition" OID, the "Thermal Fan Status" OID and the status of individual drives.

Thanks for the reply.
That makes sense. May I ask how you use Nagios with SNMP to monitor the health?
Thanks.

We have a Nagios instance in each data center, running on a separate machine.

These then make normal SNMP queries through the HP OIDs, for example:

Overall Thermal Condition - .1.3.6.1.4.1.232.6.2.6.1.0: if it returns a result other than 2 (OK), we raise an alert.
Overall Fan Status - .1.3.6.1.4.1.232.6.2.6.5.0: again, if it returns a result other than 2, we raise an alert.

We also have System Management Homepage enabled on the servers, so if any of the checks raises an alert we use HP's own page to investigate.
By using the "Overall" conditions we can monitor for issues without having to know that a specific system has X/Y/Z number of temperature sensors or fans.

We also have similar checks for individual drives and power supplies.
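
As a rough illustration of the checks described in this post, here is a minimal Python sketch of a Nagios-style plugin. It is not the poster's actual setup: it assumes the net-snmp `snmpget` command is installed on the Nagios machine and that the servers answer SNMP v2c, and the function names (`evaluate`, `snmp_get`, `check_host`) are made up for the example. Only the two OIDs and the "2 means OK" rule come from the post.

```python
import subprocess

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# The two OIDs quoted in the post.
HP_OIDS = {
    "Overall Thermal Condition": ".1.3.6.1.4.1.232.6.2.6.1.0",
    "Overall Fan Status": ".1.3.6.1.4.1.232.6.2.6.5.0",
}

HP_STATUS_OK = 2  # per the post: any other value raises an alert


def evaluate(label, raw_value):
    """Map one SNMP reply to a (Nagios exit code, message) pair."""
    try:
        value = int(raw_value.strip())
    except ValueError:
        return UNKNOWN, f"UNKNOWN - {label}: unexpected reply {raw_value!r}"
    if value == HP_STATUS_OK:
        return OK, f"OK - {label} = {value}"
    return CRITICAL, f"CRITICAL - {label} = {value}"


def snmp_get(host, community, oid, timeout=10):
    """Fetch one OID with snmpget (assumed installed).

    -Oqve asks for the value only, with enums printed as numbers.
    """
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqve", host, oid],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout


def check_host(host, community):
    """Check every OID; return the highest exit code and all messages."""
    worst, messages = OK, []
    for label, oid in HP_OIDS.items():
        try:
            code, message = evaluate(label, snmp_get(host, community, oid))
        except (OSError, subprocess.SubprocessError) as exc:
            code, message = UNKNOWN, f"UNKNOWN - {label}: {exc}"
        # Highest code wins, so UNKNOWN outranks CRITICAL in this sketch.
        worst = max(worst, code)
        messages.append(message)
    return worst, messages
```

A wrapper script would call `check_host(host, community)`, print the messages and exit with the returned code, which is what Nagios reads to decide whether to alert.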

You can look at "psensor" for monitoring temps. But realize that if you're expecting to catch rising CPU temps and "have it alert you", by the time it does, the problem will probably have "fixed" itself (i.e., your computer will have hit thermal shutdown). CPU temps can spike very quickly, going from "starting to get warm" up to "thermal shutdown" in just a few seconds. Generally it will take a little longer, assuming you have adequate cooling and good thermal paste. But if the high temps are because your CPU fan went out, when your system hits a high load average you're going to get into trouble fast. Too fast for an email/page alert to give you time to respond manually.

I doubt you are going to find a larger, generic monitoring solution like Nagios, Xymon, Cacti, etc. that will monitor CPU temps and physical hard disk errors straight out of the box. Usually you will have to write, or find and download, some script that does the actual monitoring and then feeds that info into Nagios/Xymon/Cacti/etc. Scripts that monitor that type of low-level stuff WILL be operating system specific. You may be able to detect a failed fan and alert yourself, and if you can respond BEFORE load gets high, you may be able to do something about it before you hit a thermal shutdown situation. But that depends not only on monitoring, but on luck as well. I would recommend looking at an automated shutdown script triggered by fast-rising CPU temps rather than a manual alert from a monitoring system.
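
To make the "automated shutdown script" idea concrete, here is a minimal Python sketch. Everything specific in it is an assumption for illustration: the sysfs path (`thermal_zone0` is not necessarily the CPU on a given machine), the thresholds of 85 °C and 2 °C per second, and the function names. It would need to run as root, and the limits should be tuned to the actual hardware.

```python
import subprocess
import time

# Assumed sysfs path; the zone that tracks the CPU differs between machines.
TEMP_PATH = "/sys/class/thermal/thermal_zone0/temp"


def read_temp_c(path=TEMP_PATH):
    """Read a temperature in degrees Celsius (the file holds millidegrees)."""
    with open(path) as handle:
        return int(handle.read().strip()) / 1000.0


def should_shut_down(samples, max_temp_c=85.0, max_rise_c_per_s=2.0):
    """Decide from (timestamp_seconds, temp_c) samples, oldest first.

    True when the newest reading is at or above max_temp_c, or when the
    temperature rose faster than max_rise_c_per_s across the window.
    Both limits are illustrative defaults, not recommendations.
    """
    if not samples:
        return False
    newest_time, newest_temp = samples[-1]
    if newest_temp >= max_temp_c:
        return True
    oldest_time, oldest_temp = samples[0]
    elapsed = newest_time - oldest_time
    if elapsed <= 0:
        return False
    return (newest_temp - oldest_temp) / elapsed > max_rise_c_per_s


def watch(interval_s=1.0, window=5):
    """Poll the sensor and halt the machine when should_shut_down() fires."""
    samples = []
    while True:
        samples.append((time.monotonic(), read_temp_c()))
        samples = samples[-window:]  # keep only the most recent readings
        if should_shut_down(samples):
            # Needs root; halts the machine immediately.
            subprocess.run(["shutdown", "-h", "now"], check=False)
            return
        time.sleep(interval_s)
```

Keeping the decision in `should_shut_down` separate from the polling loop means the thresholds can be tried against recorded readings before the script is trusted to power off a server.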

For a failing hard disk you usually have more time. Not always a lot of time, but generally much more time than a frying CPU gives you.
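
Because a disk usually gives more warning, a periodic check fed into Nagios is workable there. The sketch below is one possible approach, not something from this thread: it assumes smartmontools is installed and parses the attribute table printed by `smartctl -A` for ATA/SATA drives. The watched attribute names are common ones but vary by vendor, and the function names are made up for the example.

```python
import subprocess

# Attributes commonly associated with bad blocks; names vary by vendor.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_smart_attributes(text):
    """Extract {attribute_name: raw_value} from `smartctl -A` table output."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is the tenth column.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip it
    return values


def evaluate_disk(attributes):
    """Return a (Nagios exit code, message) pair: WARNING on any nonzero count."""
    bad = {name: attributes[name] for name in WATCHED if attributes.get(name, 0) > 0}
    if bad:
        detail = ", ".join(f"{name}={count}" for name, count in bad.items())
        return 1, f"WARNING - {detail}"
    return 0, "OK - no reallocated or pending sectors reported"


def read_smart(device):
    """Run smartctl (usually needs root) and return its text output."""
    # smartctl's exit status is a bitmask, so a nonzero code is not treated
    # as a failure of the command itself.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return result.stdout
```

A Nagios check would then call `evaluate_disk(parse_smart_attributes(read_smart("/dev/sda")))`, where `/dev/sda` is just a placeholder device name.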

Thanks, haertig, for the detailed reply.

I agree with your point of view on CPU temps. My requirement is that if a CPU fan is defective, I should know about it, not necessarily immediately, but at least before the system shuts itself down because of high temperature.

And yes, if I can find a simple command to detect either a fan defect or a high temperature, then I will feed it to Nagios so that it sends me an alert.
That's the minimum I am expecting here.

As for the hard disk, since we still have a reasonable amount of time between the first bad blocks and the actual failure of the entire disk, a simple alert about the bad blocks would suffice.

So now that I have elaborated, does anyone here know of free tools that can alert me to hard disk bad blocks? And, if possible, to a defective fan as well?

Thanks in advance.

Regards,
Admin