LinuxQuestions.org
Linux - Hardware This forum is for Hardware issues.
Having trouble installing a piece of hardware? Want to know if that peripheral is compatible with Linux?

Old 03-21-2014, 03:00 AM   #1
LinuGeek
Member
 
Registered: Jun 2008
Posts: 126

Rep: Reputation: 0
Linux Hardware Monitoring Tool


Hello Experts,

Recently we have been looking for a good monitoring tool that will help us find out, or alert us in advance, when a particular hard disk or CPU fan is about to fail.
I have searched around and tried some of the existing tools, such as Cacti and Monitorix, but they seem to be more operating-system focused: they provide information you can easily get by running simple UNIX commands, but they fail to provide information about hard disk bad blocks or the system temperature rising because a fan has stopped or is malfunctioning.
If there are any such tools, please point me in the right direction and I shall explore.
For the record, we are using SLES Linux in our environment, but for that matter we can set up any operating system as the monitoring server.
If you have any questions, please let me know. Thanks.

Regards,
Admin
 
Old 03-21-2014, 03:28 AM   #2
TenTenths
Senior Member
 
Registered: Aug 2011
Location: Dublin
Distribution: Centos 5 / 6 / 7
Posts: 3,475

Rep: Reputation: 1553
A lot of this depends on your hardware and what your hardware provider chooses to expose.

I can only comment on my own situation. Here we only use HP servers (DL320/DL360/DL380) and install the relevant HP system monitoring tools. These expose a LOT of data which can be read with SNMP calls.

We use Nagios to make the SNMP calls to read and alert on the various items. We usually just monitor the "overall" levels, for example the "Thermal Condition" OID, the "Thermal Fan Status" OID and the status of individual drives.
 
Old 03-21-2014, 05:13 AM   #3
LinuGeek
Member
 
Registered: Jun 2008
Posts: 126

Original Poster
Rep: Reputation: 0
Thanks for the reply.
That makes sense. May I ask how you use Nagios with SNMP to monitor the health?
Thanks.
 
Old 03-21-2014, 05:28 AM   #4
TenTenths
Senior Member
 
Registered: Aug 2011
Location: Dublin
Distribution: Centos 5 / 6 / 7
Posts: 3,475

Rep: Reputation: 1553
We have a Nagios instance in each data center, on a separate machine.

These then make normal SNMP queries against the HP OIDs, for example:

Overall Thermal Condition - .1.3.6.1.4.1.232.6.2.6.1.0. If it returns a result other than 2 (OK), we raise an alert.
Overall Fan Status - .1.3.6.1.4.1.232.6.2.6.5.0. Again, if it returns a result other than 2, we raise an alert.

We also have System Management Homepage enabled on the servers so that if any of the checks return an alert we use HP's own page to investigate.
By using the "Overall" conditions we can monitor for issues but not have to know that a specific system has X/Y/Z number of temperature sensors or fans.

We also have similar checks for individual drives and power supplies.
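
For anyone who wants to try the approach described above, here is a minimal sketch of such a check in Python. It is not the poster's actual configuration. It assumes Net-SNMP's snmpget is installed and that the servers answer SNMP v2c with a community string, and it treats any value other than 2 as an alert, as described in the post. The function names, script name and argument order are made up for illustration.

```python
"""Nagios-style check for an HP "overall" health OID read over SNMP (sketch)."""
import subprocess

# OIDs quoted in the post above.
THERMAL_CONDITION_OID = ".1.3.6.1.4.1.232.6.2.6.1.0"
FAN_STATUS_OID = ".1.3.6.1.4.1.232.6.2.6.5.0"

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def classify(raw_value):
    """Map the raw SNMP value to (exit_code, message); only 2 means OK."""
    try:
        status = int(raw_value.strip())
    except ValueError:
        return UNKNOWN, "UNKNOWN - unexpected SNMP value %r" % raw_value
    if status == 2:
        return OK, "OK - status 2"
    return CRITICAL, "CRITICAL - status %d" % status


def snmp_get(host, community, oid):
    """Fetch one value with snmpget (value only, enums printed as numbers)."""
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqve", host, oid],
        capture_output=True, text=True, timeout=10, check=True,
    )
    return result.stdout


def main(argv):
    """Hypothetical entry point: argv is [script, host, community, oid]."""
    if len(argv) != 4:
        print("UNKNOWN - usage: check_hp_health.py <host> <community> <oid>")
        return UNKNOWN
    _, host, community, oid = argv
    try:
        code, message = classify(snmp_get(host, community, oid))
    except (subprocess.SubprocessError, OSError) as exc:
        code, message = UNKNOWN, "UNKNOWN - snmpget failed: %s" % exc
    print(message)
    return code
```

To use it as a plugin you would add `import sys` and `sys.exit(main(sys.argv))` at the bottom, then point a Nagios command definition at the script once per OID.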
 
Old 03-21-2014, 11:51 PM   #5
haertig
Senior Member
 
Registered: Nov 2004
Distribution: Debian, Ubuntu, LinuxMint, Slackware, SysrescueCD, Raspbian, Arch
Posts: 2,331

Rep: Reputation: 357
You can look at "psensor" for monitoring temps. But realize that if you're expecting to catch rising CPU temps and "have it alert you", by the time it does, the problem will probably have "fixed" itself (i.e., your computer will have hit thermal shutdown). CPU temps can spike very quickly, going from "starting to get warm" up to "thermal shutdown" in just a few seconds. Generally it will take a little longer, assuming you have adequate cooling and good thermal paste. But if the high temps are because your CPU fan went out, when your system hits a high load average you're going to get into trouble fast. Too fast for an email/page to alert you in time to respond manually.

I doubt you are going to find a larger, generic monitoring solution like Nagios, Xymon, Cacti, etc. that is going to monitor CPU temps and physical hard disk errors straight out of the box. Usually you will have to write, or find and download, some script that does the actual monitoring and then feeds that info into Nagios/Xymon/Cacti/etc. Scripts that monitor that type of low-level stuff WILL be operating system specific. You may be able to detect a failed fan and alert yourself, and if you can respond BEFORE load gets high, you may be able to do something about it before you hit a thermal shutdown situation. But that depends not only on monitoring, but on luck as well. I would recommend looking at an automated shutdown script triggered by fast-rising CPU temps rather than a manual alert from a monitoring system.
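
A rough sketch of what such an automated shutdown script could look like in Python is below. Everything specific in it is an assumption for illustration: the sysfs path (thermal_zone0 is not the CPU sensor on every machine), the temperature and rate thresholds, the one-second polling interval, and the function names. It would need to run as root for the shutdown call to work.

```python
import subprocess
import time

# Assumed values; tune them for your hardware.
TEMP_FILE = "/sys/class/thermal/thermal_zone0/temp"  # millidegrees Celsius
MAX_TEMP_C = 85.0        # shut down at or above this temperature
MAX_RISE_C_PER_S = 2.0   # ...or when the temperature climbs this fast
POLL_SECONDS = 1.0


def read_temp_c(path=TEMP_FILE):
    """Read a sysfs temperature file and convert millidegrees to degrees C."""
    with open(path) as handle:
        return int(handle.read().strip()) / 1000.0


def should_shut_down(previous_c, current_c, elapsed_s,
                     max_temp_c=MAX_TEMP_C, max_rise=MAX_RISE_C_PER_S):
    """Return True if the CPU is too hot or heating up too quickly."""
    if current_c >= max_temp_c:
        return True
    if elapsed_s <= 0:
        return False
    return (current_c - previous_c) / elapsed_s >= max_rise


def watch():
    """Poll the sensor and power off as soon as a limit is crossed."""
    previous = read_temp_c()
    while True:
        time.sleep(POLL_SECONDS)
        current = read_temp_c()
        if should_shut_down(previous, current, POLL_SECONDS):
            subprocess.run(["shutdown", "-h", "now"], check=False)
            return
        previous = current
```

Calling `watch()` starts the loop. Checking the rate of rise as well as the absolute temperature is what lets the script react to a dead fan before the hard limit is reached.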

For a failing hard disk you usually have more time. Not always a lot of time, but generally much more time than a frying CPU gives you.
 
Old 03-22-2014, 02:17 PM   #6
LinuGeek
Member
 
Registered: Jun 2008
Posts: 126

Original Poster
Rep: Reputation: 0
Thanks, haertig, for the detailed reply.

I agree with your point about CPU temperature. My requirement is that if a CPU fan is defective, I should know, not necessarily immediately, but at least before the system shuts itself down because of high temperature.

And yes, if I can find a simple command to detect either a fan defect or a high temperature, then I will feed it to Nagios to send me an alert.
That is the minimum I am expecting here.
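
One possible starting point for the fan side, sketched in Python under some assumptions: it reads the kernel's hwmon sysfs files (`fan*_input`, in RPM), which only exist if a sensor driver for the board is loaded. The 500 RPM threshold and the function names are made up for illustration, and the exit codes follow the usual Nagios convention (0 OK, 2 CRITICAL).

```python
import glob
import os

HWMON_GLOB = "/sys/class/hwmon/hwmon*/fan*_input"  # fan speeds in RPM
MIN_RPM = 500  # assumed threshold; adjust for your fans


def stalled_fans(pattern=HWMON_GLOB, min_rpm=MIN_RPM):
    """Return {path: rpm} for every fan sensor reporting below min_rpm."""
    stalled = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                rpm = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # sensor unreadable; skip it rather than guess
        if rpm < min_rpm:
            stalled[path] = rpm
    return stalled


def nagios_status(stalled):
    """Turn the result into an (exit_code, message) pair for Nagios."""
    if stalled:
        details = ", ".join("%s=%d RPM" % (os.path.basename(p), r)
                            for p, r in stalled.items())
        return 2, "CRITICAL - slow or stopped fans: " + details
    return 0, "OK - no fan below threshold"
```

Two caveats: an unconnected fan header may also read 0 RPM and would need to be excluded, and a machine with no fan sensors at all will report OK, so it is worth confirming that the sensor files exist first.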

As for the hard disk, since we still have a reasonable amount of time between the first bad blocks and the failure of the entire disk, a simple alert about bad blocks would suffice.
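
For the bad-block alert, one option that is not mentioned in this thread is the `smartctl` command from smartmontools. The Python sketch below assumes it is installed and that the disk reports SMART attributes in the usual ten-column table. The list of watched attributes, the function names and the choice of WARNING (exit code 1) are assumptions for illustration.

```python
import subprocess

# Attributes commonly associated with bad sectors (assumed watch list).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable")


def bad_sector_counts(smartctl_output):
    """Extract raw values of the watched attributes from `smartctl -A` text."""
    counts = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10+ columns: ID, name, ..., raw value.
        if len(fields) >= 10 and fields[1] in WATCHED:
            try:
                counts[fields[1]] = int(fields[9])
            except ValueError:
                pass  # raw value is not a plain integer; ignore it
    return counts


def check_disk(device):
    """Run smartctl on one device and return (exit_code, message)."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True, check=False)
    counts = bad_sector_counts(result.stdout)
    if not counts:
        return 3, "UNKNOWN - %s: no SMART attributes read" % device
    nonzero = {name: value for name, value in counts.items() if value}
    if nonzero:
        return 1, "WARNING - %s: %s" % (device, nonzero)
    return 0, "OK - %s: no reallocated or pending sectors" % device
```

`check_disk("/dev/sda")` would be the typical call (run as root). Disks behind a hardware RAID controller usually need extra smartctl options or the vendor's own tools.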

So now that I have elaborated, does anyone here know of free tools that can alert me to hard disk bad blocks? And, if possible, a defective fan as well?

Thanks in advance.


Regards,
Admin
 
  


All times are GMT -5.
