Quote:
Originally Posted by sluge
So, SMART affects only HW disk errors
|
Not quite; in the Linux world, SMART monitors the disks for H/W & S/W problems. Controller problems (just like certain filesystem problems) are usually caught via other monitoring tools.
Quote:
Originally Posted by sluge
...as I know, HW erros also can be on I/O controller, raid controller and other HW.
|
If you're looking for something that does "From the Disk, up" monitoring, look into SNMP. I would also suggest you take a crack at lm_sensors, though if you have a wide variety of "generic" x86 hardware, the tweaking and tuning of things like voltage and fan monitors can turn into a headache (this is part of the reason why you see large datacenters standardize on a few models of servers, rather than 50 different kinds).
Quote:
Originally Posted by sluge
...solaris iostat -Ee includes such types of errors, but also is very useful to get SW disk I/O erros, like filesystems error and other. Solaris has sw errors, but linux not
|
The following SMART attributes are incremented when a filesystem error "and other" is encountered (like the system being unable to read or write to a certain block):
Reallocated_Sector_Ct
Offline_Uncorrectable
Current_Pending_Sector
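If you want to keep an eye on just those three counters, here is a minimal Python sketch that pulls them out of the attribute table printed by 'smartctl -A'. The function name and the sample table are my own illustration (the numbers are made up), and it assumes the usual ten-column ATA attribute layout, so check it against what your own drives print.

```python
# Attribute names taken from the list above.
WATCHED = ("Reallocated_Sector_Ct", "Offline_Uncorrectable", "Current_Pending_Sector")


def parse_smart_attributes(text, watched=WATCHED):
    """Return {attribute_name: raw_value} for the watched attributes.

    Assumes rows shaped like the `smartctl -A` table:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have at least 10 columns and start with a numeric ID;
        # this also skips the header row, which starts with "ID#".
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        if name in watched:
            raw = fields[9]  # RAW_VALUE column
            found[name] = int(raw) if raw.isdigit() else raw
    return found


# Made-up sample output, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8123
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    for name, raw in parse_smart_attributes(SAMPLE).items():
        print(f"{name}: {raw}")
```

On a real box you would feed it the text from something like 'smartctl -A /dev/sda' (the device path is just an example) instead of SAMPLE.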
Quote:
Originally Posted by sluge
One more note:I see a lof ot cases when smart parameters are OK, but disk has a lot of errors that reported by operation system
|
I've seen that too; that's typically when a "Predictive Failure" alert is triggered (for example, via SNMP).
I've been a SysAdmin for Solaris longer than I've been for Linux, so I've seen the Pros & Cons of each OS when it comes to disk monitoring. I also know *why* each OS does things differently:
Solaris (10)
Disk-monitoring grew from a time prior to SMART, when you were lucky if the HDD vendor included any sort of testing.
The developers of the OS instituted a 'simple' type of from-the-OS disk monitoring that has remained consistent on the surface for the better part of 15 years.
Solaris (10) - Pros
- Takes a "From the OS" perspective.
- Simple categories for errors (H/W, S/W, and Transport*) via 'iostat -en'
- Slightly more detailed error-counts via 'iostat -En'.
- Catches I/O and Controller errors via FMD (not iostat).
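To give a feel for how simple those 'iostat -en' categories are to consume, here is a rough Python sketch that turns the per-device error table into a dictionary and lists the devices with a non-zero total. The column layout (s/w, h/w, trn, tot, device) is what I remember from Solaris 10, so treat it as an assumption and compare it with your own output; the function names and the sample rows are made up for illustration.

```python
def parse_iostat_en(text):
    """Return {device: {"s/w": n, "h/w": n, "trn": n, "tot": n}}.

    Assumes data rows of four integer counters followed by a device name.
    """
    columns = ("s/w", "h/w", "trn", "tot")
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the banner and header rows: only rows with four integers
        # followed by a device name are treated as data.
        if len(fields) != 5 or not all(f.isdigit() for f in fields[:4]):
            continue
        counts[fields[4]] = dict(zip(columns, map(int, fields[:4])))
    return counts


def devices_with_errors(counts):
    """Return the sorted device names whose total error count is non-zero."""
    return sorted(dev for dev, c in counts.items() if c["tot"] > 0)


# Made-up sample output, for illustration only.
SAMPLE = """\
  ---- errors ---
  s/w h/w trn tot device
    0   0   0   0 c0t0d0
    0   0 5000 5000 c1t3d0
"""

if __name__ == "__main__":
    print(devices_with_errors(parse_iostat_en(SAMPLE)))
```

Because the counters start over at boot, a script like this is only good for spotting errors since the last reboot.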
Solaris (10) - Cons
- Each disk is monitored from the OS, not from the disk itself.
- Error counts are unreliable (they are reset when the server reboots).
- Exactly what constitutes a H/W, S/W, or Transport error depends on the nature of the failure, the controller, and the disk driver used.
Example: I've seen local FC-AL disks throw 5,000 Transport errors, but the hard drive was stone-dead (would not spin).
Linux
Most of the disk monitoring relies on the SMART support built into most disks. As Linux grew into the "official" server world of SCSI, FC-AL, and fiber-based SAN storage, other methods of disk-based monitoring haven't 'popped-up' (or they've escaped me for the better part of a decade).
Linux - Pros
SMART covers more than just "hardware" errors:
- Logs the life of the disk (Power_On_Hours)
- Temperature of the disk
- Captures various 'software' errors (Reallocated_Sector_Ct, Offline_Uncorrectable, Current_Pending_Sector)
* = Keep in mind that "Hardware" errors are things like "device failed to respond in time", "Software" errors are read/write errors on a sector of the disk, and "Transport" errors are basically just SCSI timeouts or a single path to a multi-pathed disk failing (even if temporarily).
Solaris does not (typically) count Controller errors amongst the three aforementioned categories of errors, though this depends on the driver that reports the error (Example: 'qlc' is a Controller Driver, whereas 'sd' and 'ssd' are Disk Drivers).
Now, what it really sounds like is that you're looking for any type of OS-based 'error counters' within Linux, exactly like what Solaris does.
However, because the kernel-level subsystems are (in some cases, vastly) different, Linux has no identical functionality that behaves exactly the way it does in Solaris.
Basically, Solaris is one type of road (think of a 2-way street), and Linux is another (a 2-way highway). Both are roads, and both get you from point A to point B, but each does so differently (though the basic goal remains the same).