device error summary statistics

sluge · 03-12-2012, 06:37 AM

On Solaris I can run iostat -en to get device error summary statistics, but iostat on Linux has no -en options.
Is there any way to get device error summary statistics on Linux?

xeleema · 03-13-2012, 12:43 PM

Greetingz!

There isn't anything that brief in Linux; we're stuck with smartctl.

1) Find out if your device supports SMART

Code:

root@linux# smartctl -i /dev/sda
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10EACS-00D6B1
Serial Number:    WD-############
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Mar 13 12:33:20 2012 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

root@linux#

2) If so, you can do a quick health check with this:

Code:

root@linux# smartctl --health /dev/sda
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@linux#

Check the man page of smartctl for more details. You may want to set up smartd (the monitoring daemon that ships with smartmontools) as well, or just write a custom script to call from cron.
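A minimal sketch of such a cron script, assuming the ATA-style "overall-health" line shown in the output above (SCSI drives word their health line differently, so the pattern would need adjusting). The function names health_passed and check_disk are made up for illustration.

```shell
#!/bin/sh
# Sketch of a cron-friendly SMART health check (hypothetical helper names).

# health_passed reads `smartctl --health` output on stdin and succeeds only
# if the overall self-assessment line reports PASSED.
health_passed() {
    grep -q '^SMART overall-health self-assessment test result: PASSED'
}

# check_disk runs smartctl on one device and prints a warning line when the
# health check does not report PASSED (including when smartctl itself fails).
check_disk() {
    dev=$1
    if ! smartctl --health "$dev" 2>/dev/null | health_passed; then
        echo "WARNING: SMART health check did not pass on $dev"
        return 1
    fi
}
```

Calling check_disk /dev/sda from a script that cron runs (as root) is enough for a first pass: it stays silent while the drive reports PASSED, and cron normally mails any output, so the warning line ends up in root's mailbox.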

sluge · 03-14-2012, 12:05 AM

Thanks, that is useful too, but I want to know: why doesn't Linux have statistics about disk I/O errors?

xeleema · 03-14-2012, 10:16 AM

Quote:

Originally Posted by sluge

...why doesn't Linux have statistics about disk I/O errors?

Linux simply relies on the SMART data that the drive already maintains. The Solaris kernel keeps its own count of disk errors, but those statistics vanish on every reboot.
RTFM for smartctl.

Code:

root@linux# smartctl --all /dev/sda
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10EACS-00D6B1
Serial Number:    WD-############
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Mar 14 10:13:44 2012 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (23400) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   161   158   021    Pre-fail  Always       -       6908
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       130
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1387
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       112
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       130
194 Temperature_Celsius     0x0022   109   098   000    Old_age   Always       -       41
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   185   000    Old_age   Always       -       78
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@linux#

sluge · 03-15-2012, 12:46 AM

So SMART covers only HW disk errors, but as far as I know, HW errors can also occur on the I/O controller, RAID controller, and other HW. I don't know whether Solaris iostat -Ee includes those types of errors, but it is also very useful to get SW disk I/O errors, such as filesystem errors and others. Solaris reports SW errors, but Linux does not.
One more note: I have seen a lot of cases where the SMART parameters are OK, but the disk has a lot of errors reported by the operating system.

xeleema · 03-15-2012, 04:49 AM

Quote:

Originally Posted by sluge

So SMART covers only HW disk errors

Not quite; in the Linux world SMART monitors the disks for both H/W and S/W problems. Controller problems (just like certain filesystem problems) are usually caught by other monitoring tools.

Quote:

Originally Posted by sluge

...as far as I know, HW errors can also occur on the I/O controller, RAID controller, and other HW.

If you're looking for something that does "From the Disk, up" monitoring, look into SNMP. I would also suggest you take a crack at lm_sensors, though if you have a wide variety of "generic" x86 hardware, the tweaking and tuning of things like voltage and fan monitors can turn into a headache (this is part of the reason why you see large datacenters standardize on a few models of servers, rather than 50 different kinds).

Quote:

Originally Posted by sluge

...Solaris iostat -Ee includes those types of errors, but it is also very useful to get SW disk I/O errors, such as filesystem errors and others. Solaris reports SW errors, but Linux does not.

The following SMART attributes are incremented when the kind of problem behind a filesystem error "and other" is encountered (for example, the system being unable to read or write a certain block):

Reallocated_Sector_Ct
Offline_Uncorrectable
Current_Pending_Sector
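To pull just those three counters out of the attribute table, something along these lines should work. It's a sketch that assumes the ten-column table layout shown in the smartctl --all output earlier in the thread (attribute name in column 2, RAW_VALUE in column 10); sector_counts and has_bad_sectors are names I made up.

```shell
#!/bin/sh
# sector_counts reads `smartctl --attributes` (or --all) output on stdin and
# prints "NAME RAW_VALUE" for the three sector-related attributes.
sector_counts() {
    awk '$2 == "Reallocated_Sector_Ct" ||
         $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable" { print $2, $10 }'
}

# has_bad_sectors succeeds (exit 0) if any of those raw values is above zero.
has_bad_sectors() {
    sector_counts | awk '$2 + 0 > 0 { found = 1 } END { exit !found }'
}
```

Usage would be along the lines of: smartctl --attributes /dev/sda | sector_counts. On a healthy disk like the one above, all three lines should end in 0.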

Quote:

Originally Posted by sluge

One more note: I have seen a lot of cases where the SMART parameters are OK, but the disk has a lot of errors reported by the operating system.

I've seen that too; that's typically when a "Predictive Failure" alert is triggered (for example, via SNMP).

I've been a SysAdmin for Solaris longer than I've been one for Linux, so I've seen the Pros & Cons of each OS when it comes to disk monitoring. I also know *why* each OS does things differently:

Solaris (10)
Disk-monitoring grew from a time prior to SMART, when you were lucky if the HDD vendor included any sort of testing.
The developers of the OS instituted a 'simple' type of from-the-OS disk monitoring that has remained consistent on the surface for the better part of 15 years.

Solaris (10) - Pros
- Takes a "From the OS" perspective.
- Simple categories for errors (H/W, S/W, and Transport*) via 'iostat -en'
- Slightly more detailed error-counts via 'iostat -En'.
- Catches I/O and Controller errors via FMD (not iostat).

Solaris (10) - Cons
- Each disk is monitored from the OS, not from the disk itself.
- Error counts are unreliable (they are reset when the server reboots).
- Exactly what constitutes a H/W, S/W, or Transport error depends on the nature of the failure, the controller, and the disk driver used.
Example: I've seen local FC-AL disks throw 5,000 Transport errors, but the hard drive was stone-dead (would not spin).

Linux
Most of the disk monitoring relies on the SMART support built into most disks. As Linux grew into the "official" server world of SCSI, FC-AL, and fiber-based SAN storage, other methods of disk-based monitoring haven't 'popped up' (or they've escaped me for the better part of a decade).

Linux - Pros
SMART covers more than just "hardware" errors:
- Logs the life of the disk (Power_On_Hours)
- Tracks the temperature of the disk (Temperature_Celsius)
- Captures various 'software' errors (Reallocated_Sector_Ct, Offline_Uncorrectable, Current_Pending_Sector)

* = Keep in mind that "Hardware" errors are things like "device failed to respond in time", "Software" errors are read/write errors on a sector of the disk, and "Transport" errors are basically just SCSI timeouts or when a single path fails to a multi-pathed disk (even if temporary)

Solaris does not (typically) count Controller errors amongst the three aforementioned categories of errors, though this depends on the driver that reports the error (Example: 'qlc' is a Controller Driver, whereas 'sd' and 'ssd' are Disk Drivers).

Now, what it really sounds like is that you're looking for some type of OS-based 'error counters' within Linux, exactly like what Solaris has.
However, because the kernel-level subsystems are (in some cases, vastly) different, there is no identical functionality that behaves exactly as it does in Solaris.
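The closest thing I'm aware of is a per-device counter the SCSI layer keeps in sysfs: disks handled by the SCSI subsystem (sd devices) expose an ioerr_cnt file under /sys/block/<dev>/device/, holding a hexadecimal count of commands that completed with an error. The sketch below assumes that file is present on your kernel and for your driver; it is not an iostat -en equivalent (one counter, no H/W, S/W, or Transport split), and being an in-memory counter it does not survive a reboot either. The function name scsi_error_counts and the optional root argument are made up for illustration.

```shell
#!/bin/sh
# scsi_error_counts prints "device count" for every block device under the
# given sysfs root (default /sys) that exposes the SCSI ioerr_cnt attribute.
scsi_error_counts() {
    root=${1:-/sys}
    for f in "$root"/block/*/device/ioerr_cnt; do
        # Skip devices without the attribute (and the unmatched glob itself).
        [ -r "$f" ] || continue
        dev=${f%/device/ioerr_cnt}
        dev=${dev##*/}
        read -r raw < "$f" || [ -n "$raw" ] || continue
        # The kernel prints the counter in hex (e.g. 0x1a); convert to decimal.
        echo "$dev $(($raw))"
    done
}
```

Running scsi_error_counts with no argument walks the real /sys; a disk that has never returned an error shows 0.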

Basically, Solaris is one type of road (think of a 2-way street), and Linux is another (a 2-way highway). Both are roads, and both get you from point A to point B, but they do so differently (though the basic goal remains the same).