LinuxQuestions.org

reano 11-07-2013 01:57 AM

RAID degraded, partition missing from md0
 
Hey guys,
We're having a very weird issue at work. Our Ubuntu server has 6 drives, set up with RAID1 as follows:

/dev/md0, consisting of:
/dev/sda1
/dev/sdb1

/dev/md1, consisting of:
/dev/sda2
/dev/sdb2

/dev/md2, consisting of:
/dev/sda3
/dev/sdb3

/dev/md3, consisting of:
/dev/sdc1
/dev/sdd1

/dev/md4, consisting of:
/dev/sde2
/dev/sdf2

As you can see, md0, md1 and md2 all use the same 2 drives (split into 3 partitions). I should also note that this is done via Ubuntu software RAID, not hardware RAID.

Today, the /dev/md0 RAID1 array shows as degraded: it is missing /dev/sdb1. But since /dev/sdb1 is only a partition (and /dev/sdb2 and /dev/sdb3 are working fine), it's obviously not the whole drive that's gone AWOL; it seems the partition itself is missing.

How is that even possible? And what could we do to fix it?

My output of cat /proc/mdstat:

Code:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]

md1 : active raid1 sda2[0] sdb2[1]
      24006528 blocks super 1.2 [2/2] [UU]


md2 : active raid1 sda3[0] sdb3[1]
      1441268544 blocks super 1.2 [2/2] [UU]


md0 : active raid1 sda1[0]
      1464710976 blocks super 1.2 [2/1] [U_]


md3 : active raid1 sdd1[1] sdc1[0]
      2930133824 blocks super 1.2 [2/2] [UU]


md4 : active raid1 sdf2[1] sde2[0]
      2929939264 blocks super 1.2 [2/2] [UU]


unused devices: <none>
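In this output, the degraded array is the one whose status brackets contain an underscore ([U_] instead of [UU]). A minimal sketch for picking such arrays out of mdstat-formatted text; the function name list_degraded is made up for illustration, and it assumes the two-line-per-array layout shown above:

```shell
#!/bin/sh
# list_degraded (hypothetical helper): read mdstat-formatted text on
# stdin and print the name of every array with a missing member.
list_degraded() {
    awk '
        # "md0 : active raid1 sda1[0]" starts a new array entry
        /^md[0-9]+ :/ { name = $1; next }
        # the status line ends in e.g. [UU] or [U_]; "_" marks a missing member
        name != "" && $NF ~ /^\[[U_]*_[U_]*\]$/ { print name }
    '
}
```

On a live system this would be called as list_degraded < /proc/mdstat. It only reads the file and changes nothing.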


Any help would be greatly appreciated!

evo2 11-07-2013 02:09 AM

Hi,

it's not so unusual to have problems with just one partition on a disk.

You can try to rebuild with the existing sdb, or you can replace the sdb and then rebuild. See for example http://www.howtoforge.com/replacing_..._a_raid1_array for the latter option.

However, before doing anything make sure you are familiar with: https://raid.wiki.kernel.org/index.php/Linux_Raid

Evo2.

reano 11-07-2013 02:13 AM

Thanks Evo2. Can you please explain how I'd go about trying the first option (rebuild with the existing sdb)? Safely, that is :P

evo2 11-07-2013 02:23 AM

Hi,

I didn't remember off the top of my head, but from a quick scan of https://raid.wiki.kernel.org/index.php/Reconstruction and the mdadm man page it looks like the first thing to try should be:
Code:

mdadm --assemble --scan
However, please check for yourself.

Evo2.

reano 11-07-2013 02:28 AM

Thanks - I've been doing a bit of reading on mdadm --assemble as well. Will this not damage or endanger any of the other raid devices or the raid setup itself? I can't have any of the other partitions or md-devices go down, as our mail services etc run on this same server.

reano 11-07-2013 06:02 AM

Actually, let me clarify - if I do a:

Code:

mdadm --assemble --scan
Then it will essentially be doing the same as:

Code:

mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1
My main concern here is: while it's doing that, what's happening with md0? md0 is online right now (albeit without its sdb1 mirror, only with sda1) and the root filesystem is mounted on md0. So if I do an assemble, will it interrupt the filesystem in any way, or can I safely do it while the server is running with users connected to it (which is 24/7, unfortunately)?

vishesh 11-07-2013 07:51 AM

I think it's better to stop the md device. What is the output of mdadm --detail /dev/md0?

Thanks

reano 11-07-2013 07:56 AM

I can't stop the device :(
Also, the / root filesystem is mounted on md0.

The output you requested is:

Code:

/dev/md0:
        Version : 1.2
  Creation Time : Sat Dec 29 17:09:45 2012
    Raid Level : raid1
    Array Size : 1464710976 (1396.86 GiB 1499.86 GB)
  Used Dev Size : 1464710976 (1396.86 GiB 1499.86 GB)
  Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Thu Nov  7 15:55:07 2013
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

          Name : lia:0  (local to host lia)
          UUID : eb302d19:ff70c7bf:401d63af:ed042d59
        Events : 26216

    Number  Major  Minor  RaidDevice State
      0      8        1        0      active sync  /dev/sda1
      1      0        0        1      removed

What's interesting is that it shows sdb1 as removed, not failed or spare.

vishesh 11-07-2013 08:04 AM

I think if it's showing removed, then the following command should recover it:

mdadm /dev/md0 -a /dev/sdb1

Thanks

reano 11-07-2013 08:13 AM

Is that not the same as mdadm /dev/md0 --add /dev/sdb1? If so, that doesn't work (see above for the error message I got when I tried that).

vishesh 11-07-2013 08:51 AM

I am unable to see any error message above. Ideally, for replacing a device, I follow:

mdadm /dev/md0 -f /dev/sdb1
mdadm /dev/md0 -r /dev/sdb1
mdadm /dev/md0 -a /dev/sdb1

Thanks

reano 11-07-2013 08:54 AM

Ah sorry, seems I didn't post the result in the original post. When I do the -a (or --add) I get the following:

Code:

mdadm: add new device failed for /dev/sdb1 as 2: Invalid argument
I haven't tried to do it in that order (first f, then r, then a). I can't damage anything further than it already is, can I? Keep in mind that sda1 and sdb1 (in other words, md0) contain the root filesystem. At the moment md0 seems to run only on sda1 (and not on sdb1). At least the server is still running.

reano 11-08-2013 12:34 AM

Got the following results:

Code:

root@lia:~# mdadm /dev/md0 -f /dev/sdb1
mdadm: set device faulty failed for /dev/sdb1:  No such device

root@lia:~# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot remove failed for /dev/sdb1: No such device or address

root@lia:~# mdadm /dev/md0 -a /dev/sdb1
mdadm: add new device failed for /dev/sdb1 as 2: Invalid argument


reano 11-13-2013 01:13 AM

Hate to bump a thread, but I still need help with this. Any advice, anyone? :)

evo2 11-13-2013 01:18 AM

Hi,

mdadm doesn't seem to see /dev/sdb1 at all. I suggest you investigate its status with other tools, e.g. fdisk.

Evo2.

reano 11-13-2013 02:32 AM

Ok sure. What exactly do you want me to check? sdb1 looks normal when I check the partition tables, compared to the other partitions/drives.

vishesh 11-13-2013 05:57 AM

Does the command below show any output?
Code:

ls -l /dev|grep sdb1
Thanks

reano 11-13-2013 06:17 AM

Yes, it shows:
Code:

brw-rw---- 1 root disk      8,  17 Nov  8 08:33 sdb1

Ser Olmy 11-13-2013 03:09 PM

Check the /dev directory and see if the /dev/sdb1 device actually exists. If it doesn't, you'll need to recreate it with fdisk, parted or whatever tool you prefer to use to manage partitions.

If the device is missing but the partition seems to be there, try running partprobe then check the /dev directory again.

reano 11-14-2013 12:26 AM

sdb1 is in the /dev directory :)

Ser Olmy 11-14-2013 03:09 AM

The next step is to figure out why mdadm returns an error message when you try to reference /dev/sdb1. See what
Code:

mdadm --examine /dev/sdb1
has to say about that partition.

According to /proc/mdstat (in your first post), /dev/md0 only has one member, /dev/sda1. As long as the /dev/sdb1 partition is valid and identical in size to /dev/sda1 (which fdisk -l /dev/sdb or parted /dev/sdb print should be able to confirm or deny), you should be able to re-add /dev/sdb1 with the following command:
Code:

mdadm --manage /dev/md0 --add /dev/sdb1
You may also want to check the health of /dev/sdb with:
Code:

smartctl -a /dev/sdb
In particular, examine the Reallocated_Sector_Ct and Current_Pending_Sector attributes. There has to be a reason why the partition was dropped from the RAID device.
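
The raw values of those attributes can be pulled out of the attribute table with a small filter. This is only a sketch: it assumes the ten-column table layout that smartctl -A prints for ATA drives (attribute name in column 2, raw value in column 10), and smart_raw is a hypothetical helper name.

```shell
#!/bin/sh
# smart_raw (hypothetical helper): read a SMART attribute table on stdin
# and print the RAW_VALUE column for the attribute named in $1.
smart_raw() {
    awk -v attr="$1" '$2 == attr { print $10 }'
}
```

For example, smartctl -A /dev/sdb | smart_raw Current_Pending_Sector would print just the pending-sector count, which makes it easy to compare before and after a repair attempt.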

reano 11-14-2013 03:19 AM

mdadm --examine /dev/sdb1 gives the following:

Code:

mdadm: No md superblock detected on /dev/sdb1.
parted /dev/sda print:

Code:

Model: ATA ST3000VX000-9YW1 (scsi)
Disk /dev/sda: 3001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start  End    Size    File system  Name  Flags
 1      1049kB  1500GB  1500GB  ext4              raid
 2      1500GB  1525GB  24,6GB                    raid
 3      1525GB  3001GB  1476GB                    raid

parted /dev/sdb print:

Code:

Model: ATA ST3000VX000-9YW1 (scsi)
Disk /dev/sdb: 3001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start  End    Size    File system  Name  Flags
 1      1049kB  1500GB  1500GB                    raid
 2      1500GB  1525GB  24,6GB                    raid
 3      1525GB  3001GB  1476GB                    raid

mdadm --manage /dev/md0 --add /dev/sdb1:

Code:

mdadm: add new device failed for /dev/sdb1 as 2: Invalid argument
smartctl -a /dev/sdb:

Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:    ST3000VX000-9YW166
Serial Number:    W1F0VJ95
LU WWN Device Id: 5 000c50 052d36854
Firmware Version: CV13
User Capacity:    3 000 592 982 016 bytes [3,00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:  8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Nov 14 11:17:32 2013 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  2) minutes.
SCT capabilities:              (0x10b9) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  111  099  006    Pre-fail  Always      -      34112212
  3 Spin_Up_Time            0x0003  095  095  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      97
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x000f  082  060  030    Pre-fail  Always      -      189255078
  9 Power_On_Hours          0x0032  090  090  000    Old_age  Always      -      8951
 10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      97
184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0
187 Reported_Uncorrect      0x0032  032  032  000    Old_age  Always      -      68
188 Command_Timeout        0x0032  100  099  000    Old_age  Always      -      12885098499
189 High_Fly_Writes        0x003a  001  001  000    Old_age  Always      -      264
190 Airflow_Temperature_Cel 0x0022  063  059  045    Old_age  Always      -      37 (Min/Max 34/38)
191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      89
193 Load_Cycle_Count        0x0032  100  100  000    Old_age  Always      -      1191
194 Temperature_Celsius    0x0022  037  041  000    Old_age  Always      -      37 (0 16 0 0)
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      15
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      15
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

SMART Error Log Version: 1
ATA Error Count: 3338 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3338 occurred at disk power-on lifetime: 8951 hours (372 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 09 08 00 00  Error: UNC at LBA = 0x00000809 = 2057

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 08 08 00 e0 00  7d+02:46:03.171  READ DMA
  27 00 00 00 00 00 e0 00  7d+02:46:03.159  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  7d+02:46:03.151  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  7d+02:46:03.103  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  7d+02:46:03.087  READ NATIVE MAX ADDRESS EXT

Error 3337 occurred at disk power-on lifetime: 8951 hours (372 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 09 08 00 00  Error: UNC at LBA = 0x00000809 = 2057

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 08 08 00 e0 00  7d+02:46:03.171  READ DMA
  27 00 00 00 00 00 e0 00  7d+02:46:03.159  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  7d+02:46:03.151  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  7d+02:46:03.103  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  7d+02:46:03.087  READ NATIVE MAX ADDRESS EXT

Error 3336 occurred at disk power-on lifetime: 8951 hours (372 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 09 08 00 00  Error: UNC at LBA = 0x00000809 = 2057

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 08 08 00 e0 00  7d+02:46:02.819  READ DMA
  27 00 00 00 00 00 e0 00  7d+02:46:02.807  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  7d+02:46:02.799  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  7d+02:46:02.727  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  7d+02:46:02.707  READ NATIVE MAX ADDRESS EXT

Error 3335 occurred at disk power-on lifetime: 8951 hours (372 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 09 08 00 00  Error: UNC at LBA = 0x00000809 = 2057

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 08 08 00 e0 00  7d+02:46:02.819  READ DMA
  27 00 00 00 00 00 e0 00  7d+02:46:02.807  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  7d+02:46:02.799  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  7d+02:46:02.727  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  7d+02:46:02.707  READ NATIVE MAX ADDRESS EXT

Error 3334 occurred at disk power-on lifetime: 8951 hours (372 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 09 08 00 00  Error: UNC at LBA = 0x00000809 = 2057

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 08 08 00 e0 00  7d+02:46:02.436  READ DMA
  27 00 00 00 00 00 e0 00  7d+02:46:02.435  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  7d+02:46:02.427  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  7d+02:46:02.371  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  7d+02:46:02.363  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed without error      00%      8933        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Ser Olmy 11-14-2013 03:53 AM

The /dev/sdb device has 15 "pending" sectors, meaning it's waiting for a write command to reallocate those sectors. While 15 is not an alarmingly large number, the fact that they're all "pending" rather than "reallocated" suggests the defects may have appeared at approximately the same time, which could be an indication of drive failure. You should run badblocks -ns on /dev/sdb1 before proceeding, and check the S.M.A.R.T. status for /dev/sdb again when it's done.

The "invalid argument" error is usually caused by a non-removed device. The "--add" command is only valid if the array is online and can be expanded, or if a device has been removed. However, the output from mdadm --detail /dev/md0 in post #8 does indeed show the second device as "removed". Strange.

Could you post the output from:
Code:

ls /sys/block/md0/md/
Also, do any messages appear in the logs when you try to add back /dev/sdb1 to the array?

reano 11-14-2013 04:06 AM

I can't run the badblocks at the moment, as it uses all the server resources and totally kills the network users logged onto it :/

Which log file specifically do you want me to check when I try to add the device back to md0?

Output of ls /sys/block/md0/md/ is:
Code:

array_size      layout            reshape_position    sync_max
array_state      level            resync_start        sync_min
bitmap          max_read_errors  safe_mode_delay      sync_speed
bitmap_set_bits  metadata_version  suspend_hi          sync_speed_max
chunk_size      mismatch_cnt      suspend_lo          sync_speed_min
component_size  new_dev          sync_action
degraded        raid_disks        sync_completed
dev-sda1        rd0              sync_force_parallel

Ser Olmy 11-14-2013 04:35 PM

Do a tail -f /var/log/messages in one terminal window while you attempt to add /dev/sdb1 to md0 in another.

The files in /sys/block/md0/md confirm that there's no reference from md0 to anything other than /dev/sda1. It should be possible to add another device/partition.
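
The same check can be scripted: each member the kernel knows about shows up as a dev-NAME entry in that directory. A rough sketch, assuming that naming convention holds on this kernel; md_members is a hypothetical helper name:

```shell
#!/bin/sh
# md_members (hypothetical helper): list the member devices the kernel
# has registered for an array, given its md sysfs directory as $1.
md_members() {
    for entry in "$1"/dev-*; do
        [ -e "$entry" ] || continue   # glob matched nothing
        name=${entry##*/}             # strip the directory part
        echo "${name#dev-}"           # strip the "dev-" prefix
    done
}
```

With the listing above, md_members /sys/block/md0/md would print only sda1.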

reano 11-15-2013 12:40 AM

I don't have a /var/log/messages, but I did do a tail on the syslog, and it showed the following while trying to add the partition back to md0:

Code:

Nov 15 08:38:25 lia kernel: [674827.954967] ata1: EH complete
Nov 15 08:38:25 lia kernel: [674828.187410] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov 15 08:38:25 lia kernel: [674828.187416] ata1.01: failed command: READ DMA
Nov 15 08:38:25 lia kernel: [674828.187422] ata1.01: cmd c8/00:08:08:08:00/00:00:00:00:00/f0 tag 0 dma 4096 in
Nov 15 08:38:25 lia kernel: [674828.187424]          res 51/40:00:09:08:00/00:00:00:00:00/10 Emask 0x9 (media error)
Nov 15 08:38:25 lia kernel: [674828.187427] ata1.01: status: { DRDY ERR }
Nov 15 08:38:25 lia kernel: [674828.187430] ata1.01: error: { UNC }
Nov 15 08:38:25 lia kernel: [674828.242074] ata1.00: configured for UDMA/133


Ser Olmy 11-15-2013 05:47 AM

It seems the md driver ran into one of the bad sectors on the drive. If you can't run badblocks, try using dd to overwrite the partition with zeros:
Code:

dd if=/dev/zero of=/dev/sdb1 bs=8192 oflag=direct
That should trigger a reallocation of any bad sectors.

The "oflag=direct" parameter bypasses the cache, and has the effect of slowing the process down significantly. With any luck, the other users won't notice anything. The real reason it's there, however, is to prevent cache management from doing read-ahead, as that would cause it to attempt to read the bad sectors, which in turn would cause dd to abort.

reano 11-15-2013 06:15 AM

Thank you! I'll do that now. Once it's done, is there anything specific I need to do BEFORE trying to --add the sdb1 partition to md0 again?

Ser Olmy 11-15-2013 06:21 AM

I'd check the S.M.A.R.T. status again. The Current_Pending_Sector counter should show a number lower than 15 (0, ideally).

Other than that, there's nothing in particular you need to consider before attempting to add the partition to the RAID array again.

reano 11-15-2013 10:25 AM

Thank you so much. After the process completed, there were 0 pending sectors. I then successfully re-added sdb1 to md0, and it is now busy with recovery!
I just hope the recovery process completes without any issues. I'll let you know!

One thing that strikes me as a bit weird though: in all the other arrays, the disks have IDs 0 and 1. But on md0, sda1 is ID 0, and the re-added sdb1 is ID 2, not ID 1. Does that make a difference?

Output of cat /proc/mdstat:

Code:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[2] sda1[0]
      1464710976 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  4.3% (63596480/1464710976) finish=318.5min speed=73315K/sec

md1 : active raid1 sda2[0] sdb2[1]
      24006528 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
      1441268544 blocks super 1.2 [2/2] [UU]

md3 : active raid1 sdc1[0] sdd1[1]
      2930133824 blocks super 1.2 [2/2] [UU]

md4 : active raid1 sdf2[1] sde2[0]
      2929939264 blocks super 1.2 [2/2] [UU]

unused devices: <none>
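The recovery line in this output carries both the percentage done and the estimated time remaining. A small sketch for extracting just those two values per array, assuming the line format shown above; recovery_progress is a hypothetical helper name:

```shell
#!/bin/sh
# recovery_progress (hypothetical helper): read mdstat-formatted text on
# stdin and print "<array> <percent> <time left>" for every array that
# is currently recovering or resyncing.
recovery_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /(recovery|resync) *=/ {
            pct = ""; eta = ""
            if (match($0, /[0-9]+\.[0-9]+%/))
                pct = substr($0, RSTART, RLENGTH)
            if (match($0, /finish=[^ ]+/))
                eta = substr($0, RSTART + 7, RLENGTH - 7)  # drop "finish="
            print name, pct, eta
        }
    '
}
```

Called as recovery_progress < /proc/mdstat, it only reads the file, so it is safe to run repeatedly while the rebuild is in progress.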


reano 11-15-2013 11:04 AM

Seems I spoke too soon. About 20% into the recovery process sdb1 failed again, and this time sdb2 in md1 also failed. Seems the whole sdb drive is busted.

Code:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[2](F) sda1[0]
      1464710976 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1](F)
      24006528 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sdb3[1] sda3[0]
      1441268544 blocks super 1.2 [2/2] [UU]

md3 : active raid1 sdc1[0] sdd1[1]
      2930133824 blocks super 1.2 [2/2] [UU]

md4 : active raid1 sdf2[1] sde2[0]
      2929939264 blocks super 1.2 [2/2] [UU]

unused devices: <none>

I'll have to replace the drive. Now the tricky part is, how do I know which physical hard drive is sdb? Is there a way to tell?

Meh.

Ser Olmy 11-15-2013 11:17 AM

Quote:

Originally Posted by reano (Post 5065077)
I'll have to replace the drive. Now the tricky part is, how do I know which physical hard drive is sdb? Is there a way to tell?

Now you know why RAID array drives should be clearly labeled...

Assuming these are SATA drives, sdb is (most likely) the drive connected to the SATA port with the second lowest number that's in use.

Since it's no longer part of the array, it will be the only inactive drive. If the drives have on-board activity LEDs (few do these days), you should be able to tell by just looking.

You could try spinning the drive down with hdparm -Y. You should be able to hear it power down.
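
Another option, assuming the serial number is printed on each drive's label (usual, but worth verifying), is to match serial numbers. The smartctl output posted earlier lists W1F0VJ95 for /dev/sdb. A sketch that pulls the serial out of smartctl -i style output; drive_serial is a hypothetical helper name:

```shell
#!/bin/sh
# drive_serial (hypothetical helper): read smartctl information-section
# text on stdin and print the value of the "Serial Number:" line.
drive_serial() {
    awk '/^Serial Number:/ { print $3 }'
}
```

Running smartctl -i /dev/sdX | drive_serial for each of the six drives would give a device-to-serial list to compare against the labels, provided the drive still answers SMART queries.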

reano 11-15-2013 11:27 AM

Quote:

Originally Posted by Ser Olmy (Post 5065083)
Now you know why RAID array drives should be clearly labeled...

Yup, lesson learned indeed.

I'll try the hdparm on Monday. Is there a way to power it back up, as I might need to toggle it a few times to find the right one - there are 6 drives in that box :S

Also, before I power down the drive and replace it, I'll need to remove sdb1, sdb2 and sdb3 from md0, md1 and md2. Do I just do that normally, as in:

Code:

mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1

mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm --manage /dev/md1 --remove /dev/sdb2


mdadm --manage /dev/md2 --fail /dev/sdb3
mdadm --manage /dev/md2 --remove /dev/sdb3

Or is there another way to go about it?

Ser Olmy 11-15-2013 11:33 AM

No, that's how you do it; first "--fail", then "--remove".

(And any kind of disk access should wake a sleeping drive, like running fdisk or parted, or dd'ing a few blocks to /dev/null.)
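
For several partitions on the same drive, the fail-then-remove pairs can be generated in a loop. The sketch below is a dry run that only prints the commands so they can be reviewed before anything is executed; print_removal_cmds and its array:member argument format are made up for illustration:

```shell
#!/bin/sh
# print_removal_cmds (hypothetical helper): for each "array:member"
# argument, print the --fail and --remove commands in that order.
# Nothing is executed; the output is meant to be reviewed first.
print_removal_cmds() {
    for pair in "$@"; do
        array=${pair%%:*}    # part before the colon, e.g. md0
        member=${pair#*:}    # part after the colon, e.g. sdb1
        echo "mdadm --manage /dev/$array --fail /dev/$member"
        echo "mdadm --manage /dev/$array --remove /dev/$member"
    done
}
```

For this server the call would be print_removal_cmds md0:sdb1 md1:sdb2 md2:sdb3, which prints the same six commands listed in the question above.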

reano 11-15-2013 11:38 AM

Now something very concerning started happening.
I wanted to install a package using apt-get. I got the following error:

Code:

root@lia:~# apt-get install gdisk
-bash: /usr/bin/apt-get: Input/output error

So then I did:

Code:

root@lia:~# smartctl -a /dev/sdb

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:              /0:0:1:0
Product:
User Capacity:        600 332 565 813 390 450 bytes [600 PB]
Logical block size:  774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Bus error

But, I also get the following on sda:

Code:

root@lia:~# smartctl -a /dev/sda

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:              /0:0:0:0
Product:
User Capacity:        600 332 565 813 390 450 bytes [600 PB]
Logical block size:  774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Bus error

What the heck....? Is sda failing now as well?

cat /proc/mdstat still shows:

Code:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[2](F) sda1[0]
      1464710976 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1](F)
      24006528 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sdb3[1] sda3[0]
      1441268544 blocks super 1.2 [2/2] [UU]

md3 : active raid1 sdc1[0] sdd1[1]
      2930133824 blocks super 1.2 [2/2] [UU]

md4 : active raid1 sdf2[1] sde2[0]
      2929939264 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Indicating that only sdb failed, with 2 out of the 3 partitions down so far.

Ser Olmy 11-15-2013 11:41 AM

The faulty drive may be blocking the controller. An emergency reboot may be in order here.

You also need to check the S.M.A.R.T. status of all remaining drives asap.

(For instance, are you sure the rebuild failure was caused by a write error on /dev/sdb, and not a read error on /dev/sda?)

reano 11-15-2013 11:48 AM

Quote:

Originally Posted by Ser Olmy (Post 5065112)
The faulty drive may be blocking the controller. An emergency reboot may be in order here.

You also need to check the S.M.A.R.T. status of all remaining drives asap.

(For instance, are you sure the rebuild failure was caused by a write error on /dev/sdb, and not a read error on /dev/sda?)

Normal reboot console command? Or is there another way to do an emergency reboot?

The other drives:

sdc has 0 pending sectors.
sdd has 24 pending sectors, and shows "Error 244 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)"
sde has 0 pending sectors, but also shows "Error 51 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)"
sdf has 0 pending sectors.

This spells crisis to me :/ Of the 6 drives, 3 seem to be busted, one on each array - and I have no idea what's going on with sda.

Ser Olmy 11-15-2013 11:55 AM

The reboot command (or the 3-fingered salute) should be used, if possible. Only when that fails should one resort to alternate strategies involving the SysRq key or the power button.
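For a remote box where the normal reboot command hangs, the SysRq route can also be driven from a root shell. A last-resort sketch, assuming the kernel exposes /proc/sysrq-trigger; sysrq_reboot and the SYSRQ_DELAY variable are illustrative names, and the sync step may itself stall if the controller is blocked:

```shell
#!/bin/sh
# Hypothetical helper: sync, remount read-only, then reboot via magic SysRq.
# The trigger path is a parameter only so the sequence can be dry-run on a file.
sysrq_reboot() {
    trigger="${1:-/proc/sysrq-trigger}"
    for key in s u b; do          # s = sync, u = remount read-only, b = reboot
        echo "$key" > "$trigger" || return 1
        sleep "${SYSRQ_DELAY:-2}"
    done
}
```

The final "b" reboots immediately without a clean shutdown, so this is only for when nothing gentler responds.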

Have you been checking these arrays regularly? I run
Code:

echo check > /sys/devices/virtual/block/<md device>/md/sync_action
at least once a week. Also, one should always monitor the S.M.A.R.T. status of all drives with smartd.
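To cover every array rather than one, the same write can be looped. A rough sketch, assuming the arrays appear as /sys/block/md*/md/sync_action and that a check should only be requested while an array is idle; start_md_checks is an illustrative name:

```shell
#!/bin/sh
# Hypothetical helper: request a consistency check on every idle md array.
# The sysfs root is a parameter so the loop can be tried against a test tree.
start_md_checks() {
    root="${1:-/sys/block}"
    for f in "$root"/md*/md/sync_action; do
        [ -w "$f" ] || continue                  # no arrays, or not writable
        [ "$(cat "$f")" = "idle" ] || continue   # skip arrays already syncing
        echo check > "$f"
        echo "check requested: ${f%/md/sync_action}"
    done
}
```

Run as root from a weekly cron job; once a check finishes, the neighbouring mismatch_cnt file shows how many mismatches were found.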

reano 11-15-2013 11:56 AM

Quote:

Originally Posted by Ser Olmy (Post 5065122)
The reboot command (or the 3-fingered salute) should be used, if possible. Only when that fails should one resort to alternate strategies involving the SysRq key or the power button.

Have you been checking these arrays regularly? I run
Code:

echo check > /sys/devices/virtual/block/<md device>/md/sync_action
at least weekly. Also, one should always monitor the S.M.A.R.T. status of all drives with smartd.

Do I need to remove any drives before rebooting? The server is offsite, and I'm accessing it remotely at the moment.
EDIT: Just lost remote connection. Server is still up as it's still routing traffic, but I can't access it via SSH anymore.

Ser Olmy 11-15-2013 12:02 PM

It would have been really great if someone could unplug the drive causing these bus errors, but the problem is we don't know with 100 % certainty that /dev/sdb is the culprit (although it's more than likely). Also, the drives aren't labeled.

Does this server have built-in remote access functionality, or do you have to rely on the OS?

Edit: I guess you need the OS, parts of which are probably spewing "oops" messages at the console right now.

reano 11-15-2013 12:04 PM

Quote:

Originally Posted by Ser Olmy (Post 5065124)
It would have been really great if someone could unplug the drive causing these bus errors, but the problem is we don't know with 100 % certainty that /dev/sdb is the culprit (although it's more than likely). Also, the drives aren't labeled.

Does this server have built-in remote access functionality, or do you have to rely on the OS?

Edit: I guess you need the OS, parts of which are probably spewing "oops" messages at the console right now.

See my edit. I'll have to drive in and shutdown -h, then locate sdb, disconnect it, and start her back up. Anything else I need to know before going in? (if the server doesn't come back up I won't have internet access from the premises... talk about a double-crisis)

Ser Olmy 11-15-2013 12:07 PM

Make sure to bring a live CD (like, say, System Rescue CD) in case the system fails to boot. You could even set up an emergency NAT router with a CD/DVD like that.

reano 11-15-2013 12:09 PM

Quote:

Originally Posted by Ser Olmy (Post 5065129)
Make sure to bring a live CD (like, say, System Rescue CD) in case the system fails to boot. You could even set up an emergency NAT router with a CD/DVD like that.

Will do. If possible, I'll still try to remove sdb1,2,3 from md0,1,2 before shutting down and removing the drive. Right?

Ser Olmy 11-15-2013 12:10 PM

Sounds like a plan.

reano 11-15-2013 12:14 PM

Also, any idea why we're seeing errors on 3 drives instead of 1 (refer to post #37)? Normally I'd suspect a RAID controller, but this is software raid.

Ser Olmy 11-15-2013 12:23 PM

Must be the drives. There's no way other hardware or software can make a drive report "pending sectors" via S.M.A.R.T. Media error is the only possibility.
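The attributes that point at media errors can be pulled out of the attribute table for a quick side-by-side comparison. A small sketch, assuming the ten-column attribute layout printed by smartctl 5.41 in this thread; smart_media_summary is a made-up name:

```shell
#!/bin/sh
# Hypothetical helper: reduce a SMART attribute table to the media-error counters.
# Usage: smartctl -A /dev/sdd | smart_media_summary
smart_media_summary() {
    awk '$2 == "Reallocated_Sector_Ct"  ||
         $2 == "Reported_Uncorrect"     ||
         $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable"  { print $2 "=" $10 }'   # $10 = RAW_VALUE
}
```

Any non-zero value here is worth watching, and a pending-sector count that keeps growing is a strong hint to plan a replacement.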

reano 11-15-2013 01:55 PM

Ok, I'm on the premises. I turned off the server (it was hanging with a lot of error messages, like you predicted). I removed sdb (I looked for the serial number on the drive casing, to match the serial number as reported by smartctl on sdb).

Booted up, and it's running now. But here's the really strange thing:

Code:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md3 : active raid1 sdc1[1] sdb1[0]
      2930133824 blocks super 1.2 [2/2] [UU]
     
md0 : active raid1 sda1[0]
      1464710976 blocks super 1.2 [2/1] [U_]
     
md1 : active (auto-read-only) raid1 sda2[0]
      24006528 blocks super 1.2 [2/1] [U_]
     
md2 : active raid1 sda3[0]
      1441268544 blocks super 1.2 [2/1] [U_]
     
md4 : active raid1 sdd2[0] sde2[1]
      2929939264 blocks super 1.2 [2/2] [UU]
     
unused devices: <none>

I definitely removed sdb, yet now sdf is missing and sdb is still there. Also, that mdstat doesn't make any sense; look at it closely... Looks like sdf became sdb, or something. Compare this with how mdstat used to look before:

Code:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[2] sda1[0]
      1464710976 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  4.3% (63596480/1464710976) finish=318.5min speed=73315K/sec

md1 : active raid1 sda2[0] sdb2[1]
      24006528 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
      1441268544 blocks super 1.2 [2/2] [UU]

md3 : active raid1 sdc1[0] sdd1[1]
      2930133824 blocks super 1.2 [2/2] [UU]

md4 : active raid1 sdf2[1] sde2[0]
      2929939264 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Btw, my swap partition runs on md1, but it shows as auto read-only?

EDIT: Here are the md device details:

Code:

/dev/md0:
        Version : 1.2
  Creation Time : Sat Dec 29 17:09:45 2012
    Raid Level : raid1
    Array Size : 1464710976 (1396.86 GiB 1499.86 GB)
  Used Dev Size : 1464710976 (1396.86 GiB 1499.86 GB)
  Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Fri Nov 15 22:08:29 2013
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

          Name : lia:0  (local to host lia)
          UUID : eb302d19:ff70c7bf:401d63af:ed042d59
        Events : 513922

    Number  Major  Minor  RaidDevice State
      0      8        1        0      active sync  /dev/sda1
      1      0        0        1      removed

Code:

/dev/md1:
        Version : 1.2
  Creation Time : Sat Dec 29 17:09:50 2012
    Raid Level : raid1
    Array Size : 24006528 (22.89 GiB 24.58 GB)
  Used Dev Size : 24006528 (22.89 GiB 24.58 GB)
  Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Fri Nov 15 15:36:33 2013
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

          Name : lia:1  (local to host lia)
          UUID : 1f8dff14:bc317bcb:d3587249:9ffc0b42
        Events : 58

    Number  Major  Minor  RaidDevice State
      0      8        2        0      active sync  /dev/sda2
      1      0        0        1      removed

Code:

/dev/md2:
        Version : 1.2
  Creation Time : Sat Dec 29 17:09:59 2012
    Raid Level : raid1
    Array Size : 1441268544 (1374.50 GiB 1475.86 GB)
  Used Dev Size : 1441268544 (1374.50 GiB 1475.86 GB)
  Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Fri Nov 15 21:42:19 2013
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

          Name : lia:2  (local to host lia)
          UUID : 543b8db0:660e4e18:d388dec8:b9fe81cb
        Events : 103

    Number  Major  Minor  RaidDevice State
      0      8        3        0      active sync  /dev/sda3
      1      0        0        1      removed

Code:

/dev/md3:
        Version : 1.2
  Creation Time : Sat Dec 29 17:10:04 2012
    Raid Level : raid1
    Array Size : 2930133824 (2794.39 GiB 3000.46 GB)
  Used Dev Size : 2930133824 (2794.39 GiB 3000.46 GB)
  Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Fri Nov 15 21:48:23 2013
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

          Name : lia:3  (local to host lia)
          UUID : 2a35faa7:b076b115:f2e45d70:e9e0f885
        Events : 72

    Number  Major  Minor  RaidDevice State
      0      8      17        0      active sync  /dev/sdb1
      1      8      33        1      active sync  /dev/sdc1

Code:

/dev/md4:
        Version : 1.2
  Creation Time : Sat Dec 29 17:10:15 2012
    Raid Level : raid1
    Array Size : 2929939264 (2794.21 GiB 3000.26 GB)
  Used Dev Size : 2929939264 (2794.21 GiB 3000.26 GB)
  Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Fri Nov 15 22:08:50 2013
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

          Name : lia:4  (local to host lia)
          UUID : 18cafde6:cdd0d6ad:e80fe7e2:a346e157
        Events : 196

    Number  Major  Minor  RaidDevice State
      0      8      50        0      active sync  /dev/sdd2
      1      8      66        1      active sync  /dev/sde2

I'll post the smartctl stats in the next post, this one is getting a bit long.

reano 11-15-2013 02:22 PM

...continued from previous post...

sda:
Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:    ST3000VX000-9YW166
Serial Number:    Z1F0SK6G
LU WWN Device Id: 5 000c50 04dcd6768
Firmware Version: CV13
User Capacity:    3 000 592 982 016 bytes [3,00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:  8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Nov 15 22:17:20 2013 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  2) minutes.
SCT capabilities:              (0x10b9) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  117  099  006    Pre-fail  Always      -      157046752
  3 Spin_Up_Time            0x0003  095  095  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      97
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x000f  082  060  030    Pre-fail  Always      -      193004742
  9 Power_On_Hours          0x0032  090  090  000    Old_age  Always      -      8982
 10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      97
184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0
187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0
188 Command_Timeout        0x0032  100  099  000    Old_age  Always      -      1
189 High_Fly_Writes        0x003a  001  001  000    Old_age  Always      -      896
190 Airflow_Temperature_Cel 0x0022  063  055  045    Old_age  Always      -      37 (Min/Max 32/37)
191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      89
193 Load_Cycle_Count        0x0032  100  100  000    Old_age  Always      -      326
194 Temperature_Celsius    0x0022  037  045  000    Old_age  Always      -      37 (0 16 0 0)
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

sdb:
Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:    ST3000VX000-9YW166
Serial Number:    Z1F0SN8B
LU WWN Device Id: 5 000c50 04dcd6911
Firmware Version: CV13
User Capacity:    3 000 592 982 016 bytes [3,00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:  8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Nov 15 22:18:19 2013 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  2) minutes.
SCT capabilities:              (0x10b9) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  117  099  006    Pre-fail  Always      -      142164536
  3 Spin_Up_Time            0x0003  095  094  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      97
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x000f  070  060  030    Pre-fail  Always      -      11890152
  9 Power_On_Hours          0x0032  090  090  000    Old_age  Always      -      8983
 10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      97
184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0
187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0
188 Command_Timeout        0x0032  100  100  000    Old_age  Always      -      0
189 High_Fly_Writes        0x003a  001  001  000    Old_age  Always      -      114
190 Airflow_Temperature_Cel 0x0022  068  059  045    Old_age  Always      -      32 (Min/Max 31/33)
191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      89
193 Load_Cycle_Count        0x0032  090  090  000    Old_age  Always      -      21074
194 Temperature_Celsius    0x0022  032  041  000    Old_age  Always      -      32 (0 15 0 0)
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

sdc:
Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:    ST3000VX000-9YW166
Serial Number:    Z1F0SML8
LU WWN Device Id: 5 000c50 04dcd1e8e
Firmware Version: CV13
User Capacity:    3 000 592 982 016 bytes [3,00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:  8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Nov 15 22:19:47 2013 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  2) minutes.
SCT capabilities:              (0x10b9) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  114  098  006    Pre-fail  Always      -      66583096
  3 Spin_Up_Time            0x0003  095  094  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      97
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x000f  070  060  030    Pre-fail  Always      -      11716429
  9 Power_On_Hours          0x0032  090  090  000    Old_age  Always      -      8981
 10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      97
184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0
187 Reported_Uncorrect      0x0032  001  001  000    Old_age  Always      -      263
188 Command_Timeout        0x0032  100  099  000    Old_age  Always      -      1
189 High_Fly_Writes        0x003a  001  001  000    Old_age  Always      -      314
190 Airflow_Temperature_Cel 0x0022  066  058  045    Old_age  Always      -      34 (Min/Max 31/34)
191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      89
193 Load_Cycle_Count        0x0032  090  090  000    Old_age  Always      -      20770
194 Temperature_Celsius    0x0022  034  042  000    Old_age  Always      -      34 (0 14 0 0)
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      24
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      24
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

SMART Error Log Version: 1
ATA Error Count: 248 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 248 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  40d+22:06:28.428  READ DMA EXT
  27 00 00 00 00 00 e0 00  40d+22:06:28.427  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  40d+22:06:28.419  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  40d+22:06:28.339  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  40d+22:06:28.331  READ NATIVE MAX ADDRESS EXT

Error 247 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  40d+22:06:25.531  READ DMA EXT
  27 00 00 00 00 00 e0 00  40d+22:06:25.531  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  40d+22:06:25.522  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  40d+22:06:25.443  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  40d+22:06:25.435  READ NATIVE MAX ADDRESS EXT

Error 246 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  40d+22:06:22.671  READ DMA EXT
  27 00 00 00 00 00 e0 00  40d+22:06:22.670  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  40d+22:06:22.662  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  40d+22:06:22.590  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  40d+22:06:22.574  READ NATIVE MAX ADDRESS EXT

Error 245 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  40d+22:06:19.803  READ DMA EXT
  27 00 00 00 00 00 e0 00  40d+22:06:19.802  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  40d+22:06:19.794  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  40d+22:06:19.714  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  40d+22:06:19.706  READ NATIVE MAX ADDRESS EXT

Error 244 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  40d+22:06:16.934  READ DMA EXT
  27 00 00 00 00 00 e0 00  40d+22:06:16.933  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  40d+22:06:16.925  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  40d+22:06:16.846  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  40d+22:06:16.830  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

...continued in next post...

reano 11-15-2013 02:23 PM

...continued from previous post...

sdd:
Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:    ST3000VX000-9YW166
Serial Number:    Z1F0R4EY
LU WWN Device Id: 5 000c50 04dc4a62e
Firmware Version: CV13
User Capacity:    3 000 592 982 016 bytes [3,00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:  8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Nov 15 22:20:51 2013 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  2) minutes.
SCT capabilities:              (0x10b9) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  116  099  006    Pre-fail  Always      -      117184888
  3 Spin_Up_Time            0x0003  095  094  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      97
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x000f  077  060  030    Pre-fail  Always      -      53608287
  9 Power_On_Hours          0x0032  090  090  000    Old_age  Always      -      8988
 10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      97
184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0
187 Reported_Uncorrect      0x0032  046  046  000    Old_age  Always      -      54
188 Command_Timeout        0x0032  100  100  000    Old_age  Always      -      0
189 High_Fly_Writes        0x003a  020  020  000    Old_age  Always      -      80
190 Airflow_Temperature_Cel 0x0022  064  059  045    Old_age  Always      -      36 (Min/Max 31/36)
191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      89
193 Load_Cycle_Count        0x0032  063  063  000    Old_age  Always      -      75120
194 Temperature_Celsius    0x0022  036  041  000    Old_age  Always      -      36 (0 16 0 0)
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

SMART Error Log Version: 1
ATA Error Count: 54 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 54 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  12d+21:36:44.943  READ DMA EXT
  27 00 00 00 00 00 e0 00  12d+21:36:44.942  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  12d+21:36:44.894  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  12d+21:36:44.886  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  12d+21:36:44.886  READ NATIVE MAX ADDRESS EXT

Error 53 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  12d+21:36:42.102  READ DMA EXT
  27 00 00 00 00 00 e0 00  12d+21:36:42.101  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  12d+21:36:42.094  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  12d+21:36:42.094  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  12d+21:36:42.093  READ NATIVE MAX ADDRESS EXT

Error 52 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  12d+21:36:39.290  READ DMA EXT
  27 00 00 00 00 00 e0 00  12d+21:36:39.289  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  12d+21:36:39.216  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  12d+21:36:39.209  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  12d+21:36:39.209  READ NATIVE MAX ADDRESS EXT

Error 51 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  12d+21:36:36.421  READ DMA EXT
  27 00 00 00 00 00 e0 00  12d+21:36:36.420  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  12d+21:36:36.364  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  12d+21:36:36.356  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  12d+21:36:36.356  READ NATIVE MAX ADDRESS EXT

Error 50 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  12d+21:36:33.584  READ DMA EXT
  27 00 00 00 00 00 e0 00  12d+21:36:33.583  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  12d+21:36:33.576  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  12d+21:36:33.575  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  12d+21:36:33.575  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
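
A quick way to judge whether the logged errors on sdd are recent is to compare each error's power-on lifetime stamp with the current Power_On_Hours attribute. The sketch below is only an illustration: the excerpt is copied from the sdd report above, and the function name `error_ages` is made up for this example.

```python
import re

# Lines copied from the sdd report above (attribute 9 plus two error headers).
SDD_EXCERPT = """\
  9 Power_On_Hours          0x0032  090  090  000    Old_age  Always      -      8988
Error 54 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)
Error 50 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)
"""


def error_ages(text):
    """Return {error_number: power-on hours elapsed since that error}."""
    power_on = re.search(r"^\s*9\s+Power_On_Hours\s.*\s(\d+)\s*$", text, re.M)
    if power_on is None:
        raise ValueError("Power_On_Hours attribute not found")
    now = int(power_on.group(1))
    pattern = r"^Error (\d+) occurred at disk power-on lifetime: (\d+) hours"
    return {int(n): now - int(h) for n, h in re.findall(pattern, text, re.M)}


if __name__ == "__main__":
    for number, age in sorted(error_ages(SDD_EXCERPT).items()):
        days, hours = divmod(age, 24)
        print(f"Error {number}: {age} power-on hours ago ({days} days + {hours} hours)")
```

The lifetime stamp is used here rather than Powered_Up_Time because, as the log header says, Powered_Up_Time wraps after 49.710 days.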

sde:
Code:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:    ST3000VX000-9YW166
Serial Number:    Z1F0SMES
LU WWN Device Id: 5 000c50 04dcd3ad1
Firmware Version: CV13
User Capacity:    3 000 592 982 016 bytes [3,00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:  8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Nov 15 22:21:56 2013 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  2) minutes.
SCT capabilities:              (0x10b9) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000f  113  099  006    Pre-fail  Always      -      54310680
  3 Spin_Up_Time            0x0003  095  094  000    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      97
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x000f  077  060  030    Pre-fail  Always      -      55382099
  9 Power_On_Hours          0x0032  090  090  000    Old_age  Always      -      8988
 10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      97
184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0
187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0
188 Command_Timeout        0x0032  100  100  000    Old_age  Always      -      0
189 High_Fly_Writes        0x003a  001  001  000    Old_age  Always      -      393
190 Airflow_Temperature_Cel 0x0022  066  062  045    Old_age  Always      -      34 (Min/Max 29/34)
191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      89
193 Load_Cycle_Count        0x0032  064  064  000    Old_age  Always      -      73794
194 Temperature_Celsius    0x0022  034  040  000    Old_age  Always      -      34 (0 15 0 0)
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
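
With this many attribute tables it is easy to miss the one row that differs between drives. The following is a minimal sketch, assuming the ten-column table layout shown above; the set of watched IDs is a choice made for this example, not a smartmontools default, and `nonzero_watched` is a hypothetical helper name. The rows are copied from the sdd and sde reports.

```python
# Attribute IDs picked for this sketch: reallocated, reported uncorrectable,
# pending, and offline-uncorrectable sectors.
WATCHED_IDS = {5, 187, 197, 198}

SDD_ROWS = """\
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      0
187 Reported_Uncorrect      0x0032  046  046  000    Old_age  Always      -      54
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0
"""

SDE_ROWS = """\
  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      0
187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0
"""


def nonzero_watched(text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # A table row has at least ten columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


if __name__ == "__main__":
    print("sdd:", nonzero_watched(SDD_ROWS))
    print("sde:", nonzero_watched(SDE_ROWS))
```

On these rows only sdd is flagged, with Reported_Uncorrect at 54, which matches the "ATA Error Count: 54" line in its error log; sde reports no errors logged.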


reano 11-15-2013 02:24 PM

...continued from previous post...

As you can see from the stats in the three posts above, the device currently named sdb doesn't have the original sdb serial number. It seems the drive that used to be sdf has been renamed to sdb. Bizarre...
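
One way to confirm a rename like this is to pull the "Serial Number:" line out of each smartctl report and compare it with a record of which serial each device name used to have. This is a rough sketch only: the identity lines are copied from the sdd and sde reports in this thread, while the `EXPECTED` inventory is hypothetical (deliberately swapped to show a mismatch) and the helper names are made up for the example.

```python
import re

# Identity lines copied from the sdd and sde reports in this thread.
REPORTS = {
    "sdd": "Device Model:    ST3000VX000-9YW166\nSerial Number:    Z1F0R4EY\n",
    "sde": "Device Model:    ST3000VX000-9YW166\nSerial Number:    Z1F0SMES\n",
}

# Hypothetical inventory, swapped on purpose so the check reports something.
EXPECTED = {"sdd": "Z1F0SMES", "sde": "Z1F0R4EY"}


def serial_of(report):
    """Extract the serial number from smartctl information-section text."""
    match = re.search(r"^Serial Number:\s*(\S+)", report, re.M)
    return match.group(1) if match else None


def renamed_devices(expected, reports):
    """Return {device: (expected_serial, actual_serial)} where they differ."""
    changed = {}
    for device, report in reports.items():
        actual = serial_of(report)
        if expected.get(device) != actual:
            changed[device] = (expected.get(device), actual)
    return changed


if __name__ == "__main__":
    for device, (old, new) in sorted(renamed_devices(EXPECTED, REPORTS).items()):
        print(f"/dev/{device}: expected serial {old}, found {new}")
```

The kernel hands out sdX names in detection order, so they can shift between boots when a drive is slow to appear or drops out. The symlinks under /dev/disk/by-id include the model and serial number and stay attached to the same physical drive.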

