Do I have a bad drive or not?

digdogger · 10-04-2011, 09:10 PM

Hello. I'm currently in the process of building a Linux Software RAID 5 array and am getting allot of conflicting information about whether or not one of the drives is bad.

I have 4 of these drives: Seagate Barracuda LP 1.5 TB (ST31500541AS)

I downloaded and ran the SeaTools (on Windows XP) and ran the "Long Generic" test on all 4 of the drives and they all passed.

I ran:

Code:

badblocks -wvs -o /root/badblocks.txt /dev/sde

And one of the drives did show some badblocks, but upon inspecting the inside of the computer case, the cables had become unplugged. I ran a second badblocks check on the same drive and it passed.

I have also run:

Code:

smartctl -t long /dev/sde

and, after the test finished, I ran:

Code:

smartctl -a /dev/sde

and can see the output:

Code:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       171         -
# 2  Short offline       Completed without error       00%       157         -

Also, in the output of "smartctl -a /dev/sde" is "SMART overall-health self-assessment test result: PASSED", which is a good sign. However, the difference between the output of "smartctl -a /dev/sde" and for other drives is that in the "SMART Error Log Version: 1" section, there are some errors listed, but the output for "smartctl -a /dev/sdd" and for the other drives, says "No Errors Logged". Here is an example of the errors on /dev/sde :

Code:

Error 2855 occurred at disk power-on lifetime: 88 hours (3 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      08:01:03.097  READ DMA EXT
  27 00 00 00 00 00 e0 00      08:01:03.096  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      08:01:03.093  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00      08:01:03.090  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      08:01:03.060  READ NATIVE MAX ADDRESS EXT

I followed along with http://www.linuxhomenetworking.com/w..._Software_RAID to create the RAID array.

So, I have created one MSDOS partition on each of these drives, using up all the space on the drive, so that the output of fdisk -l looks like this for all the drives:

Code:

Disk /dev/sde: 1500.3 GB, 1500301910016 bytes
255 heads, 63 sectors/track, 182401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1               1      182401  1465136001   fd  Linux raid autodetect

I have used this command to create the RAID array:

Code:

mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 --spare-devices=0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

The RAID array is currently building, so I am getting this output from "cat /proc/mdstat" :

Code:

Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 sde1[4] sdd1[2] sdc1[1] sdb1[0]
      4395407808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
      [===>.................]  recovery = 15.0% (221120564/1465135936) finish=805.6min speed=25734K/sec
      
unused devices: <none>

I'm quite confused by all this. Why is sde1 listed as "sde1[4]" instead of "sde1[3]"? Why is it showing "[4/3] [UUU_]" instead of "[4/4] [UUUU]"? If it is detecting a drive (sde) as failed, why does it not show a (F) beside it, as indicated by https://www-304.ibm.com/support/docv...d=isg3T1011259 ?

And I guess the big question is, should I attempt to return this drive, or perhaps I just need to clear that SMART error log and software RAID will allow me to use that drive?

Thanks.

rch · 10-04-2011, 10:17 PM

The _ represent a missing disk- not a failure. What does /var/log/messages say? What does mdadm --detail /dev/md0 return?

digdogger · 10-05-2011, 04:02 PM

Thanks for the reply, rch. The plot thickens, though. The RAID build process has completed, and now the output of "cat /proc/mdstat" is:

Code:

Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 sde1[3] sdd1[2] sdc1[1] sdb1[0]
      4395407808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      
unused devices: <none>

So it looks like everything is ok ... Perhaps I jumped the gun in making this forum posting. To answer your questions:

/var/log/messages (removing everything that doesn't have to do with the RAID build):

Code:

Oct  4 16:34:34 velma kernel: md: bind<sdb1>
Oct  4 16:34:34 velma kernel: md: bind<sdc1>
Oct  4 16:34:34 velma kernel: md: bind<sdd1>
Oct  4 16:34:34 velma kernel: md: bind<sde1>
Oct  4 16:34:34 velma kernel: raid5: device sdd1 operational as raid disk 2
Oct  4 16:34:34 velma kernel: raid5: device sdc1 operational as raid disk 1
Oct  4 16:34:34 velma kernel: raid5: device sdb1 operational as raid disk 0
Oct  4 16:34:34 velma kernel: raid5: allocated 4262kB for md0
Oct  4 16:34:34 velma kernel: raid5: raid level 5 set md0 active with 3 out of 4 devices, algorithm 2
Oct  4 16:34:34 velma kernel: RAID5 conf printout:
Oct  4 16:34:34 velma kernel:  --- rd:4 wd:3 fd:1
Oct  4 16:34:34 velma kernel:  disk 0, o:1, dev:sdb1
Oct  4 16:34:34 velma kernel:  disk 1, o:1, dev:sdc1
Oct  4 16:34:34 velma kernel:  disk 2, o:1, dev:sdd1
Oct  4 16:34:34 velma kernel: RAID5 conf printout:
Oct  4 16:34:34 velma kernel:  --- rd:4 wd:3 fd:1
Oct  4 16:34:34 velma kernel:  disk 0, o:1, dev:sdb1
Oct  4 16:34:34 velma kernel:  disk 1, o:1, dev:sdc1
Oct  4 16:34:34 velma kernel:  disk 2, o:1, dev:sdd1
Oct  4 16:34:34 velma kernel:  disk 3, o:1, dev:sde1
Oct  4 16:34:34 velma kernel: md: syncing RAID array md0
Oct  4 16:34:34 velma kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Oct  4 16:34:34 velma kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
Oct  4 16:34:34 velma kernel: md: using 128k window, over a total of 1465135936 blocks.
Oct  5 08:39:39 velma kernel: md: md0: sync done.
Oct  5 08:39:39 velma kernel: RAID5 conf printout:
Oct  5 08:39:39 velma kernel:  --- rd:4 wd:4 fd:0
Oct  5 08:39:39 velma kernel:  disk 0, o:1, dev:sdb1
Oct  5 08:39:39 velma kernel:  disk 1, o:1, dev:sdc1
Oct  5 08:39:39 velma kernel:  disk 2, o:1, dev:sdd1
Oct  5 08:39:39 velma kernel:  disk 3, o:1, dev:sde1

And "mdadm --detail /dev/md0" :

Code:

/dev/md0:
        Version : 0.90
  Creation Time : Tue Oct  4 16:34:34 2011
     Raid Level : raid5
     Array Size : 4395407808 (4191.79 GiB 4500.90 GB)
  Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Oct  5 08:39:39 2011
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : def18c72:165523b5:6544cc2b:a8748b14
         Events : 0.2

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       8       65        3      active sync   /dev/sde1

I found this: http://blog.ringerc.id.au/2010/04/us...cron-raid.html and I'm *definitely* going to follow the advice on that page and am running:

Code:

echo "check" > /sys/block/md0/md/sync_action

I'll be setting that up as a cron job too, and I highly recommend to anybody who stumbles across this posting to do the same if you have a software RAID setup.

Does anybody know why, in /var/log/messages above, I have "RAID5 conf printout" twice in a row, but with different data below it? Does anybody know the meaning of "--- rd:4 wd:3 fd:1" ?

Thanks.

rch · 10-05-2011, 04:30 PM

It means 3 working , 1 failed. See how it changes to rd:4 wd:4 fd:0. The fd:1 was the missing one.

digdogger · 10-17-2011, 08:03 PM

After a whole bunch of time consuming trial and error, I eventually figured out the issue. A change in OS is what I needed. I was using CentOS 5.7, which is using kernel 2.6.18-274.3.1 as of Oct 17, 2011 . I switched to Fedora 14, which is using kernel 2.6.35.14-97 as of Oct 17, 2011, which I think is what made the difference. After switching to Fedora 14 but still sticking with the same hardware, I was able to build the RAID array, and keep the output of /proc/mdstat saying "[UUUU]" even after running "echo check > /sys/block/md0/md/sync_action".