help determining RAID fail event
I recently received an email stating "A Fail event has been detected on md device /dev/md0". It looks like the array went back to resyncing everything from the good drive, but smartctl doesn't indicate any issues with the supposedly bad drive, so I'm wondering whether this needs to be investigated further or was a false alarm. If it was a false alarm, how can I clear the flag in mdadm that marks the drive as faulty?
Here's some relevant information from when this event was triggered:
Code:
A Fail event has been detected on md device /dev/md0.
Code:
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.208] (local build)
From what I read, you don't have smartd configured to run self-tests on your devices. You aren't the only one; mine was misconfigured not that long ago (<1 year).
That's not good; device self-tests are how you get early notice that something is about to break, or just broke. Take a look at the comments in /etc/smartd.conf to see what's possible. I've got (pretty much) the following in my /etc/smartd.conf: Code:
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03)
Code:
# smartctl -a /dev/sda
EDIT: You can trigger such tests by hand as well; it's just best to have the daemon do that work for you.
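For reference, a sketch of triggering those tests by hand (the device name /dev/sda is an assumption; substitute your own disks):

```shell
# Start a short self-test (runs inside the drive's firmware, ~2 minutes):
smartctl -t short /dev/sda

# Or an extended (long) test, which can take hours on a large drive:
smartctl -t long /dev/sda

# Once the test has had time to finish, review the self-test log:
smartctl -l selftest /dev/sda
```

The tests run in the drive itself, so the commands return immediately; check the log afterwards.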
My experience at work with hardware RAID controllers is that the failure tag won't disappear until the array is fully rebuilt.
That does not address what triggered the original alert, though. The date stamp on the email should help you find where to start looking in the system logs. I agree with Richard Cranium about configuring automated smartd self-tests; I do this with servers at work. I schedule daily short tests and weekly long tests. Another weekly cron job grabs the smartctl output and sends an email. I do likewise with some basic weekly RAID emails. Disclaimer: I am not a RAID guru and don't play one on TV.
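For what it's worth, a hedged sketch of the kind of weekly report job I mean (the device list, the mail(1) command, and a working local MTA are all assumptions; adjust to your system):

```shell
#!/bin/sh
# /etc/cron.weekly/smart-report -- mail the full smartctl output
# for each disk to root, once a week. Assumes mail(1) and a local MTA.
for disk in /dev/sda /dev/sdb /dev/sdc; do
    smartctl -a "$disk"
done | mail -s "Weekly SMART report for $(hostname)" root
```

A similar script wrapping mdadm --detail /dev/md0 covers the RAID side.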
@Richard Cranium, thank you for mentioning smartd.conf as I did not have that configured properly at all to monitor my devices!
@upnort, good idea on matching the email's timestamp against the system logs. I've attached a snippet of them just in case. I've been experimenting with rtcwake to put the system into the "freeze" state until 23:59:59, at which point the system comes out of sleep and, after about ten minutes, begins an rsnapshot backup. I'm wondering if sdc didn't wake up quickly enough for the RAID, so it decided to resync. That said, according to mdadm --detail the array has been rebuilt; however, the drive is still marked faulty... Code:
Apr  4 00:00:11 defiant kernel: [ 5853.713271] Freezing user space processes ... (elapsed 0.001 seconds) done.
Code:
{~}# mdadm --detail /dev/md0
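On the original question of clearing the faulty flag: a sketch, assuming /dev/sdc1 is the member mdadm marked faulty (substitute your actual partition), and only after you're satisfied the drive itself is healthy:

```shell
# Drop the member that mdadm has marked faulty...
mdadm /dev/md0 --remove /dev/sdc1

# ...then add it back; mdadm will resync it into the array:
mdadm /dev/md0 --add /dev/sdc1

# Watch the rebuild progress:
cat /proc/mdstat
```

The faulty flag goes away with the removal; the re-added member shows as "spare rebuilding" until the resync completes.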
Code:
# First ATA/SATA or SCSI/SAS disk. Monitor all attributes, enable
Code:
# HERE IS A LIST OF DIRECTIVES FOR THIS CONFIGURATION FILE.
For every SMART-capable device in the system:
Code:
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03)
PLUS...
EDIT: I forgot to mention that smartd will also log to syslog if something bad happens in a test. If you already have a log-scraping tool that looks for things to alarm on, you can use that to send a warning email. smartd can also be configured to email warnings itself.
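As a sketch of the email side: the -m and -M directives are documented in smartd.conf(5); the recipient address here is an assumption.

```
# /etc/smartd.conf -- mail warnings to root; "-M test" sends one test
# message at daemon startup so you can confirm delivery actually works.
DEVICESCAN -a -o on -S on -n standby,q -m root -M test -s (S/../.././02|L/../../6/03)
```

Restart smartd after editing and check your mailbox for the test message.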
Keep in mind there is a simple
Code:
DEVICESCAN
line already in the stock config file; comment it out. Otherwise, when you put your new DEVICESCAN line at the bottom of the file, the config-file parser doesn't bother to look at it (smartd stops at the first DEVICESCAN it finds). Zero guesses on how I know that tidbit.
One more thing: I've put this into my /etc/rc.d/rc.local (which is more RAID-related than smartd-related)...
Code:
# Increase timeouts for all non-ERC drives.
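For context, here's a hedged sketch of what such an rc.local snippet commonly looks like, following the usual Linux RAID wiki advice on timeout mismatch. The disk list is an assumption, and using smartctl's exit status to detect SCT ERC support is a simplification; adjust to your array members:

```shell
# For each array member: try to set a 7-second error-recovery timeout
# (SCT ERC, in deciseconds) in the drive firmware. If the drive doesn't
# support SCT ERC, raise the kernel's SCSI command timeout instead, so
# the kernel outlasts the drive's internal retries.
for disk in sda sdb sdc; do    # assumed member disks -- adjust
    if smartctl -l scterc,70,70 /dev/$disk >/dev/null 2>&1; then
        echo "SCT ERC set to 7.0s on /dev/$disk"
    else
        echo 180 > /sys/block/$disk/device/timeout
        echo "/dev/$disk lacks SCT ERC; kernel timeout raised to 180s"
    fi
done
```

The point is to avoid a healthy-but-slow drive being kicked out of the array just because the kernel gave up before the drive did.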
To be honest, I'm fairly certain that someone else on the forum mentioned the timeout mismatch; I don't remember who did so or when. While it's possible I ran across it while reading the RAID wiki, the mere fact that this is the first time I've posted anything about it tells me that someone else beat me to the punch. Hopefully they'll show up and tell us when I saw said mention.