04-05-2020, 11:38 AM   #1
dimm0k
Member
 
Registered: May 2008
Location: Brooklyn ZOO
Distribution: Slackware64 14.2
Posts: 545

Rep: Reputation: 54
help determining RAID fail event


I recently received an email stating that "A Fail event has been detected on md device /dev/md0", and it looks like the array went back to resyncing everything from the good drive. However, smartctl does not report any issues with the supposedly bad drive, so I'm wondering whether this needs to be investigated further or whether it was a false alarm. If it was a false alarm, how can I get rid of the flag in mdadm that marks the drive as faulty?

Here's some relevant information from when this event was triggered:
Code:
A Fail event has been detected on md device /dev/md0.

The device /dev/sdc1 may be involved.

Contents of /proc/mdstat:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md0 : active raid1 sdb1[0] sdc1[1](F)
      2930162552 blocks super 1.2 [2/1] [U_]
      [=====>...............]  resync = 28.3% (830933760/2930162552) finish=7399.8min speed=4728K/sec

unused devices: <none>

Contents of mdadm --detail
/dev/md0:
        Version : 1.2
  Creation Time : Tue Aug  2 10:36:53 2011
     Raid Level : raid1
     Array Size : 2930162552 (2794.42 GiB 3000.49 GB)
  Used Dev Size : 2930162552 (2794.42 GiB 3000.49 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Fri Apr  3 00:17:37 2020
          State : active, degraded, resyncing
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

  Resync Status : 99% complete

           Name : defiant:0  (local to host defiant)
           UUID : a043a371:530d4c99:daed879a:904c0e11
         Events : 1382

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      faulty   /dev/sdc1

Contents of dmesg:
[ 6034.544287] ata2.01: configured for UDMA/133
[ 6036.687034] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[ 6038.086733] ata1.00: configured for UDMA/133
[ 6046.837423] PM: resume of devices complete after 13097.208 msecs
[ 6046.861968] Restarting tasks ... done.
[ 6046.905404] md: checkpointing resync of md0.
[ 6046.970037] RAID1 conf printout:
[ 6046.970045]  --- wd:1 rd:2
[ 6046.970053]  disk 0, wo:0, o:1, dev:sdb1
[ 6046.970062]  disk 1, wo:1, o:0, dev:sdc1
Also, here's what smartctl has to say about the drive in question:
Code:
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.208] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 5K3000
Device Model:     Hitachi HDS5C3030ALA630
Serial Number:    MJ1321YNG17PEA
LU WWN Device Id: 5 000cca 228c0913e
Firmware Version: MEAOA580
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5700 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Apr  5 12:30:53 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(38166) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 636) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       109
  3 Spin_Up_Time            0x0007   220   220   024    Pre-fail  Always       -       273 (Average 362)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       86
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   132   132   020    Pre-fail  Offline      -       32
  9 Power_On_Hours          0x0012   092   092   000    Old_age   Always       -       58611
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       48
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       1817
193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       1817
194 Temperature_Celsius     0x0002   193   193   000    Old_age   Always       -       31 (Min/Max 19/48)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
04-05-2020, 05:14 PM   #2
Richard Cranium
Senior Member
 
Registered: Apr 2009
Location: Carrollton, Texas
Distribution: Slackware64 14.2
Posts: 3,714

Rep: Reputation: 2060
From what I read, you don't have smartd configured to run self-tests on your devices. You aren't the only one; mine was misconfigured not that long ago (less than a year).

That's not good; device self-tests are how you are given early notice that something's about to break or just broke. Take a look at the comments in /etc/smartd.conf to see what's possible.

I've got (pretty much) the following in my /etc/smartd.conf:
Code:
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03)
When I look at my devices, I'll see that tests have been run...

Code:
# smartctl -a /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.217] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Constellation ES (SATA 6Gb/s)
Device Model:     ST1000NM0011
Serial Number:    Z1N4CMG8
LU WWN Device Id: 5 000c50 0640164d4
Firmware Version: SN03
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7202 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Apr  5 17:12:48 2020 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  600) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 151) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x10bd)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   063   044    Pre-fail  Always       -       203399523
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       296
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       43
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       512552029
  9 Power_On_Hours          0x0032   037   037   000    Old_age   Always       -       55417
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       296
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       3
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   059   045   045    Old_age   Always   In_the_past 41 (Min/Max 37/42)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       122
193 Load_Cycle_Count        0x0032   097   097   000    Old_age   Always       -       6257
194 Temperature_Celsius     0x0022   041   055   000    Old_age   Always       -       41 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   119   099   000    Old_age   Always       -       203399523
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     55401         -
# 2  Extended offline    Completed without error       00%     55382         -
# 3  Short offline       Completed without error       00%     55377         -
# 4  Short offline       Completed without error       00%     55353         -
# 5  Short offline       Completed without error       00%     55329         -
# 6  Short offline       Completed without error       00%     55305         -
# 7  Short offline       Completed without error       00%     55281         -
# 8  Short offline       Completed without error       00%     55257         -
# 9  Short offline       Completed without error       00%     55233         -
#10  Extended offline    Completed without error       00%     55213         -
#11  Short offline       Completed without error       00%     55209         -
#12  Short offline       Completed without error       00%     55185         -
#13  Short offline       Completed without error       00%     55161         -
#14  Short offline       Completed without error       00%     55138         -
#15  Short offline       Completed without error       00%     55114         -
#16  Short offline       Completed without error       00%     55090         -
#17  Short offline       Completed without error       00%     55066         -
#18  Extended offline    Completed without error       00%     55046         -
#19  Short offline       Completed without error       00%     55042         -
#20  Short offline       Completed without error       00%     55018         -
#21  Short offline       Completed without error       00%     54994         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Notice the "SMART Self-test log structure revision number 1" portion of the report!

EDIT: You can trigger such tests by hand as well, but it's best to have the daemon do that work for you.

Last edited by Richard Cranium; 04-05-2020 at 05:15 PM.
 
04-05-2020, 06:14 PM   #3
upnort
Senior Member
 
Registered: Oct 2014
Distribution: Slackware, Proxmox, Debian, CentOS
Posts: 1,637

Rep: Reputation: 971
My experience at work with hardware RAID controllers is the failure tag won't disappear until the array is fully rebuilt.

That does not address what triggered the original alert. The date stamp on the email should show where to start looking in the system logs.

I agree with Richard Cranium to configure cron with automated smartd self tests. I do this with servers at work. I scheduled daily short tests and weekly long tests. Another weekly cron job grabs the smartctl output and sends an email.

I do likewise with some basic weekly RAID emails.
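
For anyone who wants to set up something similar, here is a rough sketch of the kind of check a weekly cron job could run. It is only an illustration, not the script used at work: the function names parse_mdstat and problems are made up for this example, and the parsing assumes /proc/mdstat looks like the output posted earlier in this thread. Cron normally mails whatever a job prints, so the script only prints when something looks wrong.

```python
import re
from pathlib import Path

# Patterns based on the /proc/mdstat layout shown earlier in this thread.
ARRAY_RE = re.compile(r"^(md\S*)\s*:\s*(.*)$")            # "md0 : active raid1 ..."
MEMBER_RE = re.compile(r"(\S+?)\[\d+\](\([A-Z]\))?")      # "sdc1[1](F)"
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")  # "[2/1] [U_]"


def parse_mdstat(text):
    """Return {array_name: {"failed", "expected", "active", "map"}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        m = ARRAY_RE.match(line)
        if m:
            name, rest = m.groups()
            # Members carrying the (F) flag are the ones md marked as faulty.
            failed = [dev for dev, flag in MEMBER_RE.findall(rest) if flag == "(F)"]
            current = {"failed": failed, "expected": None, "active": None, "map": None}
            arrays[name] = current
            continue
        s = STATUS_RE.search(line)
        if s and current is not None:
            current["expected"] = int(s.group(1))
            current["active"] = int(s.group(2))
            current["map"] = s.group(3)
    return arrays


def problems(arrays):
    """Return one line per array that is degraded or has a failed member."""
    lines = []
    for name, info in arrays.items():
        degraded = info["map"] is not None and "_" in info["map"]
        if degraded or info["failed"]:
            failed = ",".join(info["failed"]) or "none"
            lines.append(f"{name}: map={info['map']} failed={failed}")
    return lines


if __name__ == "__main__":
    mdstat = Path("/proc/mdstat")
    if mdstat.exists():
        for line in problems(parse_mdstat(mdstat.read_text())):
            print(line)
```

Fed the mdstat contents from the first post, this reports md0 with map U_ and sdc1 as the failed member; a healthy [UU] array produces no output at all.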

Disclaimer: I am not a RAID guru and don't play one on TV.
 
04-05-2020, 06:22 PM   #4
Richard Cranium
Senior Member
 
Registered: Apr 2009
Location: Carrollton, Texas
Distribution: Slackware64 14.2
Posts: 3,714

Rep: Reputation: 2060
Quote:
Originally Posted by upnort
I agree with Richard Cranium to configure cron with automated smartd self tests. I do this with servers at work. I scheduled daily short tests and weekly long tests. Another weekly cron job grabs the smartctl output and sends an email.
Actually, smartd will do its own scheduling; it may or may not use cron internally (I honestly haven't bothered to look) and you most certainly can configure smartd to email you on its own. (I left that bit out of the DEVICESCAN string that I provided.)
 
04-05-2020, 06:51 PM   #5
bassmadrigal
LQ Guru
 
Registered: Nov 2003
Location: West Jordan, UT, USA
Distribution: Slackware
Posts: 7,240

Rep: Reputation: 4932
Quote:
Originally Posted by Richard Cranium
I've got (pretty much) the following in my /etc/smartd.conf:
Code:
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03)
I'll admit that my brain is a little fuzzed out right now and I'm struggling to make sense of the conf file and your line. Would you mind breaking down what your DEVICESCAN line is doing? If not, I can dig through things a bit more once my mind is in a better place.
 
04-05-2020, 08:55 PM   #6
dimm0k
Member
 
Registered: May 2008
Location: Brooklyn ZOO
Distribution: Slackware64 14.2
Posts: 545

Original Poster
Rep: Reputation: 54
@Richard Cranium, thank you for mentioning smartd.conf as I did not have that configured properly at all to monitor my devices!

@upnort, good idea on matching the email's timestamp against the system logs. I've attached a snippet of the log below. I've been experimenting with rtcwake to put the system into the "freeze" state until 23:59:59, at which point the system comes out of sleep and, after about 10 minutes, begins an rsnapshot backup. I'm wondering whether sdc didn't wake up quickly enough for the RAID, so it decided to resync. That said, according to mdadm --detail the RAID has been rebuilt, but the drive is still marked as faulty...

Code:
Apr  4 00:00:11 defiant kernel: [ 5853.713271] Freezing user space processes ... (elapsed 0.001 seconds) done.
Apr  4 00:00:11 defiant kernel: [ 5853.714589] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Apr  4 00:00:11 defiant kernel: [ 5853.717291] sd 1:0:1:0: [sdc] Synchronizing SCSI cache
Apr  4 00:00:11 defiant kernel: [ 5853.717463] parport_pc 00:04: disabled
Apr  4 00:00:11 defiant kernel: [ 5853.717842] serial 00:03: disabled
Apr  4 00:00:11 defiant kernel: [ 5853.718304] serial 00:02: disabled
Apr  4 00:00:11 defiant kernel: [ 5853.718705] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
Apr  4 00:00:11 defiant kernel: [ 5853.718853] sd 0:0:0:0: [sda] Synchronizing SCSI cache
Apr  4 00:00:11 defiant kernel: [ 5853.718995] sd 0:0:0:0: [sda] Stopping disk
Apr  4 00:00:11 defiant kernel: [ 5853.719313] e1000e: EEE TX LPI TIMER: 00000000
Apr  4 00:00:11 defiant kernel: [ 5853.719339] e1000e: EEE TX LPI TIMER: 00000000
Apr  4 00:00:11 defiant kernel: [ 5853.729423] sd 1:0:1:0: [sdc] Stopping disk
Apr  4 00:00:11 defiant kernel: [ 5853.729546] sd 1:0:0:0: [sdb] Stopping disk
Apr  4 00:00:11 defiant kernel: [ 5858.063093] sd 1:0:1:0: [sdc] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Apr  4 00:00:11 defiant kernel: [ 5858.063100] sd 1:0:1:0: [sdc] tag#0 Sense Key : 0xb [current] [descriptor] 
Apr  4 00:00:11 defiant kernel: [ 5858.063105] sd 1:0:1:0: [sdc] tag#0 ASC=0x47 ASCQ=0x0 
Apr  4 00:00:11 defiant kernel: [ 5858.063112] sd 1:0:1:0: [sdc] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 00 63 0e 33 80 00 00 05 80 00 00
Apr  4 00:00:11 defiant kernel: [ 5858.063223] md: md0: resync interrupted.
Apr  4 00:00:11 defiant kernel: [ 5858.074053] PM: suspend of devices complete after 4357.854 msecs
Apr  4 00:00:11 defiant kernel: [ 5858.085051] PM: late suspend of devices complete after 10.986 msecs
Apr  4 00:00:11 defiant kernel: [ 5858.086220] pcieport 0000:00:1c.4: System wakeup enabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 5858.086307] pcieport 0000:00:1c.2: System wakeup enabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 5858.086479] uhci_hcd 0000:00:1d.2: System wakeup enabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 5858.086485] ehci-pci 0000:00:1d.7: System wakeup enabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 5858.086605] uhci_hcd 0000:00:1d.1: System wakeup enabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 5858.086635] uhci_hcd 0000:00:1d.0: System wakeup enabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 5858.086787] uhci_hcd 0000:00:1a.2: System wakeup enabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 5858.086794] ehci-pci 0000:00:1a.7: System wakeup enabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 5858.086862] uhci_hcd 0000:00:1a.1: System wakeup enabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 5858.086906] uhci_hcd 0000:00:1a.0: System wakeup enabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 5858.097453] PM: noirq suspend of devices complete after 12.365 msecs
Apr  4 00:00:11 defiant kernel: [ 5920.977004] Task dump for CPU 1:
Apr  4 00:00:11 defiant kernel: [ 5920.977004] swapper/1       R  running task        0     0      1 0x00200000
Apr  4 00:00:11 defiant kernel: [ 5920.977004] Task dump for CPU 2:
Apr  4 00:00:11 defiant kernel: [ 5920.977004] swapper/2       R  running task        0     0      1 0x00200000
Apr  4 00:00:11 defiant kernel: [ 5980.979004] Task dump for CPU 1:
Apr  4 00:00:11 defiant kernel: [ 5980.979004] swapper/1       R  running task        0     0      1 0x00200000
Apr  4 00:00:11 defiant kernel: [ 5980.979004] Task dump for CPU 2:
Apr  4 00:00:11 defiant kernel: [ 5980.979004] swapper/2       R  running task        0     0      1 0x00200000
Apr  4 00:00:11 defiant kernel: [ 6033.716048] sd 1:0:0:0: [sdb] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Apr  4 00:00:11 defiant kernel: [ 6033.716055] sd 1:0:0:0: [sdb] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 63 0e 3e 00 00 00 04 00 00 00
Apr  4 00:00:11 defiant kernel: [ 6033.722081] uhci_hcd 0000:00:1a.0: System wakeup disabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 6033.722162] uhci_hcd 0000:00:1a.1: System wakeup disabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 6033.722233] uhci_hcd 0000:00:1a.2: System wakeup disabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 6033.722575] uhci_hcd 0000:00:1d.0: System wakeup disabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 6033.722664] uhci_hcd 0000:00:1d.1: System wakeup disabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 6033.722761] uhci_hcd 0000:00:1d.2: System wakeup disabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 6033.733456] pcieport 0000:00:1c.4: System wakeup disabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 6033.733522] ehci-pci 0000:00:1d.7: System wakeup disabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 6033.733625] ehci-pci 0000:00:1a.7: System wakeup disabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 6033.733898] PM: noirq resume of devices complete after 12.017 msecs
Apr  4 00:00:11 defiant kernel: [ 6033.740204] PM: early resume of devices complete after 6.220 msecs
Apr  4 00:00:11 defiant kernel: [ 6033.740634] rtc_cmos 00:01: System wakeup disabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 6033.740650] pcieport 0000:00:1c.2: System wakeup disabled by ACPI
Apr  4 00:00:11 defiant kernel: [ 6033.743172] serial 00:02: activated
Apr  4 00:00:11 defiant kernel: [ 6033.745655] serial 00:03: activated
Apr  4 00:00:11 defiant kernel: [ 6033.753640] parport_pc 00:04: activated
Apr  4 00:00:11 defiant kernel: [ 6033.817883] sd 0:0:0:0: [sda] Starting disk
Apr  4 00:00:11 defiant kernel: [ 6033.817885] sd 1:0:0:0: [sdb] Starting disk
Apr  4 00:00:11 defiant kernel: [ 6033.817918] sd 1:0:1:0: [sdc] Starting disk
Apr  4 00:00:11 defiant kernel: [ 6034.068735] ata3: SATA link down (SStatus 0 SControl 300)
Apr  4 00:00:11 defiant kernel: [ 6034.079464] ata4: SATA link down (SStatus 0 SControl 300)
Apr  4 00:00:11 defiant kernel: [ 6034.520073] ata2.00: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr  4 00:00:11 defiant kernel: [ 6034.520086] ata2.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr  4 00:00:11 defiant kernel: [ 6034.523151] ata2.01: ACPI cmd ef/03:45:00:00:00:b0 (SET FEATURES) filtered out
Apr  4 00:00:11 defiant kernel: [ 6034.523156] ata2.01: ACPI cmd ef/03:0c:00:00:00:b0 (SET FEATURES) filtered out
Apr  4 00:00:11 defiant kernel: [ 6034.523319] ata2.01: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Apr  4 00:00:11 defiant kernel: [ 6034.524070] ata1.00: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr  4 00:00:11 defiant kernel: [ 6034.524082] ata1.01: SATA link down (SStatus 0 SControl 300)
Apr  4 00:00:11 defiant kernel: [ 6034.527159] ata1.00: ACPI cmd ef/03:45:00:00:00:a0 (SET FEATURES) filtered out
Apr  4 00:00:11 defiant kernel: [ 6034.527164] ata1.00: ACPI cmd ef/03:0c:00:00:00:a0 (SET FEATURES) filtered out
Apr  4 00:00:11 defiant kernel: [ 6034.529150] ata2.00: ACPI cmd ef/03:45:00:00:00:a0 (SET FEATURES) filtered out
Apr  4 00:00:11 defiant kernel: [ 6034.529155] ata2.00: ACPI cmd ef/03:0c:00:00:00:a0 (SET FEATURES) filtered out
Apr  4 00:00:11 defiant kernel: [ 6034.529310] ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Apr  4 00:00:11 defiant kernel: [ 6034.529381] ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Apr  4 00:00:11 defiant kernel: [ 6034.538288] ata2.00: configured for UDMA/133
Apr  4 00:00:11 defiant kernel: [ 6034.544287] ata2.01: configured for UDMA/133
Apr  4 00:00:11 defiant kernel: [ 6036.687034] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Apr  4 00:00:11 defiant kernel: [ 6038.086733] ata1.00: configured for UDMA/133
Apr  4 00:00:11 defiant kernel: [ 6046.837423] PM: resume of devices complete after 13097.208 msecs
Apr  4 00:00:11 defiant kernel: [ 6046.861968] Restarting tasks ... done.
Apr  4 00:00:11 defiant kernel: [ 6046.905404] md: checkpointing resync of md0.
Apr  4 00:00:11 defiant kernel: [ 6046.975363] md: resync of RAID array md0
Apr  4 00:00:11 defiant kernel: [ 6046.975370] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Apr  4 00:00:11 defiant kernel: [ 6046.975374] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
Apr  4 00:00:11 defiant kernel: [ 6046.975381] md: using 128k window, over a total of 2930162552k.
Apr  4 00:00:11 defiant kernel: [ 6046.975385] md: resuming resync of md0 from checkpoint.
Apr  4 00:00:11 defiant kernel: [ 6046.975719] md: md0: resync done.
Apr  5 00:00:25 defiant kernel: [ 6912.543143] PM: Syncing filesystems ... done.
Apr  5 00:00:25 defiant kernel: [ 6913.061333] Freezing user space processes ... (elapsed 0.001 seconds) done.
Apr  5 00:00:25 defiant kernel: [ 6913.062642] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Apr  5 00:00:25 defiant kernel: [ 6913.065267] parport_pc 00:04: disabled
Apr  5 00:00:25 defiant kernel: [ 6913.065421] sd 1:0:1:0: [sdc] Synchronizing SCSI cache
Apr  5 00:00:25 defiant kernel: [ 6913.065585] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
Apr  5 00:00:25 defiant kernel: [ 6913.065616] sd 1:0:1:0: [sdc] Stopping disk
Apr  5 00:00:25 defiant kernel: [ 6913.065795] sd 0:0:0:0: [sda] Synchronizing SCSI cache
Apr  5 00:00:25 defiant kernel: [ 6913.065807] sd 1:0:0:0: [sdb] Stopping disk
Apr  5 00:00:25 defiant kernel: [ 6913.065825] serial 00:03: disabled
Apr  5 00:00:25 defiant kernel: [ 6913.065964] sd 0:0:0:0: [sda] Stopping disk
Apr  5 00:00:25 defiant kernel: [ 6913.066412] serial 00:02: disabled
Apr  5 00:00:25 defiant kernel: [ 6913.066499] e1000e: EEE TX LPI TIMER: 00000000
Apr  5 00:00:25 defiant kernel: [ 6913.066522] e1000e: EEE TX LPI TIMER: 00000000
Apr  5 00:00:25 defiant kernel: [ 6914.455057] PM: suspend of devices complete after 1390.900 msecs
Apr  5 00:00:25 defiant kernel: [ 6914.466057] PM: late suspend of devices complete after 10.988 msecs
Apr  5 00:00:25 defiant kernel: [ 6914.466957] pcieport 0000:00:1c.4: System wakeup enabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.467380] uhci_hcd 0000:00:1d.2: System wakeup enabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.467382] ehci-pci 0000:00:1d.7: System wakeup enabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.467459] uhci_hcd 0000:00:1d.1: System wakeup enabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.467512] uhci_hcd 0000:00:1d.0: System wakeup enabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.467514] pcieport 0000:00:1c.2: System wakeup enabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.467634] ehci-pci 0000:00:1a.7: System wakeup enabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.467689] uhci_hcd 0000:00:1a.2: System wakeup enabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.467745] uhci_hcd 0000:00:1a.1: System wakeup enabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.467787] uhci_hcd 0000:00:1a.0: System wakeup enabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.478256] PM: noirq suspend of devices complete after 12.166 msecs
Apr  5 00:00:25 defiant kernel: [ 6914.479261] uhci_hcd 0000:00:1a.0: System wakeup disabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.479261] uhci_hcd 0000:00:1a.1: System wakeup disabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.479261] uhci_hcd 0000:00:1a.2: System wakeup disabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.479261] uhci_hcd 0000:00:1d.0: System wakeup disabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.479261] uhci_hcd 0000:00:1d.1: System wakeup disabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.479261] uhci_hcd 0000:00:1d.2: System wakeup disabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.489155] ehci-pci 0000:00:1d.7: System wakeup disabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.489159] ehci-pci 0000:00:1a.7: System wakeup disabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.490424] pcieport 0000:00:1c.4: System wakeup disabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.490655] PM: noirq resume of devices complete after 12.303 msecs
Apr  5 00:00:25 defiant kernel: [ 6914.491450] PM: early resume of devices complete after 0.685 msecs
Apr  5 00:00:25 defiant kernel: [ 6914.492071] pcieport 0000:00:1c.2: System wakeup disabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.497160] rtc_cmos 00:01: System wakeup disabled by ACPI
Apr  5 00:00:25 defiant kernel: [ 6914.503622] serial 00:02: activated
Apr  5 00:00:25 defiant kernel: [ 6914.510152] serial 00:03: activated
Apr  5 00:00:25 defiant kernel: [ 6914.512565] parport_pc 00:04: activated
Apr  5 00:00:25 defiant kernel: [ 6914.565401] sd 0:0:0:0: [sda] Starting disk
Apr  5 00:00:25 defiant kernel: [ 6914.565403] sd 1:0:0:0: [sdb] Starting disk
Apr  5 00:00:25 defiant kernel: [ 6914.565440] sd 1:0:1:0: [sdc] Starting disk
Apr  5 00:00:25 defiant kernel: [ 6914.816723] ata4: SATA link down (SStatus 0 SControl 300)
Apr  5 00:00:25 defiant kernel: [ 6914.827459] ata3: SATA link down (SStatus 0 SControl 300)
Apr  5 00:00:25 defiant kernel: [ 6915.271076] ata1.00: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr  5 00:00:25 defiant kernel: [ 6915.271086] ata1.01: SATA link down (SStatus 0 SControl 300)
Apr  5 00:00:25 defiant kernel: [ 6915.273074] ata2.00: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr  5 00:00:25 defiant kernel: [ 6915.273087] ata2.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr  5 00:00:25 defiant kernel: [ 6915.274152] ata1.00: ACPI cmd ef/03:45:00:00:00:a0 (SET FEATURES) filtered out
Apr  5 00:00:25 defiant kernel: [ 6915.274157] ata1.00: ACPI cmd ef/03:0c:00:00:00:a0 (SET FEATURES) filtered out
Apr  5 00:00:25 defiant kernel: [ 6915.276145] ata2.01: ACPI cmd ef/03:45:00:00:00:b0 (SET FEATURES) filtered out
Apr  5 00:00:25 defiant kernel: [ 6915.276150] ata2.01: ACPI cmd ef/03:0c:00:00:00:b0 (SET FEATURES) filtered out
Apr  5 00:00:25 defiant kernel: [ 6915.276323] ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Apr  5 00:00:25 defiant kernel: [ 6915.276399] ata2.01: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Apr  5 00:00:25 defiant kernel: [ 6915.282146] ata2.00: ACPI cmd ef/03:45:00:00:00:a0 (SET FEATURES) filtered out
Apr  5 00:00:25 defiant kernel: [ 6915.282151] ata2.00: ACPI cmd ef/03:0c:00:00:00:a0 (SET FEATURES) filtered out
Apr  5 00:00:25 defiant kernel: [ 6915.282310] ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Apr  5 00:00:25 defiant kernel: [ 6915.291309] ata2.00: configured for UDMA/133
Apr  5 00:00:25 defiant kernel: [ 6915.297302] ata2.01: configured for UDMA/133
Apr  5 00:00:25 defiant kernel: [ 6917.413041] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Apr  5 00:00:25 defiant kernel: [ 6918.709566] ata1.00: configured for UDMA/133
Apr  5 00:00:25 defiant kernel: [ 6940.057352] PM: resume of devices complete after 25565.892 msecs
Apr  5 00:00:25 defiant kernel: [ 6940.069055] Restarting tasks ... done.
mdadm --detail
Code:
{~}# mdadm --detail /dev/md0                                                     
/dev/md0:
        Version : 1.2
  Creation Time : Tue Aug  2 10:36:53 2011
     Raid Level : raid1
     Array Size : 2930162552 (2794.42 GiB 3000.49 GB)
  Used Dev Size : 2930162552 (2794.42 GiB 3000.49 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Sun Apr  5 21:11:27 2020
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : defiant:0  (local to host defiant)
           UUID : a043a371:530d4c99:daed879a:904c0e11
         Events : 2181

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       2       0        0        2      removed

       1       8       33        -      faulty   /dev/sdc1
 
Old 04-06-2020, 02:49 AM   #7
Richard Cranium
Senior Member
 
Registered: Apr 2009
Location: Carrollton, Texas
Distribution: Slackware64 14.2
Posts: 3,714

Rep: Reputation: 2060
Quote:
Originally Posted by bassmadrigal View Post
I'll admit that my brain is a little fuzzed out right now and I'm struggling to make sense of the conf file and your line. Would you mind breaking down what your DEVICESCAN line is doing? If not, I can dig through things a bit more once my mind is in a better place.
One of the comments in /etc/smartd.conf is ...

Code:
# First ATA/SATA or SCSI/SAS disk.  Monitor all attributes, enable
# automatic online data collection, automatic Attribute autosave, and
# start a short self-test every day between 2-3am, and a long self test
# Saturdays between 3-4am.
#/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03)
Further down, there's...

Code:
# HERE IS A LIST OF DIRECTIVES FOR THIS CONFIGURATION FILE.
# PLEASE SEE THE smartd.conf MAN PAGE FOR DETAILS
#
#   -d TYPE Set the device type: ata, scsi, marvell, removable, 3ware,N, hpt,L/M/N
#   -T TYPE set the tolerance to one of: normal, permissive
#   -o VAL  Enable/disable automatic offline tests (on/off)
#   -S VAL  Enable/disable attribute autosave (on/off)
#   -n MODE No check. MODE is one of: never, sleep, standby, idle
#   -H      Monitor SMART Health Status, report if failed
#   -l TYPE Monitor SMART log.  Type is one of: error, selftest
#   -f      Monitor for failure of any 'Usage' Attributes
#   -m ADD  Send warning email to ADD for -H, -l error, -l selftest, and -f
#   -M TYPE Modify email warning behavior (see man page)
#   -s REGE Start self-test when type/date matches regular expression (see man page)
#   -p      Report changes in 'Prefailure' Normalized Attributes
#   -u      Report changes in 'Usage' Normalized Attributes
#   -t      Equivalent to -p and -u Directives
#   -r ID   Also report Raw values of Attribute ID with -p, -u or -t
#   -R ID   Track changes in Attribute ID Raw value with -p, -u or -t
#   -i ID   Ignore Attribute ID for -f Directive
#   -I ID   Ignore Attribute ID for -p, -u or -t Directive
#   -C ID   Report if Current Pending Sector count non-zero
#   -U ID   Report if Offline Uncorrectable count non-zero
#   -W D,I,C Monitor Temperature D)ifference, I)nformal limit, C)ritical limit
#   -v N,ST Modifies labeling of Attribute N (see man page)
#   -a      Default: equivalent to -H -f -t -l error -l selftest -C 197 -U 198
#   -F TYPE Use firmware bug workaround. Type is one of: none, samsung
#   -P TYPE Drive-specific presets: use, ignore, show, showall
#    #      Comment: text after a hash sign is ignored
#    \      Line continuation character
So, all put together, this...
Code:
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03)
...means

For every SMART-capable device in the system:
  • Monitor SMART Health Status, report if failed
  • Monitor for failure of any 'Usage' Attributes
  • Report changes in 'Prefailure' Normalized Attributes
  • Report changes in 'Usage' Normalized Attributes
  • Monitor the error SMART log
  • Monitor the selftest SMART log
  • Report if Current Pending Sector count non-zero
  • Report if Offline Uncorrectable count non-zero
(all of that is what -a breaks down to)
PLUS...
  • Enable automatic offline tests (i.e., -o on)
  • Enable attribute autosave (i.e., -S on)
  • Don't spin up a drive that is in standby just to check it, and don't bother to tell me that you skipped the check because of this (i.e., -n standby,q)
  • Run a short self-test between 2-3am every day and a long self-test every Saturday between 3-4am (i.e., -s (S/../.././02|L/../../6/03))
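
The -s schedule in that line can be checked by hand. As I read the smartd.conf man page, smartd builds a 12-character string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1=Monday through 7=Sunday, hour) and starts a test when the whole string matches the regular expression. Here is a minimal sketch of that matching using grep. The matches_schedule helper is a made-up name for illustration and is not part of smartmontools.

```shell
#!/bin/sh
# The schedule regex from the DEVICESCAN line above.
schedule='(S/../.././02|L/../../6/03)'

# matches_schedule TYPE MM DD DOW HH
# Hypothetical helper: succeeds if the T/MM/DD/d/HH string built from the
# arguments matches the whole schedule regex.
matches_schedule() {
    printf '%s/%s/%s/%s/%s\n' "$1" "$2" "$3" "$4" "$5" | grep -Eqx "$schedule"
}

# Monday 6 April, 02:xx: the short test is due.
matches_schedule S 04 06 1 02 && echo "short test due"
# Saturday 11 April, 03:xx: the long test is due.
matches_schedule L 04 11 6 03 && echo "long test due"
# Monday 6 April, 03:xx: the long test is not due.
matches_schedule L 04 06 1 03 || echo "no long test on a Monday"
```

The dots are ordinary regex wildcards, which is why "S/../.././02" means "any month, any day, any weekday, hour 02".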

EDIT: I forgot to mention that smartd will also log to syslog if something bad happened in a test. If you already have a log-scraping tool that looks for things to alarm on, then you can use that to send a warning email. smartd can also be configured to email warnings itself (the -m directive in the list above).

Last edited by Richard Cranium; 04-06-2020 at 02:53 AM.
 
2 members found this post helpful.
Old 04-06-2020, 10:34 AM   #8
bassmadrigal
LQ Guru
 
Registered: Nov 2003
Location: West Jordan, UT, USA
Distribution: Slackware
Posts: 7,240

Rep: Reputation: 4932
Quote:
Originally Posted by Richard Cranium View Post
One of the comments in /etc/smartd.conf is ...

Awesome! That was really in depth and much easier to read than the conf file when I was looking at it yesterday. Thanks! I'll likely get this implemented when I get home tonight.
 
Old 04-06-2020, 10:43 AM   #9
Richard Cranium
Senior Member
 
Registered: Apr 2009
Location: Carrollton, Texas
Distribution: Slackware64 14.2
Posts: 3,714

Rep: Reputation: 2060
Keep in mind there is a simple
Code:
DEVICESCAN
line near the top of the config file by default. Update that one or comment it out.
Otherwise, when you put your new DEVICESCAN line at the bottom of the file, it is never read: smartd ignores everything after the first DEVICESCAN line it finds. Zero guesses on how I know that tidbit.
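
A quick way to check which DEVICESCAN line is actually in effect is to list the uncommented ones with their line numbers. This is only a sketch: list_devicescan and count_devicescan are made-up helper names, and the default path is the /etc/smartd.conf mentioned earlier in the thread.

```shell
#!/bin/sh
# list_devicescan FILE
# Print every uncommented DEVICESCAN line in FILE, prefixed with its line number.
list_devicescan() {
    grep -n '^[[:space:]]*DEVICESCAN' "$1"
}

# count_devicescan FILE
# Print how many uncommented DEVICESCAN lines FILE contains.
count_devicescan() {
    grep -c '^[[:space:]]*DEVICESCAN' "$1"
}

conf=${1:-/etc/smartd.conf}
if [ -r "$conf" ]; then
    echo "DEVICESCAN line in effect:"
    list_devicescan "$conf" | head -n 1
    count=$(count_devicescan "$conf")
    if [ "$count" -gt 1 ]; then
        echo "warning: $count DEVICESCAN lines found; only the first is used"
    fi
fi
```

If the warning shows up, either edit the first line or comment it out so the one you wrote becomes the first.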
 
1 members found this post helpful.
Old 04-06-2020, 12:27 PM   #10
upnort
Senior Member
 
Registered: Oct 2014
Distribution: Slackware, Proxmox, Debian, CentOS
Posts: 1,637

Rep: Reputation: 971
Quote:
I'm wondering if the sdc didn't wake up quick enough for the raid that it decided to resync.
Good point. I don't know. While Linux software RAID has been well tested for a couple of decades, suspend might be outside the scope of the design. Primarily RAID targets systems running 24/7 -- business continuity. Something that suspends, like a laptop or home server that is powered down nightly, might not be an expected use case for software RAID. Might want to poke around the web.

Quote:
however the drive is still in fault mode
The state is listed as clean, degraded. Look into how to remove the degraded state.
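
One common way to clear the faulty flag, assuming the drive really is healthy, is to remove the member from the array and add it back so md rebuilds it from the good drive. A hedged sketch follows, using the device names from the mdadm --detail output above. The readd_member wrapper and the DRY_RUN variable are made up for illustration; by default the script only prints the mdadm commands instead of running them.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to actually run them (as root).
DRY_RUN=${DRY_RUN:-1}

# run CMD [ARGS...]
# Echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# readd_member ARRAY MEMBER
# Hypothetical wrapper: drop a member that is flagged faulty, then add it
# back so md rebuilds it from the remaining good drive.
readd_member() {
    run mdadm "$1" --remove "$2" &&
    run mdadm "$1" --add "$2"
}

readd_member /dev/md0 /dev/sdc1
```

Expect a full rebuild after the add unless the array has a write-intent bitmap; /proc/mdstat shows the progress. If the drive drops out again, treat it as a real fault rather than a false alarm.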
 
1 members found this post helpful.
Old 04-06-2020, 01:57 PM   #11
dimm0k
Member
 
Registered: May 2008
Location: Brooklyn ZOO
Distribution: Slackware64 14.2
Posts: 545

Original Poster
Rep: Reputation: 54
Quote:
Originally Posted by upnort View Post
Good point. I don't know. While Linux software RAID has been well tested for a couple of decades, suspend might be outside the scope of the design. Primarily RAID targets systems running 24/7 -- business continuity. Something that suspends, like a laptop or home server that is powered down nightly, might not be an expected use case for software RAID. Might want to poke around the web.
what you said definitely makes sense! thank you, I'll poke around some more to see if anyone has any info on this!

Quote:
The state is listed as clean, degraded. Look into how to remove the degraded state.
working on that now!
 
Old 04-06-2020, 07:56 PM   #12
Richard Cranium
Senior Member
 
Registered: Apr 2009
Location: Carrollton, Texas
Distribution: Slackware64 14.2
Posts: 3,714

Rep: Reputation: 2060
One more thing, I've put this into my /etc/rc.d/rc.local (which is more RAID related than smartd related)...
Code:
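# Increase timeouts for all non-ERC drives.
# see https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
for i in /dev/sd? ; do
    # Try to set the drive's error recovery limit to 7 seconds (70 deciseconds).
    if smartctl -l scterc,70,70 "${i}" > /dev/null ; then
        printf '%s is good ' "${i}"
    else
        # No SCT ERC support: raise the kernel's command timeout to 180 seconds.
        echo 180 > "/sys/block/${i#/dev/}/device/timeout"
        printf '%s is  bad ' "${i}"
    fi
    smartctl -i "${i}" | grep -E "(Device Model|Product:)"
    # Set read-ahead to 1024 sectors.
    blockdev --setra 1024 "${i}"
done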
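The link in the code block explains the issue: a drive without error recovery control can spend longer retrying a bad sector than the kernel's default 30-second command timeout, at which point the kernel resets the drive and md may kick it out of the array.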
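
To see what the loop changed, the kernel's per-device command timeout can be read back from sysfs. This is a small read-only sketch; show_timeouts and its optional sysfs root argument are made up here so the function can also be pointed at a test directory.

```shell
#!/bin/sh
# show_timeouts [SYSROOT]
# Print the command timeout in seconds for every sd? block device.
# SYSROOT defaults to /sys/block.
show_timeouts() {
    root=${1:-/sys/block}
    for f in "$root"/sd?/device/timeout; do
        # Skip the unexpanded glob when no matching device exists.
        [ -r "$f" ] || continue
        dev=${f#"$root"/}
        dev=${dev%%/*}
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}

show_timeouts
```

Drives that took the else branch of the loop should read 180; drives that accepted the scterc setting keep whatever timeout they had before.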
 
1 members found this post helpful.
Old 04-07-2020, 09:12 AM   #13
dimm0k
Member
 
Registered: May 2008
Location: Brooklyn ZOO
Distribution: Slackware64 14.2
Posts: 545

Original Poster
Rep: Reputation: 54
Quote:
Originally Posted by Richard Cranium View Post
One more thing, I've put this into my /etc/rc.d/rc.local (which is more RAID related than smartd related)...
thanks for this, I definitely was not aware of it! while it can be used on my desktop, it unfortunately does not work for the older drives I have on my backup server! this does shed more light on what happened with my drive on this system.
 
Old 04-08-2020, 01:01 AM   #14
Richard Cranium
Senior Member
 
Registered: Apr 2009
Location: Carrollton, Texas
Distribution: Slackware64 14.2
Posts: 3,714

Rep: Reputation: 2060
To be honest, I'm fairly certain that someone else on the forum mentioned the timeout mismatch; I don't remember who did so or when they did.

While it's possible that I ran across this while reading the RAID wiki, the mere fact that this is the first time I've posted anything about it tells me that someone else beat me to the punch. Hopefully they'll show up and tell us when I saw said mention.
 
  

