[SOLVED] Mysterious errors with SSD

1337_powerslacker · 09-14-2016, 01:02 PM

For the past two weeks or so, I have been seeing this error early in the boot process:

Quote:

[ 8.507719] blk_update_request: I/O error, dev sda, sector 29390672
[ 8.509482] blk_update_request: I/O error, dev sda, sector 29390696
[ 8.511227] blk_update_request: I/O error, dev sda, sector 29390704
[ 8.512913] blk_update_request: I/O error, dev sda, sector 29390712
[ 8.514563] blk_update_request: I/O error, dev sda, sector 29390720
[ 8.516222] blk_update_request: I/O error, dev sda, sector 29390728
[ 8.517800] blk_update_request: I/O error, dev sda, sector 29390776
[ 8.519364] blk_update_request: I/O error, dev sda, sector 29390784
[ 8.520917] blk_update_request: I/O error, dev sda, sector 29390792
[ 8.522415] blk_update_request: I/O error, dev sda, sector 29390800
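
To see how many errors there are and where they sit on the disk, the log lines can be summarised with a small helper. This is only a sketch: the function name `summarise_io_errors` is made up for illustration, and it assumes 512-byte logical sectors (which is what smartctl reports for this drive) and the message wording shown in the quote above.

```shell
# Summarise block-layer I/O errors from kernel log text read on stdin.
# Assumption: 512-byte logical sectors, message format as quoted above.
summarise_io_errors() {
    awk '
    /blk_update_request: I\/O error/ {
        # Take the number that follows the word "sector".
        for (i = 1; i < NF; i++) {
            if ($i == "sector") {
                s = $(i + 1) + 0
                if (n == 0 || s < min) min = s
                if (n == 0 || s > max) max = s
                n++
            }
        }
    }
    END {
        if (n == 0) { print "no I/O errors found"; exit }
        printf "%d errors, sectors %d to %d (about %.2f GiB into the disk)\n", n, min, max, min * 512 / 1073741824
    }'
}

# Usage: dmesg | summarise_io_errors
```

Fed the ten lines quoted above, this should report sectors 29390672 to 29390800, roughly 14 GiB into the disk, so all of the errors fall in one small region.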

I Googled the error, and one of the sites I found said that the drive might be failing and suggested running the smartctl command. The specific command and results are shown below:

Code:

sudo smartctl -a -d ata /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.7.3-ck3] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs                                                                                
Device Model:     MKNSSDEC240GB                                                                                        
Serial Number:    ME151116100077F27                                                                                    
LU WWN Device Id: 5 888914 100077f27                                                                                   
Firmware Version: 604ABBF0                                                                                             
User Capacity:    240,057,409,536 bytes [240 GB]                                                                       
Sector Size:      512 bytes logical/physical                                                                           
Rotation Rate:    Solid State Device                                                                                   
Device is:        In smartctl database [for details use: -P show]                                                      
ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3                                                                
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)                                                               
Local Time is:    Wed Sep 14 12:46:41 2016 CDT                                                                         
SMART support is: Available - device has SMART capability.                                                             
SMART support is: Enabled                                                                                              

=== START OF READ SMART DATA SECTION ===                                                                               
SMART overall-health self-assessment test result: PASSED                                                               

General SMART Values:                                                                                                  
Offline data collection status:  (0x00) Offline data collection activity                                               


Self-test execution status:      (   0) The previous self-test routine completed                                       


Total time to complete Offline                                                                                         
data collection:                (    0) seconds.                                                                       
Offline data collection                                                                                                
capabilities:                    (0x7d) SMART execute Offline immediate.                                               







SMART capabilities:            (0x0003) Saves SMART data before entering                                               


Error logging capability:        (0x01) Error logging supported.                                                       

Short self-test routine                                                                                                
recommended polling time:        (   1) minutes.                                                                       
Extended self-test routine                                                                                             
recommended polling time:        (  48) minutes.                                                                       
Conveyance self-test routine                                                                                           
recommended polling time:        (   2) minutes.                                                                       
SCT capabilities:              (0x0025) SCT Status supported.                                                          


SMART Attributes Data Structure revision number: 10                                                                    
Vendor Specific SMART Attributes with Thresholds:                                                                      
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                       
  1 Raw_Read_Error_Rate     0x0032   120   120   050    Old_age   Always       -       0/0                             
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0                               
  9 Power_On_Hours_and_Msec 0x0032   097   097   000    Old_age   Always       -       2864h+42m+18.480s               
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       535                             
171 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0                               
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0                               
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       76                              
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       0                               
181 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0                               
182 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0                               
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always       -       0                               
190 Airflow_Temperature_Cel 0x0000   028   050   000    Old_age   Offline      -       28 (Min/Max 17/50)              
194 Temperature_Celsius     0x0022   028   050   000    Old_age   Always       -       28 (Min/Max 17/50)              
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/0
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       0/0
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       0/0
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       8589934592
233 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       1572
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       492
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       492
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       857

SMART Error Log not supported

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The site also said that the drive of the other user who posted similar data was faulty and needed to be replaced. This is a Mushkin drive that is only a few months old. Do I really need to replace it so soon?

Thanks for any input!

Regards,

Matt

phenixia2003 · 09-14-2016, 01:35 PM

Hello,

I'm not an expert, but I see nothing wrong in your smartctl report. The drive seems to be in good health.

You can try to run a short and/or a long self-test:

Code:

$ smartctl --test=short /dev/sda

$ smartctl --test=long /dev/sda

According to the recommended polling times in your report, a short test should take about 1 minute and a long test about 48 minutes. When the test has finished, run smartctl as below to get the test results:

Code:

$ smartctl -l selftest /dev/sda
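
If you want to script the check rather than read the log by eye, smartctl also reports problems through its exit status, which is a bitmask. The sketch below decodes it; the function name `explain_smartctl_status` is hypothetical, and the bit meanings are paraphrased from the EXIT STATUS section of the smartctl man page, so check them against your installed version.

```shell
# Decode the exit status of a smartctl run (bitmask, per the man page).
explain_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problems flagged"
        return 0
    fi
    [ $((status & 1))   -ne 0 ] && echo "command line did not parse"
    [ $((status & 2))   -ne 0 ] && echo "device open failed"
    [ $((status & 4))   -ne 0 ] && echo "a SMART or ATA command failed"
    [ $((status & 8))   -ne 0 ] && echo "SMART status: DISK FAILING"
    [ $((status & 16))  -ne 0 ] && echo "prefail attribute at or below threshold"
    [ $((status & 32))  -ne 0 ] && echo "an attribute was at or below threshold in the past"
    [ $((status & 64))  -ne 0 ] && echo "error log contains errors"
    [ $((status & 128)) -ne 0 ] && echo "self-test log contains errors"
    return 1
}

# Usage (as root), once the self-test has had time to finish:
#   smartctl -l selftest /dev/sda; explain_smartctl_status $?
```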

You can also check your sata cable, and even test with another if possible.

--
SeB

solarfields · 09-14-2016, 02:27 PM

I had a disk that died after showing this kind of error. I remember the I/O error messages. Back up if you can.

bassmadrigal · 09-14-2016, 02:50 PM

As phenixia2003 said, your drive doesn't seem to be showing any SMART errors. The main one to look out for specifically with SSDs is Reallocated_Event_Count. This "error" will start cropping up when you go beyond the usable write count for a cell and the drive has to start moving data off worn-out cells. Your value is still at 0, so nothing has worn out according to the SMART data.
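
To pull just the error and wear counters out of the attribute table, something like the following can help. It is a rough sketch: `check_wear_attributes` is a made-up name, the attribute list is simply taken from the names in the report above (other drives use different names), and it assumes the raw value is in the tenth column as in that report.

```shell
# Print the error/wear-related rows of a smartctl attribute table read on
# stdin and mark any whose raw value is above zero.
check_wear_attributes() {
    awk '
    $2 ~ /^(Retired_Block_Count|Reallocated_Event_Count|Program_Fail_Count|Erase_Fail_Count|Reported_Uncorrect)$/ {
        raw = $10 + 0
        print $1, $2, raw, (raw > 0 ? "CHECK" : "ok")
    }'
}

# Usage (as root): smartctl -A /dev/sda | check_wear_attributes
```

A row marked CHECK is only a prompt to look closer, not proof that the drive is failing.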

I would also do as phenixia2003 recommended and check your cable that it's fully seated or replace it with another cable.

But then, it is always possible that the error showing up in your dmesg is something that SMART doesn't log, so your drive could be a dud. If you have another computer, it might be worth testing the drive in there to see if you get the same warning (which would rule out the motherboard as the cause).

As always, it wouldn't hurt to back up your important stuff, just in case things go sideways.

Ilgar · 09-14-2016, 02:57 PM

Like phenixia2003, I also think that the smartctl output looks OK, except maybe

Code:

174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       76

Normally, failed writes/reads should also show up in the relevant fields of smartctl. Perhaps it's not the drive itself but the wiring?
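
One way to look for wiring trouble is to count the libata messages in the kernel log that often accompany a flaky SATA link. This is a sketch under assumptions: `count_link_events` is a hypothetical helper, the patterns are common libata messages whose exact wording varies between kernel versions, and none of them appear in the log excerpt posted above.

```shell
# Count kernel log lines (read on stdin) that often accompany SATA link trouble.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_link_events() {
    grep -E -c 'ata[0-9]+(\.[0-9]+)?: (hard resetting link|SError|failed command|exception Emask)' || true
}

# Usage: dmesg | count_link_events
```

A count that keeps growing between boots would point more towards the cable or port than towards the flash itself.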

kjhambrick · 09-14-2016, 03:11 PM

1337_powerslacker --

Like Ilgar, it seems to me that the Unexpect_Power_Loss_Ct is suspicious.

If the cable or the interface on the MoBo is bad, all the drive would know is 'power loss'

OTOH, the OS might see some sort of request error.

Check the Cable ( and Card, if one exists ) ?

-- kjh

(The only other oddity I see is that I've never had a drive in the smartctl database ... mine are all reported as not in the database.)

BratPit · 09-17-2016, 03:13 AM

http://lkcl.net/reports/ssd_analysis.html
https://news.ycombinator.com/item?id=10552218
https://forums.anandtech.com/threads...ction.2452606/

1. Check filesystem

OR

2. Do a so called "secure erase procedure"

https://www.thomas-krenn.com/en/wiki/SSD_Secure_Erase
https://www.unixmen.com/secure-erase-your-ssd/

kjhambrick · 09-17-2016, 06:06 AM

Quote:

Originally Posted by BratPit

... <snip> ...

OR

2. Do a so called "secure erase procedure"

https://www.thomas-krenn.com/en/wiki/SSD_Secure_Erase
https://www.unixmen.com/secure-erase-your-ssd/

Eeek !!!

I've not read the content of the 'Secure Erase' links, but unless they've discovered such a thing as a 'non-destructive Secure Erase', #2 above sounds like a last resort?

Or am I missing something ?

If I understand Secure Erase correctly and the drive has actually gone bad, then a Sledge Hammer is MUCH quicker than Secure Erase, and it takes MUCH LESS effort to erase an SSD with a Sledge Hammer than it does an HDD.

Looking at 1337_powerslacker's `smartctl` Report, the drive itself looks OK.

IMO, check the Interface Components ( Cable and SATA Connector and optionally any SATA Card ) before doing anything else.

-- kjh

BratPit · 09-17-2016, 07:29 AM

SE is there to bring the disk back to life if possible, not the mobo, SATA interface, filesystem, etc.
Those must be checked first.
It costs you your data.
If it fails, then your Sledge Hammer is a very reasonable option, but not the first one.

"Report, the drive itself looks OK"

Yes, "itself", but it tells you nothing about a possible power failure in the controller :-)

Sometimes SMART means not so smart.

1337_powerslacker · 09-17-2016, 08:46 AM

Well, as has been suggested several times, I opened my case and re-seated the power and SATA cables, so that there's no question of incomplete electrical contact on either. I booted up, and no errors occurred. So I think this was a one-off, where there may have been some inadvertent jostling of cables when I was fiddling in my computer's internals.

Thanks for the suggestions, everyone! It was much appreciated!

Happy Slacking!