Help needed debugging disk errors

Olek · 11-05-2016, 06:20 AM

Run

Code:

# smartctl -t long /dev/sda

After running this command, you will be told when the test is expected to finish.
For example, the test on my 3TB disk takes about 5 hours.

After the test ends, run

Code:

smartctl -a /dev/sda

and you will see the real number of pending sectors.
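
If you only want that one counter rather than the whole report, something like the following might do. It is only a sketch: the function name pending_sectors is made up, and it assumes the ATA attribute table layout that smartctl prints, with the raw value in the tenth column.

```shell
# Hypothetical helper: print the raw Current_Pending_Sector value
# from smartctl attribute output read on stdin.
pending_sectors() {
    # Column 2 is the attribute name, column 10 is RAW_VALUE.
    awk '$2 == "Current_Pending_Sector" { print $10 }'
}

# Usage (as root): smartctl -A /dev/sda | pending_sectors
```

Here -A limits smartctl to the attribute table, so there is less output to scan.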

rknichols · 11-05-2016, 09:05 AM

That increase in the pending sector count doesn't necessarily mean that anything changed. A bad sector won't be discovered and marked "pending" until something tries to read it.

I have to wonder, though, whether something might have turned off the drive's automatic defect management. That would explain the write error on the bad sector. I thought that modern drives no longer had the ability to turn that off, but perhaps yours is one of the exceptions. See the paragraph for the "-D" option in the hdparm manpage.

rknichols · 11-05-2016, 09:06 AM

Quote:

Originally Posted by Olek

Run

Code:

# smartctl -t long /dev/sda

After running this command, you will be told when the test is expected to finish.
For example, the test on my 3TB disk takes about 5 hours.

After the test ends, run

Code:

smartctl -a /dev/sda

and you will see the real number of pending sectors.

Unfortunately, that test stops on the first error it encounters, so it won't uncover further bad sectors.
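
One possible workaround, sketched here and not tried on this drive: on ATA drives that support it, smartctl's selective self-test (-t select,FIRST-LAST, where LAST may be given as max) scans a chosen LBA range, so a scan could be restarted just past the LBA reported in the self-test log. The helper name resume_cmd is made up, and it only prints the command instead of running it.

```shell
# Hypothetical helper: print the smartctl command that would start a
# selective self-test just past a failed LBA. Nothing is executed.
# $1 = device, $2 = LBA_of_first_error taken from the self-test log
resume_cmd() {
    printf 'smartctl -t select,%s-max %s\n' "$(( $2 + 1 ))" "$1"
}

# Example: resume_cmd /dev/sda 1000
```

Each such run would presumably still stop at the next unreadable sector, so mapping a badly damaged disk this way could take many rounds.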

atelszewski · 11-05-2016, 12:59 PM

Hi,

For all of you SMART people (no pun intended :-)), here is the output after smartctl -t long (yes, I waited for the requested time before using the -a switch):

Code:

$ smartctl -a /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.29] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HTE721010A9E630
Serial Number:    JR10034M2Y2MXK
LU WWN Device Id: 5 000cca 8a8e967b0
Firmware Version: JB0OA3M0
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Nov  5 18:49:12 2016 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		(   45) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 170) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       65536
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   127   127   033    Pre-fail  Always       -       2
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   094   094   000    Old_age   Always       -       2885
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
191 G-Sense_Error_Rate      0x000a   076   076   000    Old_age   Always       -       198415
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       3360
194 Temperature_Celsius     0x0002   181   181   000    Old_age   Always       -       33 (Min/Max 20/34)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       10
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       24
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2879         9548728
# 2  Short offline       Completed without error       00%      2783         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

--
Best regards,
Andrzej Telszewski

Emerson · 11-05-2016, 01:21 PM

Code:

1  Extended offline    Completed: read failure       90%      2879         9548728

Warranty. It failed at 10%.

rknichols · 11-05-2016, 01:27 PM

Quote:

Originally Posted by atelszewski

Code:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2879         9548728

As expected, the test found an error and stopped. This was less than 1% of the way through the 1953525168 logical (512-byte) sectors of the disk. Pointless.
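
As a rough cross-check of how far the scan got, the arithmetic can be sketched like this. The helper name scan_progress is made up, and the total of 1953525168 sectors is an assumption: the reported User Capacity (1,000,204,886,016 bytes) divided by the 512-byte logical sector size.

```shell
# Hypothetical helper: percentage of the disk scanned before the first error.
# $1 = LBA_of_first_error, $2 = total number of logical sectors
scan_progress() {
    # LC_ALL=C keeps the decimal point a "." regardless of locale.
    LC_ALL=C awk -v lba="$1" -v total="$2" \
        'BEGIN { printf "%.2f%%\n", 100 * lba / total }'
}

# Example: scan_progress 9548728 1953525168
```

With those numbers it prints 0.49%.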

If you really want to find out how many bad sectors there are, run

Code:

dd if=/dev/sda of=/dev/null bs=4k conv=noerror

and then look at the number of pending sectors. I do not recommend doing this before recovering whatever data you can. Beating on a dying disk just to see how bad it is is not productive, and can make the problems worse. Using ddrescue to make an image with the readable sectors would be a better alternative.
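
For the ddrescue route, a two-pass run along the lines of the example in the GNU ddrescue manual might look like the sketch below. The wrapper name rescue_plan and all paths are made up for illustration, and it only prints the commands. The image and map file are assumed to live on a different, healthy disk.

```shell
# Hypothetical wrapper: print a two-pass GNU ddrescue plan. Nothing is run.
# $1 = failing source disk
# $2 = image file on a different, healthy disk
# $3 = map file, which lets ddrescue resume and records the bad areas
rescue_plan() {
    # Pass 1: copy everything that reads easily, skip scraping bad areas (-n).
    printf 'ddrescue -n %s %s %s\n' "$1" "$2" "$3"
    # Pass 2: revisit the bad areas with direct disc access (-d), 3 retry passes.
    printf 'ddrescue -d -r3 %s %s %s\n' "$1" "$2" "$3"
}

# Example: rescue_plan /dev/sda /mnt/backup/sda.img /mnt/backup/sda.map
```

Doing the easy pass first gets most of the data off before the drive is stressed by retries on the damaged areas.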

atelszewski · 11-05-2016, 02:25 PM

Hi,

Just a side question.
Would it be wise to go with two SSDs in a RAID-1 configuration?
That's probably something I could afford financially.

Please note that it's my favorite toy machine.
I want it to be the best possible, within a sensible budget.
Loss of data wouldn't cause major injuries, and there are backups too.
It just feels better with the uptime ticking up continuously :-)

--
Best regards,
Andrzej Telszewski

Emerson · 11-05-2016, 02:30 PM

RAID-1 is for read speed. No redundancy really. Isn't SSD already fast enough for you?

atelszewski · 11-05-2016, 02:36 PM

Hi,

Quote:

Originally Posted by Emerson

RAID-1 is for read speed. No redundancy really. Isn't SSD already fast enough for you?

Have I misunderstood Wiki?
Aren't there two copies?

--
Best regards,
Andrzej Telszewski

Emerson · 11-05-2016, 03:29 PM

Two copies, yes. If one gets corrupted, the other one gets corrupted, too. Only if one drive dies suddenly will the other one have the data intact.

atelszewski · 11-05-2016, 03:33 PM

Hi,

Quote:

Originally Posted by Emerson

Two copies, yes. If one gets corrupted, the other one gets corrupted, too. Only if one drive dies suddenly will the other one have the data intact.

OK, that's what I was afraid of when I read about RAID-1.
So I would need something with error correction.
I'm gonna have a look at the possibilities, but most probably I'm gonna give up on the idea.

Thanks.

--
Best regards,
Andrzej Telszewski

rknichols · 11-05-2016, 04:09 PM

RAID-1 will protect against data loss due to a drive failure. That is one cause of data loss. There is no form of RAID that protects against the other causes of data loss, such as accidental deletion, overwriting, OS failures that corrupt the filesystem, etc. RAID is not a substitute for backups. And of course RAID adds its own complexity and modes of failure to the mix. Its primary function is to allow a system to keep running seamlessly while a failed drive is replaced. If that is important vs. the hours of down time while a failed drive is replaced and restored from backup, then you need RAID. Otherwise, not so much, aside from the bragging rights about your continuous uptime (assuming that your drives are hot-swappable -- which they probably are not).

atelszewski · 11-07-2016, 11:39 AM

Hi,

It wasn't possible to upgrade the hardware of this server.
I switched to another one of the same class, with a 250GB SSD.
Two fewer moving parts to wear out ;-)

--
Best regards,
Andrzej Telszewski