LinuxQuestions.org

atelszewski 11-03-2016 06:32 PM

Help needed debugging disk errors
 
Hi,

Slackware: 14.2
Kernel: 4.4.29
LVM2 based disk management.

How should I start debugging disk-related errors?

Below is an example of the problem. It happened after:
1) removepkg kernel-modules
2) installpkg kernel-modules
3) sync
4) echo 3 > /proc/sys/vm/drop_caches

I also ran into a similar situation while rsync-ing a Slackware mirror.

Any help appreciated!

Code:

[213373.488513] sd 0:0:0:0: [sda] tag#9 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[213373.488522] sd 0:0:0:0: [sda] tag#9 CDB: opcode=0x2a 2a 00 00 8a ce 20 00 05 b8 00
[213373.488526] blk_update_request: I/O error, dev sda, sector 9096736
[213373.488537] EXT4-fs warning (device dm-0): ext4_end_bio:329: I/O error -5 writing to inode 265224 (offset 0 size 16384 starting block 1103812)
[213373.488542] Buffer I/O error on device dm-0, logical block 1103812
[213373.488548] Buffer I/O error on device dm-0, logical block 1103813
[213373.488551] Buffer I/O error on device dm-0, logical block 1103814
[213373.488554] Buffer I/O error on device dm-0, logical block 1103815
[213373.488562] EXT4-fs warning (device dm-0): ext4_end_bio:329: I/O error -5 writing to inode 265225 (offset 0 size 167936 starting block 1103816)
[213373.488565] Buffer I/O error on device dm-0, logical block 1103816
[213373.488569] Buffer I/O error on device dm-0, logical block 1103817
[213373.488572] Buffer I/O error on device dm-0, logical block 1103818
[213373.488574] Buffer I/O error on device dm-0, logical block 1103819
[213373.488577] Buffer I/O error on device dm-0, logical block 1103820
[213373.488580] Buffer I/O error on device dm-0, logical block 1103821
[213373.488608] EXT4-fs warning (device dm-0): ext4_end_bio:329: I/O error -5 writing to inode 265227 (offset 0 size 28672 starting block 1103857)
[213373.488619] EXT4-fs warning (device dm-0): ext4_end_bio:329: I/O error -5 writing to inode 265228 (offset 0 size 16384 starting block 1103864)
[213373.488629] EXT4-fs warning (device dm-0): ext4_end_bio:329: I/O error -5 writing to inode 265230 (offset 0 size 229376 starting block 1103868)
[213373.488675] EXT4-fs warning (device dm-0): ext4_end_bio:329: I/O error -5 writing to inode 265234 (offset 0 size 200704 starting block 1103924)
[213373.488712] EXT4-fs warning (device dm-0): ext4_end_bio:329: I/O error -5 writing to inode 265235 (offset 0 size 69632 starting block 1103973)
[213373.488729] EXT4-fs warning (device dm-0): ext4_end_bio:329: I/O error -5 writing to inode 265237 (offset 0 size 20480 starting block 1103990)
[213373.488771] sd 0:0:0:0: [sda] tag#8 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[213373.488777] sd 0:0:0:0: [sda] tag#8 CDB: opcode=0x2a 2a 00 00 41 33 00 00 01 78 00
[213373.488780] blk_update_request: I/O error, dev sda, sector 4272896
[213373.488877] sd 0:0:0:0: [sda] tag#7 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[213373.488881] sd 0:0:0:0: [sda] tag#7 CDB: opcode=0x2a 2a 00 00 88 a6 e8 00 00 08 00
[213373.488883] blk_update_request: I/O error, dev sda, sector 8955624
[213373.488887] EXT4-fs warning (device dm-0): ext4_end_bio:329: I/O error -5 writing to inode 265547 (offset 0 size 4096 starting block 1086173)
[213373.488901] sd 0:0:0:0: [sda] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[213373.488927] sd 0:0:0:0: [sda] tag#6 CDB: opcode=0x2a 2a 00 00 88 a5 e8 00 00 08 00
[213373.488929] blk_update_request: I/O error, dev sda, sector 8955368
[213373.488934] EXT4-fs warning (device dm-0): ext4_end_bio:329: I/O error -5 writing to inode 265546 (offset 0 size 4096 starting block 1086141)
[213373.488950] sd 0:0:0:0: [sda] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[213373.488954] sd 0:0:0:0: [sda] tag#5 CDB: opcode=0x2a 2a 00 00 88 9e 58 00 00 08 00
[213373.488956] blk_update_request: I/O error, dev sda, sector 8953432
[213373.488970] sd 0:0:0:0: [sda] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[213373.488973] sd 0:0:0:0: [sda] tag#4 CDB: opcode=0x2a 2a 00 00 88 92 a8 00 00 08 00
[213373.488975] blk_update_request: I/O error, dev sda, sector 8950440
[213373.488988] sd 0:0:0:0: [sda] tag#3 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[213373.488992] sd 0:0:0:0: [sda] tag#3 CDB: opcode=0x2a 2a 00 00 88 86 e8 00 00 08 00
[213373.488994] blk_update_request: I/O error, dev sda, sector 8947432
[213373.489006] sd 0:0:0:0: [sda] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[213373.489010] sd 0:0:0:0: [sda] tag#2 CDB: opcode=0x2a 2a 00 00 88 7d 38 00 00 08 00
[213373.489012] blk_update_request: I/O error, dev sda, sector 8944952
[213373.489025] sd 0:0:0:0: [sda] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[213373.489028] sd 0:0:0:0: [sda] tag#1 CDB: opcode=0x2a 2a 00 00 88 6f 40 00 00 08 00
[213373.489030] blk_update_request: I/O error, dev sda, sector 8941376
[213373.489043] sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[213373.489046] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x2a 2a 00 00 88 6d f8 00 00 08 00
[213373.489048] blk_update_request: I/O error, dev sda, sector 8941048
[213373.489261] Aborting journal on device dm-0-8.
[213378.520820] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
[213378.608874] EXT4-fs (dm-0): Remounting filesystem read-only
[213396.727021] EXT4-fs error (device dm-0): ext4_journal_check_start:56:
[213396.727081] EXT4-fs error (device dm-0): ext4_journal_check_start:56:
[213396.727082] Detected aborted journal

[213396.727123] EXT4-fs error (device dm-0): ext4_journal_check_start:56:
[213396.727124] Detected aborted journal

[213396.728011] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 1011 pages, ino 266299; err -30
[213396.990694] Detected aborted journal

[213505.988562] bash (3488): drop_caches: 3

--
Best regards,
Andrzej Telszewski

rknichols 11-03-2016 06:43 PM

Perhaps get a new disk and restore from backup? Is this a conventional, rotating disk or an SSD? It is awfully unusual for writes to fail on a rotating disk unless the drive has exhausted its supply of spare sectors. For an SSD, becoming read-only is a common way for the drive to protect your data when the drive starts to fail.

What is the output from "smartctl -A /dev/sda"? (Please wrap it in [CODE]...[/CODE] tags to preserve formatting.)

bassmadrigal 11-03-2016 06:55 PM

Quote:

Originally Posted by rknichols (Post 5626684)
For an SSD, becoming read-only is a common way for the drive to protect your data when the drive starts to fail.

However, I believe most will brick themselves when the system is next rebooted (don't really understand why they don't just go permanently into a read-only state)... at least that is what the SSD challenge from TechReport seemed to indicate. Most drives failed into read-only mode, but only until a reboot, after which the drive effectively bricks itself and data becomes irretrievable.

So, before any reboots on a suspected failing SSD, it is best to ensure that any necessary data is backed up.

atelszewski 11-03-2016 07:06 PM

Hi,

The machine is an online.net server that has been running for just a couple of months (maybe 3).
The disk is (or at least should be) fairly new and lightly used. It's an HDD.

It looks like the problem starts when there is more write I/O, e.g. rsync, but I was rsync-ing before without problems.
The machine has 16GB of RAM and it's almost always fully used by buff/cache, if that matters.
Maybe it's just my impression, but the shell prompt feels kinda sluggish, e.g. "free -m" sometimes takes seconds to execute. It used to be fast.

BTW, I had to run fsck before the system would boot again; there were those strange (TM) messages about orphaned inodes, something about bitmaps, etc. Should I treat the filesystem/OS as due for re-installation, e.g. could some file permissions be broken or some config file be corrupted? Re-installing is not a big deal, just time-consuming.

BTW2, if I recall correctly, SMART was disabled on the disk and I enabled it with some smartctl switch some days ago, if that matters. Unfortunately I don't remember exactly what it was and .bash_history does not have it.

Code:

$ smartctl -A /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.29] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   099   099   062    Pre-fail  Always       -       131072
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   127   127   033    Pre-fail  Always       -       2
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       21
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   094   094   000    Old_age   Always       -       2843
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
191 G-Sense_Error_Rate      0x000a   076   076   000    Old_age   Always       -       198415
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       3357
194 Temperature_Celsius     0x0002   187   187   000    Old_age   Always       -       32 (Min/Max 20/34)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       10
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

--
Best regards,
Andrzej Telszewski

syg00 11-03-2016 07:55 PM

The filesystem errors are probably a result of the disk starting to fail. The 196 events aren't good, the 197 events are bad. See wikipedia for a description.
Me, I'd get a new disk - doesn't matter how old the current one is.

Emerson 11-03-2016 08:02 PM

Code:

197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       16
Definitely replace the drive.

rknichols 11-03-2016 10:43 PM

I have to agree that the disk doesn't look good. If those bad sectors were in some laptop that was getting bounced around while it was running I wouldn't get too excited about them, but it's not something you want to see in a server, especially one that's been running only ~118 days. Disks are cheap. Get rid of it.

MarcT 11-04-2016 12:56 PM

The drive encountered write errors on certain sectors of the disk. The sector numbers are given in the log output, e.g. 9096736, 4272896, 8955624, 8955368, etc.
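
Those sector numbers can be cross-checked against the "CDB:" lines in the same log. The following is only a sketch, assuming the standard 10-byte WRITE(10) layout (opcode 0x2a, bytes 2-5 holding the big-endian LBA, bytes 7-8 holding the transfer length in logical blocks). The function name decode_cdb10 is made up for this example.

```shell
# Decode a 10-byte SCSI CDB given as space-separated hex bytes,
# exactly as the kernel prints it after "CDB: opcode=0x2a".
# Note: sets the global variables lba and blocks (plain POSIX sh, no "local").
decode_cdb10() {
    # Split the string into positional parameters $1..$10.
    set -- $1
    if [ "$#" -ne 10 ]; then
        echo "expected 10 hex bytes" >&2
        return 1
    fi
    lba=$(( 0x$3$4$5$6 ))      # bytes 2-5: logical block address
    blocks=$(( 0x$8$9 ))       # bytes 7-8: number of logical blocks
    echo "opcode=0x$1 lba=$lba blocks=$blocks"
}

# First failed command from the log above.
decode_cdb10 "2a 00 00 8a ce 20 00 05 b8 00"   # prints: opcode=0x2a lba=9096736 blocks=1464
```

The decoded LBA matches the "I/O error, dev sda, sector 9096736" line, and the block count shows that this one failed request covered 1464 logical sectors, so a single log entry does not necessarily point at a single bad sector.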

First thing to do is make a backup. Do it now. The drive could fail catastrophically. However, sometimes the number of "bad sectors" will stabilise, and you might get useful life out of the disk in the future. It all depends on whether the values in the critical SMART parameters above keep increasing.

Usually a disk will reallocate (remap) bad sectors on a write, but for a drive with 4096 byte physical sectors (a "4K drive"), it can only do this if you re-write the entire 4k sector. Most operating systems use 512-byte logical sectors, so there are 8 logical sectors per 4K drive physical sector.

See whether you have a "4k" drive by running "smartctl -a /dev/sda", and looking for this:

Quote:

Sector Sizes: 512 bytes logical, 4096 bytes physical

Now you can use:

Quote:

hdparm --read-sector X /dev/sda
to check these sectors. What I'd do is work out the ranges of bad sectors around each error. For a 4k drive, you'd expect a multiple of 8 logical sectors surrounding each identified bad sector.
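
The range arithmetic can be sketched in shell. This assumes 8 logical sectors per physical sector and that physical sectors are aligned to LBA 0 (some drives report an alignment offset, which this ignores). phys_sector_range and print_read_commands are hypothetical helper names, and the second one only prints the hdparm commands instead of running them.

```shell
# Print the first and last logical sector of the 4K physical sector
# that contains the logical sector number given as $1.
phys_sector_range() {
    first=$(( $1 - $1 % 8 ))
    echo "$first $(( first + 7 ))"
}

# Dry run: print one "hdparm --read-sector" command per logical sector
# in that range. $1 = logical sector, $2 = device.
print_read_commands() {
    range=$(phys_sector_range "$1")
    lba=${range% *}      # text before the space: first sector
    last=${range#* }     # text after the space: last sector
    while [ "$lba" -le "$last" ]; do
        echo "hdparm --read-sector $lba $2"
        lba=$(( lba + 1 ))
    done
}

phys_sector_range 9548730            # prints: 9548728 9548735
print_read_commands 9096736 /dev/sda
```

The second call prints eight commands, for sectors 9096736 through 9096743. Review the printed list before running any of it by hand.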

If you're feeling lucky, you can then use:
Quote:

hdparm --yes-i-know-what-i-am-doing --write-sector X /dev/sda
...which will write zeros into the "bad" sector (hdparm refuses to run --write-sector without that extra flag). If you zero the entire 4k range (i.e. 8 sectors), it should provoke the drive to remap the bad physical sector. A subsequent read of the range should then succeed. If so, congratulations, you've "fixed" the bad sector.

Repeat for all the identified bad sectors...

The problem now is you have a filesystem with some random "holes" in it. These could be within user files (in which case the file will now be corrupted), or filesystem meta-data, or just in unallocated space (if you're lucky!). It's possible to perform filesystem "forensics" to determine what was allocated to those sectors, but it's not trivial. For now, just force fsck the filesystem and allow it to repair any structural errors. Keep all the logs, as they may be useful for forensics in the future.
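
As a starting point for that kind of forensics, a disk sector can be translated back into an ext4 block number. The sketch below rests on two assumptions that are not confirmed anywhere in the thread: the filesystem uses 4096-byte blocks (8 sectors per block), and dm-0 is one contiguous region of sda. Under those assumptions, the pairs in the kernel log (sector 9096736 with block 1103812, sector 8955624 with block 1086173, sector 8955368 with block 1086141) all give the same offset of 266240 sectors. The function names are made up for this example.

```shell
# Offset of the dm-0 volume on sda in 512-byte sectors. Inferred from the
# sector/block pairs in the kernel log, NOT read from the LVM metadata.
LV_OFFSET=266240
SECTORS_PER_BLOCK=8     # assumes 4096-byte ext4 blocks

# sda sector number -> ext4 block number on dm-0
sector_to_fsblock() {
    echo $(( ($1 - LV_OFFSET) / SECTORS_PER_BLOCK ))
}

# ext4 block number on dm-0 -> sda sector number
fsblock_to_sector() {
    echo $(( $1 * SECTORS_PER_BLOCK + LV_OFFSET ))
}

fsblock_to_sector 1103812    # prints: 9096736
sector_to_fsblock 8955624    # prints: 1086173
```

A block number obtained this way can then be looked up with the icheck request of debugfs to find the owning inode, and the inode with ncheck to find a path name. If the volume is split into several segments, the single offset above does not hold.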

Keep an eye on the SMART stats to see if any of items #5 & #196-198 continue to increase. If they do, it's probably time to replace the drive.
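
One way to keep that eye on them is to filter the attribute table down to those four rows. This is a minimal sketch that assumes the ten-column layout of "smartctl -A" shown earlier in the thread; smart_watch is a hypothetical name.

```shell
# Read "smartctl -A" output on stdin and print "ID NAME RAW_VALUE"
# for attributes 5, 196, 197 and 198 only.
smart_watch() {
    awk '$1 ~ /^(5|196|197|198)$/ { print $1, $2, $10 }'
}

# Sample rows copied from the first SMART listing in this thread.
smart_watch <<'EOF'
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct  0x0033  100  100  005    Pre-fail  Always      -      0
194 Temperature_Celsius    0x0002  187  187  000    Old_age  Always      -      32 (Min/Max 20/34)
196 Reallocated_Event_Count 0x0032  100  100  000    Old_age  Always      -      10
197 Current_Pending_Sector  0x0022  100  100  000    Old_age  Always      -      16
198 Offline_Uncorrectable  0x0008  100  100  000    Old_age  Offline      -      0
EOF
```

On a live system the input would come from "smartctl -A /dev/sda | smart_watch". Saving the result with a date each time makes it easy to see whether the raw values are still climbing.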

FWIW, on our storage platform we normally replace drives if they exceed 600 bad sectors. However, the storage is mirrored (RAID1) so there are two copies of everything enabling lost sectors to be restored from the mirrored disk.

Regards,
Marc

atelszewski 11-04-2016 01:35 PM

Hi,

Thank you all for your replies.
I reported the problem to online.net support and after investigation they replaced the disk*.

But I don't mind if you keep posting interesting stuff here.

I did have a look at SMART, but to be honest, it is really hard to make sense of it if you don't have experience.
Even the replies to this thread aren't straightforward (i.e. replace immediately vs might be recoverable).

*) I can access my old server in rescue mode to recover data, and there is already a second server waiting for me to switch to.

--
Best regards,
Andrzej Telszewski

Diantre 11-04-2016 02:14 PM

Quote:

Originally Posted by atelszewski (Post 5627001)
I did have a look at SMART, but to be honest, it is really hard to make sense of it if you don't have experience.
Even the replies to this thread aren't straightforward (i.e. replace immediately vs might be recoverable).

There's some information about SMART in this thread, it may be useful in your case.

rknichols 11-04-2016 03:18 PM

Quote:

Originally Posted by MarcT (Post 5626984)
Usually a disk will reallocate (remap) bad sectors on a write, but for a drive with 4096 byte physical sectors (a "4K drive"), it can only do this if you re-write the entire 4k sector. Most operating systems use 512-byte logical sectors, so there are 8 logical sectors per 4K drive physical sector.

Good point. I hadn't thought of that. From the looks of it, the affected sectors are holding file system metadata (inodes). These days, all but the tiniest filesystems allocate 4096-byte blocks for data, so the issue of trying to write to a partial physical sector doesn't arise there. I find it a bit surprising that the inodes aren't read and written with that same block size.

Come to think of it, how did that inode ever get into memory if the sector containing it is bad?

Drakeo 11-04-2016 08:23 PM

there are hundreds of thousands of replacement sectors to be used. Yours are used up get a new disk.

rknichols 11-04-2016 08:39 PM

Quote:

Originally Posted by Drakeo (Post 5627123)
there are hundreds of thousands of replacement sectors to be used. Yours are used up get a new disk.

The SMART report disagrees with you:
Code:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0

No spare sectors have been used.

Emerson 11-04-2016 08:51 PM

This is a good question, I just had a disk failure, 0 reallocated sectors, yet disk had plenty of bad sectors. Maybe SMART is not that smart after all.

atelszewski 11-05-2016 05:50 AM

Hi,

An update :-^

Code:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   093   093   062    Pre-fail  Always       -       2555904
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   127   127   033    Pre-fail  Always       -       2
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   094   094   000    Old_age   Always       -       2878
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
191 G-Sense_Error_Rate      0x000a   076   076   000    Old_age   Always       -       198415
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       3360
194 Temperature_Celsius     0x0002   181   181   000    Old_age   Always       -       33 (Min/Max 20/34)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       10
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       24
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

--
Best regards,
Andrzej Telszewski

Olek 11-05-2016 06:20 AM

Run
Code:

# smartctl -t long /dev/sda
After this command you will be told when the test is expected to finish.
For example, on my 3TB disk the test takes about 5 hours.

When the test has finished, run
Code:

smartctl -a /dev/sda
and you will see the real number of pending sectors.

rknichols 11-05-2016 09:05 AM

That increase in the pending sector count doesn't necessarily mean that anything changed. A bad sector won't be discovered and marked "pending" until something tries to read it.

I have to wonder, though, whether something might have turned off the drive's automatic defect management. That would explain the write error on the bad sector. I thought that modern drives no longer had the ability to turn that off, but perhaps yours is one of the exceptions. See the paragraph for the "-D" option in the hdparm manpage.

rknichols 11-05-2016 09:06 AM

Quote:

Originally Posted by Olek (Post 5627199)
Run
Code:

# smartctl -t long /dev/sda
After this command you will be told when the test is expected to finish.
For example, on my 3TB disk the test takes about 5 hours.

When the test has finished, run
Code:

smartctl -a /dev/sda
and you will see the real number of pending sectors.

Unfortunately, that test stops on the first error it encounters, so it won't uncover further bad sectors.

atelszewski 11-05-2016 12:59 PM

Hi,

For all of you SMART people (no pun intended :-)), after smartctl -t long (yes, I waited for the requested time before using -a switch):
Code:

$ smartctl -a /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.29] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:    HGST HTE721010A9E630
Serial Number:    JR10034M2Y2MXK
LU WWN Device Id: 5 000cca 8a8e967b0
Firmware Version: JB0OA3M0
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:  ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Nov  5 18:49:12 2016 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)        Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)        The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (  45) seconds.
Offline data collection
capabilities:                          (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003)        Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01)        Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:          (  2) minutes.
Extended self-test routine
recommended polling time:          ( 170) minutes.
SCT capabilities:                (0x003d)        SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       65536
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   127   127   033    Pre-fail  Always       -       2
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   094   094   000    Old_age   Always       -       2885
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
191 G-Sense_Error_Rate      0x000a   076   076   000    Old_age   Always       -       198415
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       3360
194 Temperature_Celsius     0x0002   181   181   000    Old_age   Always       -       33 (Min/Max 20/34)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       10
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       24
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure      90%      2879        9548728
# 2  Short offline      Completed without error      00%      2783        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

--
Best regards,
Andrzej Telszewski

Emerson 11-05-2016 01:21 PM

Code:

1  Extended offline    Completed: read failure      90%      2879        9548728
Warranty. It failed at 10%.

rknichols 11-05-2016 01:27 PM

Quote:

Originally Posted by atelszewski (Post 5627320)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 2879 9548728

As expected, the test found an error and stopped. This was less than 1% of the way through the 1953525168 (512-byte) sectors of the disk. Pointless.
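
The "less than 1%" figure can be checked with a little arithmetic. This sketch takes the capacity (1,000,204,886,016 bytes), the 512-byte logical sector size and the LBA of the first error from the smartctl -a output above; pct_scanned is a hypothetical helper name. Shell arithmetic is integer-only, so awk does the division.

```shell
# Percentage of the disk covered when the scan reached sector $1
# on a disk with $2 logical sectors in total.
pct_scanned() {
    awk -v lba="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * lba / total }'
}

# User Capacity divided by the logical sector size.
total_sectors=$(( 1000204886016 / 512 ))
echo "total sectors: $total_sectors"                                          # prints: total sectors: 1953525168
echo "scanned before first error: $(pct_scanned 9548728 "$total_sectors")%"   # prints: scanned before first error: 0.49%
```

The "Remaining" column of the self-test log appears to be reported only in coarse steps, which would explain why it shows 90% even though the scan stopped after roughly half a percent of the surface.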

If you really want to find out how many bad sectors there are, run
Code:

dd if=/dev/sda of=/dev/null bs=4k conv=noerror
and then look at the number of pending sectors. I do not recommend doing this before recovering whatever data you can. Beating on a dying disk just to see how bad it is is not productive, and can make the problems worse. Using ddrescue to make an image with the readable sectors would be a better alternative.

atelszewski 11-05-2016 02:25 PM

Hi,

Just a side question.
Would it be wise to go with 2 SSD-s in RAID-1 configuration?
That's probably something that I could afford from the monetary point of view.

Please note that it's my favorite toy machine.
I want it to be the best possible, within sensible budget.
Loss of data wouldn't cause major injuries, and there are backups too.
It just feels better with the uptime ticking up continuously :-)

--
Best regards,
Andrzej Telszewski

Emerson 11-05-2016 02:30 PM

RAID-1 is for read speed. No redundancy really. Isn't SSD already fast enough for you?

atelszewski 11-05-2016 02:36 PM

Hi,

Quote:

Originally Posted by Emerson (Post 5627355)
RAID-1 is for read speed. No redundancy really. Isn't SSD already fast enough for you?

Have I misunderstood Wiki?
Aren't there two copies?

--
Best regards,
Andrzej Telszewski

Emerson 11-05-2016 03:29 PM

Two copies, yes. One gets corrupted the other one gets corrupted, too. Only in case one drive dies suddenly the other one will have the data intact.

atelszewski 11-05-2016 03:33 PM

Hi,

Quote:

Originally Posted by Emerson (Post 5627366)
Two copies, yes. One gets corrupted the other one gets corrupted, too. Only in case one drive dies suddenly the other one will have the data intact.

OK, that's what I was afraid of when I read about RAID-1.
So I would need something with error correction.
I'm gonna have a look at the possibilities, but most probably I'm gonna give up on the idea.

Thanks.

--
Best regards,
Andrzej Telszewski

rknichols 11-05-2016 04:09 PM

RAID-1 will protect against data loss due to a drive failure. That is one cause of data loss. There is no form of RAID that protects against the other causes of data loss, such as accidental deletion, overwriting, OS failures that corrupt the filesystem, etc. RAID is not a substitute for backups. And of course RAID adds its own complexity and modes of failure to the mix. Its primary function is to allow a system to keep running seamlessly while a failed drive is replaced. If that is important vs. the hours of down time while a failed drive is replaced and restored from backup, then you need RAID. Otherwise, not so much, aside from the bragging rights about your continuous uptime (assuming that your drives are hot-swappable -- which they probably are not).

atelszewski 11-07-2016 11:39 AM

Hi,

There was no possibility to upgrade the hardware of this server.
I switched to another one of the same class, with a 250GB SSD.
2 fewer moving parts to wear out ;-)

--
Best regards,
Andrzej Telszewski


All times are GMT -5.