How does I/O error for HDD read looks like?
Posted 05-20-2011 at 04:01 AM by Web31337
Tags hard drive, hardware, io error, iowait
A bit of scary story with happy ending. Below I describe how failing HDD may look in your syslogs, and how I "resolved" this issue.
NOTICE: when you see I/O errors anywhere, or you see something like this in syslog YOUR FIRST TASK IS TO MAKE BACKUPS AS FAST AS POSSIBLE!!! ONLY WHEN YOU ARE SURE YOUR DATA IS NOW BACKED UP AND SAFE, you may proceed with something like "fixing"!
Recently had an accident involving few bad blocks on HDD. The target was Seagate 5900 rpm LP 2TB disk(identifier: ST32000542AS, CC34), quite new with one partition, formatted in ext2. While reading two files system hangs in iowait and then finally returns I/O error status.
Of course, I made backups to another HDD, assuming entire HDD can crash and then decided to investigate it a bit closer. Here's how it looks in syslog:
Actually, I had no much choice, since HDD is 2TB size and few bad blocks for such length can be an acceptable issue, which may be resolved by blacklisting failing sections.
So, from log above we can see a failing sector number. The funny thing is fsck unable to see those bad blocks, so we'll be using "badblocks" util. Let's calculate block numbers for it. As I had two files, I had messages about two sectors. Usually block size is 4096 but may vary, to be sure, check yours with:
Here are my blocks...
Next step: find out all bad blocks starting from these. You can choose full check with badblocks util, but that may take many hours. I am lazy guy, not really caring for that data since I have another copy, so I just calculated those above and issued the following command:
After few minutes of iowait I had four bad blocks in bad_sda1 output file.
Okay, let's move on with marking them:
After doing that kernel will no longer access those blocks, and also, I am able to operate with broken files(as they now moved from bad blocks).
Though, you should not take these habits from me: I was too lazy to give HDD back to the service(well, not because of that, more because of there was also some data on this HDD I was not sure I can erase completely, using "shred" util and because of knowing the expected answer(after months of waiting): "your HDD is fine and working good, no repair required") and too lazy to mark more blocks, just in case.
Currently this HDD works just fine. Could be only those blocks are really bad. If some more will appear, I'll just use another one. By the way, this seagate HDD is quite silent and ideal for storing movies/static data which doesn't require fast operation, which was my case. This post doesn't mark that model is bad: for instance, I have one more same HDD, not having any bad blocks at all(did a full check with badblocks).
NOTICE: when you see I/O errors anywhere, or you see something like this in syslog YOUR FIRST TASK IS TO MAKE BACKUPS AS FAST AS POSSIBLE!!! ONLY WHEN YOU ARE SURE YOUR DATA IS NOW BACKED UP AND SAFE, you may proceed with something like "fixing"!
Recently had an accident involving few bad blocks on HDD. The target was Seagate 5900 rpm LP 2TB disk(identifier: ST32000542AS, CC34), quite new with one partition, formatted in ext2. While reading two files system hangs in iowait and then finally returns I/O error status.
Of course, I made backups to another HDD, assuming entire HDD can crash and then decided to investigate it a bit closer. Here's how it looks in syslog:
Code:
2011-05-16T22:08:37.860907+06:00 (none) kernel: [270302.230913] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
2011-05-16T22:08:37.860921+06:00 (none) kernel: [270302.230918] ata1.00: BMDMA stat 0x64
2011-05-16T22:08:37.860923+06:00 (none) kernel: [270302.230924] ata1.00: failed command: READ DMA EXT
2011-05-16T22:08:37.860926+06:00 (none) kernel: [270302.230933] ata1.00: cmd 25/00:00:18:56:16/00:02:b2:00:00/e0 tag 0 dma 262144 in
2011-05-16T22:08:37.860929+06:00 (none) kernel: [270302.230935] res 51/40:00:42:56:16/40:00:b2:00:00/00 Emask 0x9 (media error)
2011-05-16T22:08:37.860931+06:00 (none) kernel: [270302.230940] ata1.00: status: { DRDY ERR }
2011-05-16T22:08:37.860932+06:00 (none) kernel: [270302.230943] ata1.00: error: { UNC }
2011-05-16T22:08:38.000754+06:00 (none) kernel: [270302.372899] ata1.00: configured for UDMA/133
2011-05-16T22:08:38.070857+06:00 (none) kernel: [270302.438758] ata1.01: configured for UDMA/133
2011-05-16T22:08:38.070867+06:00 (none) kernel: [270302.438870] ata1: EH complete
<some more similar message blocks>
2011-05-16T22:09:21.080543+06:00 (none) kernel: [270345.362698] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
2011-05-16T22:09:21.080557+06:00 (none) kernel: [270345.362703] ata1.00: BMDMA stat 0x64
2011-05-16T22:09:21.080560+06:00 (none) kernel: [270345.362708] ata1.00: failed command: READ DMA EXT
2011-05-16T22:09:21.080562+06:00 (none) kernel: [270345.362717] ata1.00: cmd 25/00:08:40:56:16/00:00:b2:00:00/e0 tag 0 dma 4096 in
2011-05-16T22:09:21.080565+06:00 (none) kernel: [270345.362720] res 51/40:00:42:56:16/40:00:b2:00:00/00 Emask 0x9 (media error)
2011-05-16T22:09:21.080567+06:00 (none) kernel: [270345.362724] ata1.00: status: { DRDY ERR }
2011-05-16T22:09:21.080569+06:00 (none) kernel: [270345.362728] ata1.00: error: { UNC }
2011-05-16T22:09:21.280463+06:00 (none) kernel: [270345.564456] ata1.00: configured for UDMA/133
2011-05-16T22:09:21.350489+06:00 (none) kernel: [270345.629995] ata1.01: configured for UDMA/133
2011-05-16T22:09:21.350500+06:00 (none) kernel: [270345.630014] sd 0:0:0:0: [sda] Unhandled sense code
2011-05-16T22:09:21.350503+06:00 (none) kernel: [270345.630018] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
2011-05-16T22:09:21.350506+06:00 (none) kernel: [270345.630024] sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
2011-05-16T22:09:21.350509+06:00 (none) kernel: [270345.630031] Descriptor sense data with sense descriptors (in hex):
2011-05-16T22:09:21.350511+06:00 (none) kernel: [270345.630035] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
2011-05-16T22:09:21.350513+06:00 (none) kernel: [270345.630049] b2 16 56 42
2011-05-16T22:09:21.350516+06:00 (none) kernel: [270345.630055] sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
2011-05-16T22:09:21.350520+06:00 (none) kernel: [270345.630062] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 b2 16 56 40 00 00 08 00
2011-05-16T22:09:21.350523+06:00 (none) kernel: [270345.630075] end_request: I/O error, dev sda, sector 2987808322
2011-05-16T22:09:21.350525+06:00 (none) kernel: [270345.630087] ata1: EH complete
So, from log above we can see a failing sector number. The funny thing is fsck unable to see those bad blocks, so we'll be using "badblocks" util. Let's calculate block numbers for it. As I had two files, I had messages about two sectors. Usually block size is 4096 but may vary, to be sure, check yours with:
Code:
dumpe2fs -h /path/to/block/device
Code:
bad sectors 2987808322 = real(4096) 373476040 //assumed 373470000 2994276891 = real(4096) 374284611 //assumed 374285000
Code:
badblocks -b 4096 -o bad_sda1 -s /dev/sda1 374285000 373470000
Okay, let's move on with marking them:
Code:
# fsck.ext2 -l bad_sda1 /dev/sda1 e2fsck 1.41.12 (17-May-2010) anime: Updating bad block inode. Pass 1: Checking inodes, blocks, and sizes Running additional passes to resolve blocks claimed by more than one inode... Pass 1B: Rescanning for multiply-claimed blocks Multiply-claimed block(s) in inode 85729291: 373475760 373475784 Multiply-claimed block(s) in inode 85729293: 374284352 374284355 Pass 1C: Scanning directories for inodes with multiply-claimed blocks Pass 1D: Reconciling multiply-claimed blocks (There are 2 inodes containing multiply-claimed blocks.) File /anime/Princess_Mononoke/VIDEO_TS/VTS_01_4.VOB (inode #85729291, mod time Sat May 14 11:12:32 2011) has 2 multiply-claimed block(s), shared with 1 file(s): <The bad blocks inode> (inode #1, mod time Wed May 18 20:29:55 2011) Clone multiply-claimed blocks<y>? yes Error reading block 373475784 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes File /anime/Princess_Mononoke/VIDEO_TS/VTS_01_6.VOB (inode #85729293, mod time Sat May 14 12:14:16 2011) has 2 multiply-claimed block(s), shared with 1 file(s): <The bad blocks inode> (inode #1, mod time Wed May 18 20:29:55 2011) Clone multiply-claimed blocks<y>? yes Error reading block 374284355 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Block bitmap differences: +373476016 +373476040 +374284608 +374284611 Fix<y>? yes Free blocks count wrong for group #0 (20983, counted=20979). Fix<y>? yes Free blocks count wrong (213400489, counted=213400485). Fix<y>? yes anime: ***** FILE SYSTEM WAS MODIFIED ***** anime: 46969/122101760 files (4.7% non-contiguous), 274977905/488378390 blocks
Though, you should not take these habits from me: I was too lazy to give HDD back to the service(well, not because of that, more because of there was also some data on this HDD I was not sure I can erase completely, using "shred" util and because of knowing the expected answer(after months of waiting): "your HDD is fine and working good, no repair required") and too lazy to mark more blocks, just in case.
Currently this HDD works just fine. Could be only those blocks are really bad. If some more will appear, I'll just use another one. By the way, this seagate HDD is quite silent and ideal for storing movies/static data which doesn't require fast operation, which was my case. This post doesn't mark that model is bad: for instance, I have one more same HDD, not having any bad blocks at all(did a full check with badblocks).
Total Comments 0




