LinuxQuestions.org
Did you know LQ has a Linux Hardware Compatibility List?
Go Back   LinuxQuestions.org > Blogs > In The Middle Of Nowhere [edK's blog]
User Name
Password

Notices

In The Middle Of Nowhere

Life, universe and everything...
--
Blog title (C) Orbital
Rate this Entry

How does I/O error for HDD read looks like?

Posted 05-20-2011 at 04:01 AM by Web31337

A bit of scary story with happy ending. Below I describe how failing HDD may look in your syslogs, and how I "resolved" this issue.
NOTICE: when you see I/O errors anywhere, or you see something like this in syslog YOUR FIRST TASK IS TO MAKE BACKUPS AS FAST AS POSSIBLE!!! ONLY WHEN YOU ARE SURE YOUR DATA IS NOW BACKED UP AND SAFE, you may proceed with something like "fixing"!

Recently had an accident involving few bad blocks on HDD. The target was Seagate 5900 rpm LP 2TB disk(identifier: ST32000542AS, CC34), quite new with one partition, formatted in ext2. While reading two files system hangs in iowait and then finally returns I/O error status.
Of course, I made backups to another HDD, assuming entire HDD can crash and then decided to investigate it a bit closer. Here's how it looks in syslog:
Code:
2011-05-16T22:08:37.860907+06:00 (none) kernel: [270302.230913] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
2011-05-16T22:08:37.860921+06:00 (none) kernel: [270302.230918] ata1.00: BMDMA stat 0x64
2011-05-16T22:08:37.860923+06:00 (none) kernel: [270302.230924] ata1.00: failed command: READ DMA EXT
2011-05-16T22:08:37.860926+06:00 (none) kernel: [270302.230933] ata1.00: cmd 25/00:00:18:56:16/00:02:b2:00:00/e0 tag 0 dma 262144 in
2011-05-16T22:08:37.860929+06:00 (none) kernel: [270302.230935]          res 51/40:00:42:56:16/40:00:b2:00:00/00 Emask 0x9 (media error)
2011-05-16T22:08:37.860931+06:00 (none) kernel: [270302.230940] ata1.00: status: { DRDY ERR }
2011-05-16T22:08:37.860932+06:00 (none) kernel: [270302.230943] ata1.00: error: { UNC }
2011-05-16T22:08:38.000754+06:00 (none) kernel: [270302.372899] ata1.00: configured for UDMA/133
2011-05-16T22:08:38.070857+06:00 (none) kernel: [270302.438758] ata1.01: configured for UDMA/133
2011-05-16T22:08:38.070867+06:00 (none) kernel: [270302.438870] ata1: EH complete
<some more similar message blocks>
2011-05-16T22:09:21.080543+06:00 (none) kernel: [270345.362698] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
2011-05-16T22:09:21.080557+06:00 (none) kernel: [270345.362703] ata1.00: BMDMA stat 0x64
2011-05-16T22:09:21.080560+06:00 (none) kernel: [270345.362708] ata1.00: failed command: READ DMA EXT
2011-05-16T22:09:21.080562+06:00 (none) kernel: [270345.362717] ata1.00: cmd 25/00:08:40:56:16/00:00:b2:00:00/e0 tag 0 dma 4096 in
2011-05-16T22:09:21.080565+06:00 (none) kernel: [270345.362720]          res 51/40:00:42:56:16/40:00:b2:00:00/00 Emask 0x9 (media error)
2011-05-16T22:09:21.080567+06:00 (none) kernel: [270345.362724] ata1.00: status: { DRDY ERR }
2011-05-16T22:09:21.080569+06:00 (none) kernel: [270345.362728] ata1.00: error: { UNC }
2011-05-16T22:09:21.280463+06:00 (none) kernel: [270345.564456] ata1.00: configured for UDMA/133
2011-05-16T22:09:21.350489+06:00 (none) kernel: [270345.629995] ata1.01: configured for UDMA/133
2011-05-16T22:09:21.350500+06:00 (none) kernel: [270345.630014] sd 0:0:0:0: [sda] Unhandled sense code
2011-05-16T22:09:21.350503+06:00 (none) kernel: [270345.630018] sd 0:0:0:0: [sda]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
2011-05-16T22:09:21.350506+06:00 (none) kernel: [270345.630024] sd 0:0:0:0: [sda]  Sense Key : Medium Error [current] [descriptor]
2011-05-16T22:09:21.350509+06:00 (none) kernel: [270345.630031] Descriptor sense data with sense descriptors (in hex):
2011-05-16T22:09:21.350511+06:00 (none) kernel: [270345.630035]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
2011-05-16T22:09:21.350513+06:00 (none) kernel: [270345.630049]         b2 16 56 42 
2011-05-16T22:09:21.350516+06:00 (none) kernel: [270345.630055] sd 0:0:0:0: [sda]  Add. Sense: Unrecovered read error - auto reallocate failed
2011-05-16T22:09:21.350520+06:00 (none) kernel: [270345.630062] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 b2 16 56 40 00 00 08 00
2011-05-16T22:09:21.350523+06:00 (none) kernel: [270345.630075] end_request: I/O error, dev sda, sector 2987808322
2011-05-16T22:09:21.350525+06:00 (none) kernel: [270345.630087] ata1: EH complete
Actually, I had no much choice, since HDD is 2TB size and few bad blocks for such length can be an acceptable issue, which may be resolved by blacklisting failing sections.
So, from log above we can see a failing sector number. The funny thing is fsck unable to see those bad blocks, so we'll be using "badblocks" util. Let's calculate block numbers for it. As I had two files, I had messages about two sectors. Usually block size is 4096 but may vary, to be sure, check yours with:
Code:
dumpe2fs -h /path/to/block/device
Here are my blocks...
Code:
bad sectors

2987808322
	= real(4096) 373476040
	//assumed    373470000

2994276891
	= real(4096) 374284611
	//assumed    374285000
Next step: find out all bad blocks starting from these. You can choose full check with badblocks util, but that may take many hours. I am lazy guy, not really caring for that data since I have another copy, so I just calculated those above and issued the following command:
Code:
badblocks -b 4096 -o bad_sda1 -s /dev/sda1 374285000 373470000
After few minutes of iowait I had four bad blocks in bad_sda1 output file.
Okay, let's move on with marking them:
Code:
# fsck.ext2 -l bad_sda1 /dev/sda1
e2fsck 1.41.12 (17-May-2010)
anime: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes

Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 85729291: 373475760 373475784
Multiply-claimed block(s) in inode 85729293: 374284352 374284355
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks
(There are 2 inodes containing multiply-claimed blocks.)

File /anime/Princess_Mononoke/VIDEO_TS/VTS_01_4.VOB (inode #85729291, mod time Sat May 14 11:12:32 2011) 
  has 2 multiply-claimed block(s), shared with 1 file(s):
	<The bad blocks inode> (inode #1, mod time Wed May 18 20:29:55 2011)
Clone multiply-claimed blocks<y>? yes

Error reading block 373475784 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

File /anime/Princess_Mononoke/VIDEO_TS/VTS_01_6.VOB (inode #85729293, mod time Sat May 14 12:14:16 2011) 
  has 2 multiply-claimed block(s), shared with 1 file(s):
	<The bad blocks inode> (inode #1, mod time Wed May 18 20:29:55 2011)
Clone multiply-claimed blocks<y>? yes

Error reading block 374284355 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  +373476016 +373476040 +374284608 +374284611
Fix<y>? yes

Free blocks count wrong for group #0 (20983, counted=20979).
Fix<y>? yes

Free blocks count wrong (213400489, counted=213400485).
Fix<y>? yes


anime: ***** FILE SYSTEM WAS MODIFIED *****
anime: 46969/122101760 files (4.7% non-contiguous), 274977905/488378390 blocks
After doing that kernel will no longer access those blocks, and also, I am able to operate with broken files(as they now moved from bad blocks).
Though, you should not take these habits from me: I was too lazy to give HDD back to the service(well, not because of that, more because of there was also some data on this HDD I was not sure I can erase completely, using "shred" util and because of knowing the expected answer(after months of waiting): "your HDD is fine and working good, no repair required") and too lazy to mark more blocks, just in case.
Currently this HDD works just fine. Could be only those blocks are really bad. If some more will appear, I'll just use another one. By the way, this seagate HDD is quite silent and ideal for storing movies/static data which doesn't require fast operation, which was my case. This post doesn't mark that model is bad: for instance, I have one more same HDD, not having any bad blocks at all(did a full check with badblocks).
Posted in hardware
Views 4230 Comments 0
« Prev     Main     Next »
Total Comments 0

Comments

 

  



All times are GMT -5. The time now is 07:35 PM.

Main Menu
Advertisement

My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration