HDD error in CentOS

c0debl0ck · 10-05-2014, 02:07 PM

Hi,
I am getting the following errors in /var/log/messages.

Oct 6 00:51:55 mail kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 6 00:51:55 mail kernel: ata8.00: irq_stat 0x40000001
Oct 6 00:51:55 mail kernel: ata8.00: failed command: READ DMA EXT
Oct 6 00:51:55 mail kernel: ata8.00: cmd 25/00:08:f8:43:3f/00:00:5e:00:00/e0 tag 0 dma 4096 in
Oct 6 00:51:55 mail kernel: res 51/40:08:f8:43:3f/00:00:5e:00:00/0e Emask 0x9 (media error)
Oct 6 00:51:55 mail kernel: ata8.00: status: { DRDY ERR }
Oct 6 00:51:55 mail kernel: ata8.00: error: { UNC }
Oct 6 00:51:55 mail kernel: ata8.00: configured for UDMA/33
Oct 6 00:51:55 mail kernel: sd 7:0:0:0: [sdd] Unhandled sense code
Oct 6 00:51:55 mail kernel: sd 7:0:0:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 6 00:51:55 mail kernel: sd 7:0:0:0: [sdd] Sense Key : Medium Error [current] [descriptor]
Oct 6 00:51:55 mail kernel: Descriptor sense data with sense descriptors (in hex):
Oct 6 00:51:55 mail kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Oct 6 00:51:55 mail kernel: 5e 3f 43 f8
Oct 6 00:51:55 mail kernel: sd 7:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
Oct 6 00:51:55 mail kernel: sd 7:0:0:0: [sdd] CDB: Read(10): 28 00 5e 3f 43 f8 00 00 08 00
Oct 6 00:51:55 mail kernel: end_request: I/O error, dev sdd, sector 1581204472
Oct 6 00:51:55 mail kernel: ata8: EH complete
Oct 6 00:51:58 mail kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 6 00:51:58 mail kernel: ata8.00: irq_stat 0x40000001
Oct 6 00:51:58 mail kernel: ata8.00: failed command: READ DMA EXT
Oct 6 00:51:58 mail kernel: ata8.00: cmd 25/00:08:f8:43:3f/00:00:5e:00:00/e0 tag 0 dma 4096 in
Oct 6 00:51:58 mail kernel: res 51/40:08:f8:43:3f/00:00:5e:00:00/0e Emask 0x9 (media error)
Oct 6 00:51:58 mail kernel: ata8.00: status: { DRDY ERR }
Oct 6 00:51:58 mail kernel: ata8.00: error: { UNC }
Oct 6 00:51:58 mail kernel: ata8.00: configured for UDMA/33
Oct 6 00:51:58 mail kernel: ata8: EH complete

Can anyone help me resolve this, please?

EDDY1 · 10-05-2014, 04:01 PM

What type of drive is sdd? Either it is failing, or you need to change the cable or ribbon. If it is an HDD, back up any data you can.

metaschima · 10-05-2014, 04:50 PM

Back up your data, run a SMART long test using 'smartctl -t long /dev/sdd', wait for it to finish, and post the output of 'smartctl -a /dev/sdd'.

Ratamahatta · 10-05-2014, 10:07 PM

This is quite difficult to diagnose with the little information we've got so far. The problem is that messages like these can have four causes (as far as I've found):
1. A dying drive. (That's why metaschima's first advice was to back up your data.)
2. Faulty connection to the device. (That's why EDDY1 suggested replacing the cable/ribbon. Sometimes it also helps to unplug the cable (both ends), clean it and then re-insert it. I've had that with RAM modules a few times.)
3. Western Digital's infamous faulty NCQ implementation. (May not be it, as I can see no "lost interrupt state" messages. See http://iwtf.net/2011/05/19/western-d...s-under-linux/ for more info. To fix it, set the queue size to 1 (one of the links off the first one describes how).)
4. Interrupt conflict. (That is, some other device uses the same interrupt vector as your drive. There are no "lost interrupt state" entries in your excerpt, so that might not be it. If it is that and your HD controller (hardware) is the one on the motherboard, you'll probably just have to live with it, as the OS should make sure no data is lost. If it's a PCI card, unplug it and insert it into another slot.)

We'll know more after you have posted the smartctl test results (the tests may take some hours).
In the meantime, tell us a few more details:
Does the system "hang" when those errors occur?
How old is the hard drive and are there any "strange" (scratching) noises?
How old is the motherboard, and which CPU do you have? (I.e., does your mainboard support the drive at all?)
Did this start after you dropped the PC/laptop/drive?
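
As a rough aid to telling the four causes listed above apart, here is a minimal Python sketch that counts which kind of marker shows up in a set of kernel log lines. The marker-to-cause mapping is an assumption made for illustration (it is not an official libata reference), and `classify` and `HINTS` are hypothetical names. The sample lines are copied from the first post.

```python
# Marker substrings and the cause each one hints at. This mapping is an
# illustrative assumption drawn from this thread, not an official reference.
HINTS = [
    ("media error", "drive media"),
    ("UNC", "drive media"),
    ("auto reallocate failed", "drive media"),
    ("ICRC", "cable or link"),
    ("lost interrupt", "NCQ or interrupt"),
]


def classify(lines):
    """Count how many log lines hint at each cause (first matching marker wins)."""
    counts = {}
    for line in lines:
        for marker, cause in HINTS:
            if marker in line:
                counts[cause] = counts.get(cause, 0) + 1
                break  # one cause per line is enough for a rough tally
    return counts


sample = [
    "Oct 6 00:51:55 mail kernel: res 51/40:08:f8:43:3f/00:00:5e:00:00/0e Emask 0x9 (media error)",
    "Oct 6 00:51:55 mail kernel: ata8.00: error: { UNC }",
    "Oct 6 00:51:55 mail kernel: sd 7:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed",
    "Oct 6 00:51:55 mail kernel: ata8.00: configured for UDMA/33",
]

print(classify(sample))
```

On the excerpt above, every matching line points at the drive media and none at the cable or at interrupts, which fits cause 1 best. Treat this only as a hint; the SMART data is the better evidence.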

c0debl0ck · 10-08-2014, 12:13 AM

Dear EDDY1,
It is a SATA HDD.

Dear metaschima & Ratamahatta,
I am having trouble backing up the data; it is taking too long.

I tried "smartctl -a /dev/sdd"

and got the result below:

smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: Hitachi HDS721010DLE630
Serial Number: MSE5215V1VVL4U
Firmware Version: MS2OA600
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Oct 8 11:02:21 2014 BDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command
from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 120) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (7895) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 132) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 093 093 016 Pre-fail Always - 1048594
2 Throughput_Performance 0x0005 138 138 054 Pre-fail Offline - 85
3 Spin_Up_Time 0x0007 122 122 024 Pre-fail Always - 185 (Average 189)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 33
5 Reallocated_Sector_Ct 0x0033 001 001 005 Pre-fail Always FAILING_NOW 1958
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 113 113 020 Pre-fail Offline - 35
9 Power_On_Hours 0x0012 098 098 000 Old_age Always - 19446
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 33
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 91
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 91
194 Temperature_Celsius 0x0002 150 150 000 Old_age Always - 40 (Lifetime Min/Max 20/57)
196 Reallocated_Event_Count 0x0032 001 001 000 Old_age Always - 2133
197 Current_Pending_Sector 0x0022 065 065 000 Old_age Always - 808
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

SMART Error Log Version: 1
ATA Error Count: 1587 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1587 occurred at disk power-on lifetime: 19388 hours (807 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 d0 83 3f 0e Error: UNC 8 sectors at LBA = 0x0e3f83d0 = 239043536

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 d0 83 3f e0 00 23:43:12.669 READ DMA EXT
ef 10 02 00 00 00 a0 00 23:43:12.668 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 23:43:12.668 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 23:43:12.666 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 23:43:12.666 SET FEATURES [Set transfer mode]

Error 1586 occurred at disk power-on lifetime: 19388 hours (807 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 d0 83 3f 0e Error: UNC 8 sectors at LBA = 0x0e3f83d0 = 239043536

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 d0 83 3f e0 00 23:43:09.504 READ DMA EXT
35 00 08 c8 73 14 e0 00 23:43:09.488 WRITE DMA EXT
35 00 30 98 73 14 e0 00 23:43:09.470 WRITE DMA EXT
ca 00 18 70 52 94 ee 00 23:43:09.468 WRITE DMA
ca 00 80 00 5b 94 ee 00 23:43:09.460 WRITE DMA

Error 1585 occurred at disk power-on lifetime: 19388 hours (807 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 b0 7c 3f 0e Error: UNC 8 sectors at LBA = 0x0e3f7cb0 = 239041712

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 b0 7c 3f e0 00 23:36:04.119 READ DMA EXT
ef 10 02 00 00 00 a0 00 23:36:04.119 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 23:36:04.119 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 23:36:04.117 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 23:36:04.117 SET FEATURES [Set transfer mode]

Error 1584 occurred at disk power-on lifetime: 19388 hours (807 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 b0 7c 3f 0e Error: UNC 8 sectors at LBA = 0x0e3f7cb0 = 239041712

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 b0 7c 3f e0 00 23:36:00.940 READ DMA EXT
ef 10 02 00 00 00 a0 00 23:36:00.940 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 23:36:00.940 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 23:36:00.938 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 23:36:00.938 SET FEATURES [Set transfer mode]

Error 1583 occurred at disk power-on lifetime: 19388 hours (807 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 b0 7c 3f 0e Error: UNC 8 sectors at LBA = 0x0e3f7cb0 = 239041712

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 b0 7c 3f e0 00 23:35:57.761 READ DMA EXT
ef 10 02 00 00 00 a0 00 23:35:57.761 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 23:35:57.761 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 23:35:57.759 IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00 23:35:57.759 SET FEATURES [Set transfer mode]

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 80% 19389 248505216

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Sometimes the system hangs and then recovers automatically; no restart is needed.

Ratamahatta · 10-15-2014, 12:20 PM

Quote:

Originally Posted by c0debl0ck

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 093 093 016 Pre-fail Always - 1048594
...
5 Reallocated_Sector_Ct 0x0033 001 001 005 Pre-fail Always FAILING_NOW 1958
...
196 Reallocated_Event_Count 0x0032 001 001 000 Old_age Always - 2133
...
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 80% 19389 248505216

I may be totally wrong, as the formatting got lost when pasting and I don't work for Hitachi, but those lines look bad to me. (As in: replace the drive.)

Did you remember to back up as suggested previously? (That's always the safest thing to do.)

To me, the 80% figure may also indicate that the test was still running.

If you want to be sure about any of those lines, get in touch with your vendor (Hitachi) as those values/test results are specific to their hardware.

metaschima · 10-15-2014, 06:50 PM

Yup, it is failing. Back up your data now, or if you have a lot of data to back up, you could use ddrescue to image the drive or partition to a larger drive and then carve data from that image.
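
For the ddrescue imaging mentioned above, a common pattern (assuming GNU ddrescue) is two passes that share one mapfile: a quick first pass that skips the slow scraping of bad areas, then a second pass that retries them with direct disc access. The Python sketch below only builds and prints the two command lines; nothing is executed. The function name `ddrescue_passes` and the image and mapfile paths are hypothetical placeholders, and the destination must be on a different, healthy drive.

```python
import shlex


def ddrescue_passes(source, image, mapfile, retries=3):
    """Return argv lists for a two-pass GNU ddrescue run sharing one mapfile."""
    # Pass 1: -n skips the scraping phase, so the easy data is copied first.
    first = ["ddrescue", "-n", source, image, mapfile]
    # Pass 2: -d uses direct disc access, -rN retries the bad areas N times.
    second = ["ddrescue", "-d", f"-r{retries}", source, image, mapfile]
    return [first, second]


# Hypothetical destination paths on a separate, healthy drive.
for argv in ddrescue_passes("/dev/sdd", "/mnt/backup/sdd.img", "/mnt/backup/sdd.map"):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Because the mapfile records what has already been copied, the second pass only revisits the areas that failed in the first one.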

The long test ended in a read failure, meaning that there are bad blocks. A nonzero Reallocated_Sector_Ct also means bad blocks.
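
To make that reading of the attribute table concrete, here is a minimal Python sketch that parses rows of the 'smartctl -a' attribute table and picks out the worrying ones. It assumes the ten-column layout shown in this thread and that smartctl treats an attribute as failed when its normalized VALUE is at or below THRESH. The function names and the choice of IDs 5, 197 and 198 as the ones to watch are assumptions made for this example.

```python
def parse_attribute(line):
    """Parse one row of the smartctl attribute table into a dict, or None."""
    fields = line.split()
    # A data row starts with the numeric ID and has at least ten columns.
    if len(fields) < 10 or not fields[0].isdigit():
        return None
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "thresh": int(fields[5]),
        "when_failed": fields[8],
        "raw": int(fields[9]),  # first token of RAW_VALUE
    }


def failing(rows):
    """Names of attributes whose normalized value is at or below threshold."""
    return [r["name"] for r in rows if r["value"] <= r["thresh"]]


def bad_block_counts(rows, watch=(5, 197, 198)):
    """Nonzero raw counts for the attributes that track bad sectors."""
    return {r["name"]: r["raw"] for r in rows if r["id"] in watch and r["raw"] > 0}


sample = [
    "5 Reallocated_Sector_Ct 0x0033 001 001 005 Pre-fail Always FAILING_NOW 1958",
    "197 Current_Pending_Sector 0x0022 065 065 000 Old_age Always - 808",
    "199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0",
]
rows = [row for row in map(parse_attribute, sample) if row is not None]

print(failing(rows))
print(bad_block_counts(rows))
```

On these rows the sketch reports Reallocated_Sector_Ct as failing, with 1958 reallocated and 808 pending sectors. The UDMA_CRC_Error_Count of 0 argues against a cable problem.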