LinuxQuestions.org - Offline uncorrectable sectors

Offline uncorrectable sectors

I'm running a CentOS 5.9 server on an em350 netbook, and on startup I get a warning:
Device: /dev/sda [SAT], 1 Offline uncorrectable sectors
Is there any way to fix this? The machine is command-line only (except for Webmin, which is installed).

I'd suggest that you boot to the OEM hard drive diags first. Then decide which way to go.

It could be any number of issues but most likely some disk problem.

The fix is not really reliable. Any time you have data errors, there is no way to trust the rest of the data. You'd have to compare a backup against the current data, or restore from the last known good backup.

If you have sensitive data on it, or it's destined for something important, replace the disk.

You can do the following: install smartmontools (the package that provides smartctl) and run:

Code:

smartctl --attributes /dev/sda

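Check that the raw values of Reallocated_Sector_Ct, Current_Pending_Sector and Offline_Uncorrectable are 0. If any of them is not, the drive has found sectors it could not read or has had to remap, and you can expect problems sooner or later.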
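
To save reading the whole table by eye, here is a minimal sketch of a filter for those three attributes. The function name smart_nonzero is made up for this example, and it assumes the usual smartctl attribute layout in which RAW_VALUE starts in the tenth column.

```shell
# smart_nonzero: read `smartctl --attributes` output on stdin and print
# NAME=RAW for each of the three attributes whose raw value is not 0.
# Hypothetical helper; assumes RAW_VALUE is the 10th whitespace-separated field.
smart_nonzero() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 != 0 {
        print $2 "=" $10
    }'
}
```

Use it as `smartctl --attributes /dev/sda | smart_nonzero`. It prints nothing when all three raw values are 0.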
There are temporary fixes, such as non-destructively rewriting the whole disk a few times. I have had bad sectors go away like that, but sometimes they came back. I used these as a base:

http://www.sjvs.nl/forcing-a-hard-di...e-bad-sectors/

http://www.cyberciti.biz/faq/recover...ted-partition/

http://www.howtogeek.com/howto/37659...isk-utilities/

In particular, the (***destructive!!!***) write-sector approach worked every time the drive wasn't done for good. Be aware that it will damage the file system to some extent (I was lucky, but you might lose data, so do a full backup of everything on the disk first!!!).
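
For reference, a rough sketch of what such a sector rewrite could look like. It assumes the "write-sector" meant here is hdparm's --write-sector option (the tool is not named above), that hdparm exits non-zero when a read fails, and that you are root. The function name rewrite_sector is invented for this example. It overwrites a sector with zeros only if that sector can no longer be read.

```shell
# rewrite_sector DEVICE SECTOR
# DESTRUCTIVE: zero-fills one sector so the drive gets a chance to remap it.
# Hypothetical helper; assumes hdparm is installed and returns non-zero
# when --read-sector fails.
rewrite_sector() {
    dev=$1
    sector=$2
    # Refuse anything that is not a plain non-negative integer.
    case $sector in
        ''|*[!0-9]*) echo "bad sector number: $sector" >&2; return 2 ;;
    esac
    # If the sector still reads fine, leave it alone.
    if hdparm --read-sector "$sector" "$dev" >/dev/null 2>&1; then
        echo "sector $sector on $dev is readable, not touching it"
        return 0
    fi
    # hdparm requires this extra flag before it will write to the disk.
    hdparm --write-sector "$sector" --yes-i-know-what-i-am-doing "$dev"
}
```

A call would look like `rewrite_sector /dev/sda 12345`, where 12345 is a placeholder for a sector number taken from the kernel log.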

Also, the badblocks command was very useful. You can have it rewrite the whole disk non-destructively with the data that was previously on it. This sometimes makes bad sectors go away, and if it doesn't, at least you will have all the bad/unreadable sectors listed in dmesg to feed to the write-sector command.
Make sure badblocks is run offline: boot the machine from a USB drive or similar with a live image and run the operations from there.
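
Roughly, the scan and the sector list could be scripted like this. It is only a sketch: the three function names are made up, the mount check is a simple prefix match against /proc/mounts, and the log pattern assumes kernel messages of the form "I/O error, dev sda, sector N". badblocks -n selects its non-destructive read-write mode, -s shows progress and -v is verbose.

```shell
# is_mounted DEVICE [MOUNTS_FILE]
# Succeeds if DEVICE, or anything whose name starts with it (e.g. a
# partition), appears in the first column of the mounts table.
is_mounted() {
    awk -v dev="$1" 'index($1, dev) == 1 { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# failed_sectors: read kernel log lines on stdin and print the unique
# sector numbers that appear in I/O error messages, sorted numerically.
failed_sectors() {
    sed -n 's/.*I\/O error, dev [a-z0-9]*, sector \([0-9][0-9]*\).*/\1/p' | sort -nu
}

# scan_disk DEVICE: non-destructive read-write scan, refused while mounted.
scan_disk() {
    if is_mounted "$1"; then
        echo "$1 is mounted; boot from live media first" >&2
        return 1
    fi
    badblocks -nsv "$1"
}
```

From the live environment you would run `scan_disk /dev/sda` and afterwards `dmesg | failed_sectors`. Note that badblocks itself reports blocks in its own block size (1024 bytes by default), so take the sector numbers from the kernel log rather than from the badblocks output.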

I'd replace the disk if possible, but my understanding is that most disks have spare blocks and your manufacturer's tool should remap around the bad sectors. However, bad sectors seem contagious and generally mean doom is on the way, so I would look to replace the disk as soon as possible.

Thanks for your response. I ran smartctl --attributes /dev/sda, which gave me:

Quote:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   046    Pre-fail  Always       -       157694
  2 Throughput_Performance  0x0005   100   100   030    Pre-fail  Offline      -       47316992
  3 Spin_Up_Time            0x0003   100   100   025    Pre-fail  Always       -       1
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       4864
  5 Reallocated_Sector_Ct   0x0033   100   100   024    Pre-fail  Always       -       0 (2000, 0)
  7 Seek_Error_Rate         0x000f   100   100   047    Pre-fail  Always       -       4046
  8 Seek_Time_Performance   0x0005   100   100   019    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   060   060   000    Old_age   Always       -       20110
 10 Spin_Retry_Count        0x0013   100   100   020    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2716
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       167
193 Load_Cycle_Count        0x0032   085   085   000    Old_age   Always       -       319835
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       44 (Min/Max 6/60)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       914
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0 (0, 6469)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000f   100   100   060    Pre-fail  Always       -       12427
203 Run_Out_Cancel          0x0002   100   100   000    Old_age   Always       -       429581793769
240 Transfer_Error_Rate     0x003e   200   200   000    Old_age   Always       -       0

I couldn't find the badblocks command you spoke of. I was hoping there might be a tool I could run on the system that would repair this error in place. It's worth mentioning that I don't think this is a failing drive problem. I got the warning after restoring an image to the drive, and I think it appeared because the drive wasn't 'exactly' the same size as the image expected.

Any reason you don't want to try the factory diags?

Yeah, it means rebooting the server into the OEM hard drive diags, and that means server downtime = no email, website and other important functions.

Quote:

Originally Posted by tonj (Post 4976240)

Yeah, it means rebooting the server into the OEM hard drive diags, and that means server downtime = no email, website and other important functions.

Sometimes server downtime is unavoidable. Pick a time during off-peak usage and do it. First and foremost, start backing up the data now.
Either way, a stitch in time saves nine, as the saying goes. If the hard drive is failing, you should know, because how much downtime do you think a dead hard drive is going to cost you?

I understand your point about server downtime sometimes being unavoidable, but I have full image backups and, like I said in an earlier post, I don't think this is a failing drive problem. I got the warning after restoring an image to the drive, and I think it appeared because the drive wasn't 'exactly' the same size as the image expected. Plus, I tested the drive before using it and it was 100%, so for the time being I'd like to hold out for a way to fix this in place.

The catch, however, is that the kind of checks that seem to be necessary would require lower-level access to the drive than is perhaps possible while data on the drive is in use, since that would risk corrupting said data. This is the same reason you can't fsck a mounted volume: data can be corrupted if it's being changed while it's being scanned. If it were my server, I'd just bite the bullet and take it offline.

I don't think this has anything to do with restoring an image; this is low-level hard drive behavior. If the drive is hot-swappable, you can pull it and use another machine to run the diagnostics, but as I understand it the remapping is done in the hard drive's firmware.

Code:

198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 1