Help needed debugging disk errors
Hi,
Slackware 14.2, kernel 4.4.29, LVM2-based disk management. How should I start debugging disk-related errors? Below is an example of the problem. It happened after: 1) removepkg kernel-modules 2) installpkg kernel-modules 3) sync 4) echo 3 > /proc/sys/vm/drop_caches. I also had a similar situation when rsync-ing a Slackware mirror. Any help appreciated! Code:
[213373.488513] sd 0:0:0:0: [sda] tag#9 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06 Best regards, Andrzej Telszewski |
Perhaps get a new disk and restore from backup? Is this a conventional, rotating disk or an SSD? It is awfully unusual for writes to fail on a rotating disk unless the drive has exhausted its supply of spare sectors. For an SSD, becoming read-only is a common way for the drive to protect your data when the drive starts to fail.
What is the output from "smartctl -A /dev/sda"? (Please wrap it in [CODE]...[/CODE] tags to preserve formatting.) |
Quote:
So, before any reboots on a suspected failing SSD, it is best to ensure that any necessary data is backed up. |
Hi,
The machine is an online.net server, running for just a couple of months (maybe 3). The disk is (or at least should be) fairly new and not used much. It's an HDD. It looks like the problem starts when there is more write I/O, e.g. rsync, but I was rsync-ing before without problems. The machine has 16GB of RAM and it's almost always fully used by buff/cache, if that matters. Maybe it's only my bad feeling, but the shell prompt feels kind of sluggish, e.g. "free -m" sometimes takes seconds to execute. It used to be fast in the past. BTW, I had to run fsck to be able to boot the system at all; there were those strange (TM) messages about orphaned inodes, something about a bitmap, etc. Should I consider re-installing the filesystem/OS, e.g. could some file permissions be broken or some config file corrupted? Re-installing is not a big deal, just time consuming. BTW2, if I recall correctly, SMART was disabled on the disk and I enabled it with some smartctl switch a few days ago, if that matters. Unfortunately I don't remember exactly what it was and .bash_history doesn't have it. Code:
$ smartctl -A /dev/sda Best regards, Andrzej Telszewski |
The filesystem errors are probably a result of the disk starting to fail. The attribute 196 (Reallocated_Event_Count) events aren't good; the 197 (Current_Pending_Sector) events are bad. See Wikipedia's S.M.A.R.T. article for a description.
Me, I'd get a new disk - doesn't matter how old the current one is. |
Code:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector  0x0022  100   100   000    Old_age  Always       -       16 |
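The attributes worth watching (IDs 5, 196, 197, 198) can be pulled out of `smartctl -A` output with a small filter. This is only a sketch: the `smart_watch` name is made up for illustration, and the device path in the usage comment is an example.

```shell
# Hypothetical helper: keep only the reallocation-related attribute
# rows (IDs 5 and 196-198) and print the ID, name, and raw value.
smart_watch() {
    awk '$1 == 5 || ($1 >= 196 && $1 <= 198) { print $1, $2, $NF }'
}

# On a live system (device path is an example):
#   smartctl -A /dev/sda | smart_watch
```

Run it periodically and compare the raw values; if they keep climbing, the disk is deteriorating.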
I have to agree that the disk doesn't look good. If those bad sectors were in some laptop that was getting bounced around while it was running I wouldn't get too excited about them, but it's not something you want to see in a server, especially one that's been running only ~118 days. Disks are cheap. Get rid of it.
|
The drive encountered write errors on certain sectors of the disk. The sector numbers are given in the log output, e.g. 9096736, 4272896, 8955624, 8955368, etc.
First thing to do is make a backup. Do it now. The drive could fail catastrophically. However, sometimes the number of "bad sectors" will stabilise, and you might get useful life out of the disk in the future. It all depends on whether the values in the critical SMART parameters above keep increasing. Usually a disk will reallocate (remap) bad sectors on a write, but for a drive with 4096 byte physical sectors (a "4K drive"), it can only do this if you re-write the entire 4k sector. Most operating systems use 512-byte logical sectors, so there are 8 logical sectors per 4K drive physical sector. See whether you have a "4k" drive by running "smartctl -a /dev/sda", and looking for this: Quote:
Now you can use: Quote:
If you're feeling lucky, you can then use: Quote:
Repeat for all the identified bad sectors... The problem now is you have a filesystem with some random "holes" in it. These could be within user files (in which case the file will now be corrupted), or filesystem meta-data, or just in unallocated space (if you're lucky!). It's possible to perform filesystem "forensics" to determine what was allocated to those sectors, but it's not trivial. For now, just force fsck the filesystem and allow it to repair any structural errors. Keep all the logs, as they may be useful for forensics in the future. Keep an eye on the SMART stats to see if any of items #5 & #196-198 continue to increase. If they do, it's probably time to replace the drive. FWIW, on our storage platform we normally replace drives if they exceed 600 bad sectors. However, the storage is mirrored (RAID1) so there are two copies of everything enabling lost sectors to be restored from the mirrored disk. Regards, Marc |
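To make the 4K remapping point above concrete, here is a minimal sketch of the alignment arithmetic, using sector 8955368 from the log above; the starting LBA inside that physical sector is a hypothetical example.

```shell
# With 512-byte logical sectors, 8 logical sectors share one 4K
# physical sector; aligning down gives the first LBA of that sector.
lba=8955371                  # hypothetical LBA somewhere in a bad 4K sector
start=$(( (lba / 8) * 8 ))   # aligns down to 8955368, the logged sector
echo "rewrite LBAs $start-$(( start + 7 )) to remap the whole physical sector"
```

All 8 logical sectors in that range must be rewritten in one go for the drive to remap the underlying physical sector.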
Hi,
Thank you all for your replies. I reported the problem to online.net support and after investigation they replaced the disk*. But I don't mind if you keep posting interesting stuff here. I did have a look at SMART, but to be honest, it is really hard to make sense of it if you don't have experience. Even the replies to this thread aren't straightforward (i.e. replace immediately vs might be recoverable). *) I can access my old server in rescue mode to recover data, and there is already second server machine waiting to switch to. -- Best regards, Andrzej Telszewski |
Quote:
|
Quote:
Come to think of it, how did that inode ever get into memory if the sector containing it is bad? |
There are hundreds of thousands of replacement sectors available. Yours are used up; get a new disk.
|
Quote:
Code:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE |
This is a good question. I just had a disk failure with 0 reallocated sectors, yet the disk had plenty of bad sectors. Maybe SMART is not that smart after all.
|
Hi,
An update :-^ Code:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE Best regards, Andrzej Telszewski |
Run
Code:
# smartctl -t long /dev/sda For example, the long test on my 3TB disk takes about 5 hours. After the test finishes, run Code:
smartctl -a /dev/sda |
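While the long test is running, `smartctl -a` prints a "% of test remaining" figure; a small helper can extract it so you know when to collect the results. A sketch, assuming GNU grep; the `test_remaining` name and the device path are made up for illustration.

```shell
# Hypothetical helper: pull the progress figure out of `smartctl -a`
# output while the long self-test is still in progress.
test_remaining() {
    grep -o '[0-9]\{1,3\}% of test remaining' | head -n 1
}

# On a live system (device path is an example):
#   smartctl -a /dev/sda | test_remaining
```

When the helper prints nothing, the test has finished and the self-test log at the bottom of `smartctl -a` holds the result.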
That increase in the pending sector count doesn't necessarily mean that anything changed. A bad sector won't be discovered and marked "pending" until something tries to read it.
I have to wonder, though, whether something might have turned off the drive's automatic defect management. That would explain the write error on the bad sector. I thought that modern drives no longer had the ability to turn that off, but perhaps yours is one of the exceptions. See the paragraph for the "-D" option in the hdparm manpage. |
Quote:
|
Hi,
For all of you SMART people (no pun intended :-)), after smartctl -t long (yes, I waited for the requested time before using -a switch): Code:
$ smartctl -a /dev/sda Best regards, Andrzej Telszewski |
Code:
Num Test_Description   Status                   Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline   Completed: read failure      90%        2879           9548728 |
Quote:
If you really want to find out how many bad sectors there are, run Code:
dd if=/dev/sda of=/dev/null bs=4k conv=noerror |
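As a sanity check on a dd pass like the one above, the expected record count follows directly from the disk size. A sketch with an example 1 TB size; on a real system the byte count would come from `blockdev --getsize64 /dev/sda`.

```shell
# Example arithmetic: how many full 4k records dd should report for a
# clean pass. Noticeably fewer "records in" suggests skipped blocks.
size=1000204886016   # bytes; example figure for a 1 TB drive
bs=4096
echo "expect $(( size / bs )) full records and $(( size % bs )) trailing bytes"
```

Also watch dd's stderr: with conv=noerror each failed read is logged, so the error messages themselves give a count of unreadable spots.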
Hi,
Just a side question. Would it be wise to go with 2 SSDs in a RAID-1 configuration? That's probably something I could afford, money-wise. Please note that it's my favorite toy machine. I want it to be the best possible, within a sensible budget. Loss of data wouldn't cause major injuries, and there are backups too. It just feels better with the uptime ticking up continuously :-) -- Best regards, Andrzej Telszewski |
RAID-1 is for read speed. No redundancy really. Isn't SSD already fast enough for you?
|
Hi,
Quote:
Aren't there two copies? -- Best regards, Andrzej Telszewski |
Two copies, yes. But if one copy gets corrupted, the corruption is mirrored and the other copy gets corrupted too. Only if one drive dies outright will the other still have the data intact.
|
Hi,
Quote:
So I would need something with error correction. I'm gonna have a look at the possibilities, but most probably I'll give up on the idea. Thanks. -- Best regards, Andrzej Telszewski |
RAID-1 will protect against data loss due to a drive failure. That is one cause of data loss. There is no form of RAID that protects against the other causes of data loss, such as accidental deletion, overwriting, OS failures that corrupt the filesystem, etc. RAID is not a substitute for backups. And of course RAID adds its own complexity and modes of failure to the mix. Its primary function is to allow a system to keep running seamlessly while a failed drive is replaced. If that is important vs. the hours of down time while a failed drive is replaced and restored from backup, then you need RAID. Otherwise, not so much, aside from the bragging rights about your continuous uptime (assuming that your drives are hot-swappable -- which they probably are not).
|
Hi,
There was no possibility to upgrade the hardware of this server, so I changed to one of the same class, with a 250GB SSD. Two fewer moving parts to wear out ;-) -- Best regards, Andrzej Telszewski |