Is my hard drive going bad?

PhilD · 08-13-2018, 08:30 PM

I have a 2TB internal WD Blue drive that is a little over 4 years old. It is connected via SATA. It is used as a second drive and is mounted to a folder in my user directory where I store most of my media files for Plex along with a few other less frequently accessed files. Right now there is about 1.4TB of data on the drive.

Several weeks ago I updated to Linux Mint 19 and installed a new graphics card. After rebooting I was faced with the emergency console. Eventually I realized the issue was a failed fsck on the drive, so I forced a check / repair on the drive. It then booted normally.

Prior to this I had been having issues with Plex playing media. I had not connected the two issues until afterward. Since my scan, I have noticed three times now the drive access slowing to a crawl (it takes 1-3 minutes to populate the file explorer when accessing the mount). I have changed the mount point to not mount automatically so I can reboot the computer and scan the drive. Each time there were hundreds of inode issues on the single ext4 partition. Each time they are all cleared up and the drive functions correctly for several days. My lost+found is increasing, but everything seems to be working correctly.

I have tried a different SATA channel with no difference. I have to reboot to scan because when the drive starts acting up the system can't unmount it, or at least it still says it is in use even when unmounted. I have run smartctl and am not sure exactly how to read the results other than that the header says it passes and there doesn't seem to be a large number of bad sectors. I have included the latest run below as an attachment.

I don't love the idea of buying a new drive (or several - I am considering a RAID setup), but I also don't want to lose everything on this one by waiting and I can still access it for now. If it is an issue of a corrupted filesystem or something else that can be repaired I would like to give that a shot before replacing hardware. Of course, if it is something more serious like my SATA support on my motherboard... well I may just have to bury my head in the sand. :-)

Thank you for any help or suggestions!!

syg00 · 08-13-2018, 08:44 PM

You have backups (plural)? And no, even if you RAID it, that doesn't serve as backup.

Quote:

... and there doesn't seem to be a large number of bad sectors.

I don't have a good grasp of what is an acceptable number, but something approaching zero is where I'd be. Same for those read errors.

Quote:

My lost+found is increasing, but everything seems to be working correctly.

Anything in there is (usually) a portion of a file that has been truncated. I never see that as a good thing.

Get your data backed up, and get a new disk. The cost is trivial compared to bemoaning lost data.

frankbell · 08-13-2018, 08:54 PM

Have you run fsck on it and scanned it with smartmon tools?

I second syg00's advice. Back up any crucial data to external media as soon as you can. Four years is a bit early for an HDD to go bad, but stuff does wear out.

PhilD · 08-13-2018, 09:15 PM

The "Must Have" data is backed up off site. Much of the media content I don't have backed up, but I can recreate it if needed. I have run fsck, hence the inode issues. I have also run smartmontools - those results were attached to my post. I agree that drives are much cheaper than they used to be. I guess my biggest concern is verifying that it is in fact a drive issue. The behavior sure looks like it; I am just missing data from the diagnostic tools that corroborates that.

Yes, I know RAID is not a back up.

Also, my understanding is that lost+found holds pieces of files whose inode reference was missing. They could be left over from relocating parts of files due to a read or write error, so the actual files may still be fully intact. Also, the drive, I believe, is showing zero permanently bad sectors, as SMART allows for relocation and then a second attempt to write before marking a sector permanently bad. That is why I was a little concerned it could be the SATA controller. Am I missing something in how this works or in the smartctl report?

Thanks again for the help and responses!!

frankbell · 08-13-2018, 09:52 PM

The classic troubleshooting move is to replace the suspect part with a known-good part, but I must concede that doing so would be a bit difficult as regards the lone SATA controller in a computer.

If the other SATA drives are performing properly, that would indicate to me that the SATA controller is probably okay, but it might be worth while to shut the machine down, then remove and reseat the misbehaving drive. Perhaps a contact has gotten fouled.

I know it's a long shot, but sometimes even long shots pay off.

PhilD · 08-13-2018, 09:59 PM

Quote:

Originally Posted by frankbell

The classic troubleshooting move is to replace the suspect part with a known-good part, but I must concede that doing so would be a bit difficult as regards the lone SATA controller in a computer.

If the other SATA drives are performing properly, that would indicate to me that the SATA controller is probably okay, but it might be worth while to shut the machine down, then remove and reseat the misbehaving drive. Perhaps a contact has gotten fouled.

I know it's a long shot, but sometimes even long shots pay off.

Agreed!! I do have my primary SATA drive on channel 1 and it is working perfectly fine. I have moved the suspect drive from channel 2 to 6. From what I can tell my Mobo uses the first 4 as one possible RAID setup and the last two separately. I thought this might be indicative of different controllers. Regardless, that didn't make a difference. I don't have (or haven't found) a spare SATA cable to try. I may go dig around again to make sure there isn't one hiding somewhere. That would also be an easy test. I have reseated the cable in use.

rknichols · 08-13-2018, 10:09 PM

The Current_Pending_Sector count of 233 is bad news. Those are bad sectors that have not been reallocated and will cause an I/O error whenever the OS tries to read them. That is an uncomfortably large number and suggests that the drive is failing. It is an internal issue with the drive. Problems with the SATA controller or cable would not cause that. Note that SMART will not consider the drive to be failing until the number of bad sectors approaches the number of spare sectors, but that is far beyond the point at which most people would consider the drive to have failed.
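
For anyone else trying to read a smartctl report, here is a rough Python sketch of pulling out the three raw counters that matter for this question. It assumes the usual column layout printed by `smartctl -A`. The sample text is illustrative only: the 233 pending and zero reallocated/uncorrectable figures come from this thread, while the normalized VALUE/WORST/THRESH columns are placeholders, not PhilD's actual report.

```python
import re

# Illustrative sample in the usual `smartctl -A` layout (placeholder values
# except the raw counts, which follow the numbers discussed in this thread).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       233
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def parse_raw_values(text):
    """Return {attribute_name: raw_value} for the watched attributes."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit() test.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            match = re.match(r"\d+", fields[9])
            if match:
                values[fields[1]] = int(match.group())
    return values


if __name__ == "__main__":
    for name, raw in sorted(parse_raw_values(SAMPLE).items()):
        print(name, raw)
```

On a real system the text would come from running `smartctl -A` against the drive as root instead of from `SAMPLE`. A nonzero Current_Pending_Sector is the figure to worry about here, regardless of the overall "PASSED" line in the header.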

You could try zeroing and reformatting the drive and then restoring the data from backup. That should cause any pending sectors to be either written successfully or reallocated to spare sectors. There are events like vibration or power supply glitches that can cause a sector to appear bad when there is nothing actually wrong with the recording surface. For that to affect 233 sectors seems unlikely.

PhilD · 08-14-2018, 12:33 AM

Quote:

Originally Posted by rknichols

The Current_Pending_Sector count of 233 is bad news. Those are bad sectors that have not been reallocated and will cause an I/O error whenever the OS tries to read them. That is an uncomfortably large number and suggests that the drive is failing. It is an internal issue with the drive. Problems with the SATA controller or cable would not cause that. Note that SMART will not consider the drive to be failing until the number of bad sectors approaches the number of spare sectors, but that is far beyond the point at which most people would consider the drive to have failed.

You could try zeroing and reformatting the drive and then restoring the data from backup. That should cause any pending sectors to be either written successfully or reallocated to spare sectors. There are events like vibration or power supply glitches that can cause a sector to appear bad when there is nothing actually wrong with the recording surface. For that to affect 233 sectors seems unlikely.

That makes sense. I saw the sector count, and it tends to fluctuate. Once, after I had completed a fsck, it was only 58. It was the fact that the sectors never became reallocated or uncorrectable that threw me off. I was wondering if something external was causing read / write issues that were recovered from when performing the fsck. From your explanation, though, it seems the drive is definitely having issues, but it also seems like SMART is less helpful than I was expecting it to be.

syg00 · 08-14-2018, 01:02 AM

SMART can only report what the drive hardware is telling it. Some (all?) manufacturers are happy to obfuscate the numbers and/or the meaning of the attributes.
The drive controller manages reallocation, not SMART.

You also appear to have a somewhat sanguine view of fsck and what may be dropped into lost+found. Cross-linked inodes result in one of the files being truncated - if you have been getting them, how do you know all your files are still valid? How many "broken" files have made their way into your backups?
If I get major fsck errors, I reformat and restore immediately - no questions asked. This usually only happens in storms - I have a UPS but tropical storms occasionally create spikes that still get through.
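
One way to put a number on the "are my files still valid" question is to compare checksums between the suspect drive and a backup copy. The following is only a sketch: `compare_trees`, `live_root` and `backup_root` are names made up for illustration, and it assumes the backup mirrors the same relative paths. A mismatch only says the two copies differ, not which one is the damaged one.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(live_root, backup_root):
    """Compare every file under live_root with the same relative path in backup_root."""
    report = {"match": [], "differ": [], "missing_in_backup": [], "unreadable": []}
    for dirpath, _dirnames, filenames in os.walk(live_root):
        for name in filenames:
            live_path = os.path.join(dirpath, name)
            rel = os.path.relpath(live_path, live_root)
            backup_path = os.path.join(backup_root, rel)
            if not os.path.isfile(backup_path):
                report["missing_in_backup"].append(rel)
                continue
            try:
                same = sha256_of(live_path) == sha256_of(backup_path)
            except OSError:
                # A read error on either copy (for example a pending sector
                # on the suspect drive) lands the file in this bucket.
                report["unreadable"].append(rel)
                continue
            report["match" if same else "differ"].append(rel)
    return report
```

Files that end up under "differ" or "unreadable" are the ones to inspect by hand before trusting either copy.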

rknichols · 08-14-2018, 09:16 AM

If that Current_Pending_Sector count fluctuates without any sectors being reallocated, that suggests that something is interfering with the drive's write operations. For external causes, the most likely suspect is vibration. What else is in the housing with that drive? Drives can be quite sensitive to vibration. I saw one video showing that simply shouting loudly at a rack of disk drives significantly impaired throughput of read operations. I can only imagine how that would affect writes. I personally had experience with a cartridge tape drive with a bad roller causing a disk drive in the same housing to become almost unreadable while that tape drive was in use.
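
To make that distinction concrete, here is a small Python sketch that compares two SMART readings and reports which case they fit. The `Snapshot` class and `describe_change` function are hypothetical names. The counts 58 and 233 are the two pending figures mentioned in this thread, the reallocated count of 0 follows PhilD's description, and the order of the two readings is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    label: str
    pending: int      # Current_Pending_Sector raw value
    reallocated: int  # Reallocated_Sector_Ct raw value


def describe_change(before, after):
    """Classify what happened between two readings of the same drive."""
    pending_delta = after.pending - before.pending
    realloc_delta = after.reallocated - before.reallocated
    if realloc_delta > 0:
        return "sectors were remapped to spares"
    if pending_delta < 0:
        return "pending sectors cleared without remapping"
    if pending_delta > 0:
        return "new unreadable sectors appeared"
    return "no change"


if __name__ == "__main__":
    # Assumed ordering: 58 after one fsck run, 233 in the latest report.
    history = [
        Snapshot("after one fsck", 58, 0),
        Snapshot("latest report", 233, 0),
    ]
    for before, after in zip(history, history[1:]):
        print(f"{before.label} -> {after.label}: {describe_change(before, after)}")
```

Pending sectors that clear without remapping mean the rewrites succeeded, which fits an external cause such as vibration. Sectors that get remapped to spares point at the recording surface itself.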