tons of disk errors on samsung ssd after power cycle

adrianmariano · 09-22-2016, 07:38 AM

My computer started emitting an alert tone. When I looked inside, I saw that the CPU fan had stopped, so presumably it was the CPU temperature alarm. I powered the machine down and let it cool for a few hours.

When I powered it back up, the fan appeared to be working normally (temp reported in the BIOS was 28), but my drive, a Samsung 850 Pro, had errors. Ran fsck. Rebooted. Ran a few system updates. Got errors. Then I rebooted again and fsck reported so many errors I couldn't get through it.

Code:

ata3.00: status: { DRDY ERR }
ata3.00: error: { ICRC ABRT }
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata3.00: BMDMA stat 0x26
ata3.00: cmd ca/00:08:e8:16:40/00:00:00:00:00/e7 tag 0 dma 4096 out
                res 51/84:00:ef:16:40/00:00:00:00:00/e7 Emask 0x30 (host bus error)
ata3.00: status: { DRDY ERR }
ata3.00: error: { ICRC ABRT }
blk_update_request: I/O error, dev sda, sector 121640680
Buffer I/O error on dev sda1, logical block 15204829, lost async page write
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
BMDMA stat 0x26
failed command: WRITE DMA

Above is an example of some errors that were appearing during normal operation when trying to shut down (not while fsck was running).
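
When there are too many of these messages to read by eye, it can help to tally which flags keep showing up. Below is a minimal Python sketch that does this for log text in the style shown above; the function name `tally_ata_flags` is made up for illustration, and matching is case-insensitive. As far as I know, ICRC is the ATA interface CRC error bit, which concerns the transfer between controller and drive rather than the stored data, but treat that reading as an assumption and check the libata documentation.

```python
import re
from collections import Counter

# Matches the flag lists in lines such as "ata3.00: error: { ICRC ABRT }".
FLAG_LINE = re.compile(r"ata\d+\.\d+: (status|error): \{([^}]*)\}", re.IGNORECASE)


def tally_ata_flags(log_text):
    """Count how often each (kind, flag) pair appears in kernel log text."""
    counts = Counter()
    for kind, flags in FLAG_LINE.findall(log_text):
        for flag in flags.split():
            counts[(kind.lower(), flag)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "ata3.00: status: { DRDY ERR }\n"
        "ata3.00: error: { ICRC ABRT }\n"
        "ata3.00: status: { DRDY ERR }\n"
        "ata3.00: error: { ICRC ABRT }\n"
    )
    for (kind, flag), n in sorted(tally_ata_flags(sample).items()):
        print(kind, flag, n)
```

In practice you would feed it the saved output of the kernel log instead of the inline sample.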

So what does this mean? Is the disk dead?

jefro · 09-22-2016, 02:32 PM

Not sure.

It could be a bad CPU, bad RAM, a bad drive controller, a bad drive, or almost any other part.

Remove the SSD from this system, put it in a known-good machine, and see if you can run the OEM diagnostics or SMART tools on it to start.

You can also run memtest on the suspect system.

business_kid · 09-29-2016, 04:53 AM

In that sort of situation, the most reliable sign of fatal trouble is constant change. I would expect the cpu to show issues before the ssd. But as jefro said, it could be many things.

adrianmariano · 09-29-2016, 06:46 AM

The SSD is by far the newest component on this box. I bought it in March. The CPU is from 2007, the motherboard is from 2011.

So I went ahead and got a new motherboard and CPU. I booted the new setup off an old hard drive (the one I used before I got the SSD) and ran fsck on the SSD. It fixed hundreds of errors. When I tried to boot off the SSD, it threw a bunch of errors and got stuck. (I made the mistake of running updates during this whole situation, which might explain why the OS is corrupted, if fsck only made the filesystem consistent rather than truly repairing the damage.)

So the question is: how do I know if the disk is actually bad vs. just corrupted, but still usable once repaired? I've read that SMART may not be useful for SSDs. Is that true?

Is there a way to have ubuntu fix a corrupted installation?

pan64 · 09-29-2016, 07:00 AM

Quote:

how do I know if the disk is actually bad vs. just corrupted, but still usable once repaired

Try to save its contents (with dd or similar) and run some disk checks.
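
For the "dd or similar" part, here is a rough Python sketch of a block-by-block image copy that keeps going past unreadable regions. It is only an illustration under assumptions: `image_device` is a hypothetical helper, `/dev/sdX` and the destination path in the usage comment are placeholders, and reading a real device needs root. A dedicated imaging tool is likely the better choice for a drive that is actually failing.

```python
import os


def image_device(src_path, dst_path, block_size=1024 * 1024):
    """Copy src_path to dst_path block by block.

    Blocks that raise a read error are written as zeros, and their
    starting offsets are returned so you know which regions were lost.
    """
    bad_offsets = []
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)  # total size in bytes
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            src.seek(offset)
            try:
                chunk = src.read(want)
            except OSError:
                # Unreadable region: record it and write zeros as a placeholder.
                bad_offsets.append(offset)
                chunk = b"\0" * want
            if not chunk:
                break  # unexpected end of input
            dst.write(chunk)
            offset += len(chunk)
    return bad_offsets


# Hypothetical usage (placeholder paths, run as root):
# lost = image_device("/dev/sdX", "/mnt/backup/ssd.img")
# print("unreadable block offsets:", lost)
```

An empty return list means every block was read without an I/O error, which says nothing about whether the contents were already corrupted before the copy.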

Quote:

Is there a way to have ubuntu fix a corrupted installation?

I would prefer a full reinstall, but first save your personal/important data.

adrianmariano · 09-29-2016, 07:17 AM

Of course, the first thing I did once fsck was done was copy my user files and my system configuration (/etc) to another disk. I used cp, not dd, to do this, and it happened without any errors...but I have no way of knowing if the data is corrupted or not.

What specific disk checks are meaningful for a Samsung SSD? Is SMART meaningful?

business_kid · 09-30-2016, 01:36 AM

Something that hasn't been mentioned yet is your choice of mount options. Please post them. Copy a huge file to free space and use sha1sum or some such to check how valid the copy was. Get everything valuable backed up. Strive to keep the disk cool and lightly loaded.
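
The copy-and-checksum test can be sketched in Python as follows. The helper names `sha1_of` and `copy_and_verify` and the paths in the usage comment are made up for illustration. One caveat I would hedge on: right after a copy, the read-back may be served from the page cache rather than from the SSD, so a matching checksum is weaker evidence unless the cache has been dropped or the file is much larger than RAM.

```python
import hashlib
import shutil


def sha1_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst, then report whether both files hash the same."""
    shutil.copyfile(src, dst)
    return sha1_of(src) == sha1_of(dst)


# Hypothetical usage (placeholder paths):
# ok = copy_and_verify("/home/user/big.iso", "/mnt/ssd/big-copy.iso")
# print("copy matches" if ok else "copy differs")
```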

adrianmariano · 09-30-2016, 05:37 AM

The mount options are the defaults picked by the ubuntu installer, I believe:

Code:

# / was on /dev/sda1 during installation
UUID=14a8e190-ceb4-44a0-9de4-5d65cf0fd009 /               ext4    errors=remount-ro 0       1

The SMART diagnostics show no errors:

Code:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      3736         -
# 2  Short offline       Completed without error       00%      3735         -

I tried copying a 1.7GB file and testing with sha1sum. The sums match.
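
If you want to check a self-test table like the one above from a script, a small parser is enough. This sketch assumes the columns are laid out as in my paste (the layout looks like the smartmontools self-test log, but that is an assumption), and `parse_selftest_log` is a name invented here.

```python
import re

# One row: "# 1  Extended offline    Completed without error   00%   3736   -"
ROW = re.compile(
    r"^#\s*(\d+)\s{2,}(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Turn self-test table rows into a list of dicts; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue
        num, desc, status, remaining, hours, lba = m.groups()
        rows.append({
            "num": int(num),
            "description": desc,
            "status": status,
            "remaining_percent": int(remaining),
            "lifetime_hours": int(hours),
            "lba_of_first_error": lba,
        })
    return rows


if __name__ == "__main__":
    sample = (
        "# 1  Extended offline    Completed without error       00%      3736         -\n"
        "# 2  Short offline       Completed without error       00%      3735         -\n"
    )
    rows = parse_selftest_log(sample)
    clean = all(r["status"] == "Completed without error" for r in rows)
    print(len(rows), "tests,", "all clean" if clean else "some failed")
```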

If I'm supposed to "strive to keep the disk cool and lightly loaded" then that implies that there's something wrong with the disk, in which case I should send it for warranty service.

business_kid · 10-01-2016, 01:46 AM

The "strive to keep the disk cool" was in the context of getting a backup. The mount options of interest are in /etc/fstab. Some mount options (e.g., atime) cause excessive wear on SSDs, and people who know stuff will make recommendations.

adrianmariano · 10-01-2016, 06:09 AM

The mount options I quoted above are from the fstab file. I understand the default is now reltime, which is supposed to be OK for ssds. There is the question of swap utilization, which is controlled elsewhere.

business_kid · 10-02-2016, 01:44 AM

Sorry, that was off screen on my tablet and I missed it.

I think it is

Code:

relatime

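and not reltime. You can also use noatime as a mount option. relatime only updates a file's access time when the previous access time is older than its modify or change time (or more than a day old), so it writes far less often than updating on every read.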
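
Since /etc/fstab only lists the options that were asked for explicitly, the options actually in effect are easier to see in /proc/mounts. A minimal Python sketch, assuming a Linux system; `atime_policy` is a hypothetical helper name:

```python
import os

# atime-related options as they appear in the options column of /proc/mounts.
ATIME_OPTIONS = {"relatime", "noatime", "nodiratime", "lazytime"}


def atime_policy(mounts_text, mount_point="/"):
    """Return the sorted atime-related options for mount_point.

    Returns None if mount_point is not listed. If it is listed more than
    once, the last entry wins.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            found = sorted(ATIME_OPTIONS.intersection(fields[3].split(",")))
    return found


if __name__ == "__main__":
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as f:
            print(atime_policy(f.read(), "/"))
```

An empty list means none of these options is shown for that filesystem.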

How stands the disk now?

adrianmariano · 10-02-2016, 06:12 AM

Yeah, relatime is what it is.

I reinstalled ubuntu and everything seems to be fine. I just checked syslog to see if there was anything in there about the disk and I didn't see any disk errors. So far so good.