Tons of disk errors on Samsung SSD after power cycle
My computer started emitting an alert tone. When I looked inside, I saw that the CPU fan had stopped, so presumably it was the CPU temperature alarm. I powered the machine down and let it cool for a few hours.
When I powered it back up, the fan appeared to be working normally (temp reported in the BIOS was 28), but my drive, a Samsung 850 Pro, had errors. Ran fsck. Rebooted. Ran a few system updates. Got errors. Then I rebooted again and fsck reported so many errors I couldn't get through it.
Code:
Ata3.00: status: { DRDY ERR }
Ata3.00: error: { ICRC ABRT }
Ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Ata3.00: BMDMA stat 0x26
Ata3.00: cmd ca/00:08:e8:16:40/00:00:00:00:00/e7 tag 0 dma 4096 out
Res 51/84:00:ef:16:40/00:00:00:00:00/e7 Emask 0x30 (host bus error)
Ata3.00: status: { DRDY ERR }
Ata3.00: error: { ICRC ABRT }
Blk_update_request: I/O error, dev sda, sector 121640680
Buffer I/O error on dev sda1, logical block 15204829, lost async page write
Ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
BMDMA stat 0x26
Failed command: WRITE DMA
Above is an example of some errors that were appearing during normal operation when trying to shut down (not while fsck was running).
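For anyone reading the log above, the two hex bytes after "Res" are the ATA status and error registers, and they can be decoded by hand. The sketch below is only an illustration: the function name `decode_res` is made up, and the bit tables cover only a subset of the register bits. ICRC is generally documented as an interface CRC error on the link between controller and drive, which more often points at cabling, the port, or power than at the flash itself. That reading is an assumption about this case, not a diagnosis.

```python
import re

# Subset of ATA status/error register bits (not the full register layout).
STATUS_BITS = [(0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x08, "DRQ"), (0x01, "ERR")]
ERROR_BITS = [(0x80, "ICRC"), (0x40, "UNC"), (0x10, "IDNF"), (0x04, "ABRT")]


def decode_res(line):
    """Decode the status/error bytes from a libata 'res XX/YY:...' line.

    Returns None when the line does not contain a 'res' field.
    """
    m = re.search(r"res\s+([0-9a-f]{2})/([0-9a-f]{2}):", line, re.IGNORECASE)
    if m is None:
        return None
    status = int(m.group(1), 16)
    error = int(m.group(2), 16)
    return {
        "status": [name for bit, name in STATUS_BITS if status & bit],
        "error": [name for bit, name in ERROR_BITS if error & bit],
    }


line = "Res 51/84:00:ef:16:40/00:00:00:00:00/e7 Emask 0x30 (host bus error)"
print(decode_res(line))
```

The decoded names match the `{ DRDY ERR }` and `{ ICRC ABRT }` lines the kernel printed alongside the raw bytes.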
In that sort of situation, the most reliable sign of fatal trouble is constant change. I would expect the cpu to show issues before the ssd. But as jefro said, it could be many things.
The SSD is by far the newest component on this box. I bought it in March. The CPU is from 2007, the motherboard is from 2011.
So I went ahead and got a new motherboard and CPU. I booted the new setup off an old hard drive (the one I used before I got the SSD) and ran fsck on the SSD. Hundreds of errors were fixed. When I tried to boot off the SSD, it threw a bunch of errors and got stuck. (I made the mistake of running updates in the middle of all this, which might explain a corrupted OS if fsck only made the filesystem consistent without truly repairing the damaged files.)
So the question is: how do I know if the disk is actually bad vs. just corrupted, but still usable once repaired? I've read that SMART may not be useful for SSDs. Is that true?
Is there a way to have Ubuntu fix a corrupted installation?
Of course, the first thing I did once fsck was done was copy my user files and my system configuration (/etc) to another disk. I used cp, not dd, to do this, and it completed without any errors...but I have no way of knowing if the data is corrupted or not.
What specific disk checks are meaningful for a Samsung SSD? Is SMART meaningful?
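On the question of whether the cp backup can be trusted: one partial check is to compare the backup against the source byte for byte. The sketch below is a minimal illustration with a hypothetical helper name (`mismatched_files`) and placeholder paths. It can only show that the copy matches what is on the SSD now. It cannot detect files that were already damaged before they were copied.

```python
import filecmp
import os


def mismatched_files(src_root, dst_root):
    """Return relative paths of files under src_root that are missing
    from dst_root or whose contents differ byte for byte."""
    bad = []
    for dirpath, _dirs, files in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        for name in files:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_root, rel, name)
            # shallow=False forces a full content comparison rather than
            # trusting size and modification time.
            if not os.path.isfile(dst) or not filecmp.cmp(src, dst, shallow=False):
                bad.append(os.path.normpath(os.path.join(rel, name)))
    return sorted(bad)
```

Usage would look something like `mismatched_files("/mnt/ssd/etc", "/mnt/backup/etc")`, where both paths are placeholders for wherever the two copies are mounted. An empty list means every source file has an identical counterpart in the backup.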
Something that hasn't been mentioned yet is your choice of mount options. Please post them. Copy a huge file to free space and use sha1sum or something similar to check that the copy is valid. Get everything valuable backed up. Strive to keep the disk cool and lightly loaded.
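The copy-and-checksum test suggested above can be sketched in a few lines. This is just one way to do it, using Python's hashlib instead of the sha1sum command, and the function names are made up for illustration. One caveat: right after a copy, the read-back may be served from the page cache rather than from the disk, so a match is weaker evidence than it looks unless the file is larger than RAM or the check is repeated after a reboot.

```python
import hashlib
import shutil


def sha1_of(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file, read in chunks so that
    very large files do not have to fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst, then report whether the two checksums match."""
    shutil.copyfile(src, dst)
    return sha1_of(src) == sha1_of(dst)
```

Calling `copy_and_verify("big.iso", "/mnt/ssd/big.iso")` (both paths are placeholders) returns True when the sums match.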
The mount options are the defaults picked by the Ubuntu installer, I believe:
Code:
# / was on /dev/sda1 during installation
UUID=14a8e190-ceb4-44a0-9de4-5d65cf0fd009 / ext4 errors=remount-ro 0 1
The SMART diagnostics show no errors:
Code:
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%             3736  -
# 2  Short offline       Completed without error        00%             3735  -
I tried copying a 1.7GB file and testing with sha1sum. The sums match.
If I'm supposed to "strive to keep the disk cool and lightly loaded" then that implies that there's something wrong with the disk, in which case I should send it for warranty service.
The "strive to keep the disk cool" was in the context of getting a backup. The mount options of interest are in /etc/fstab. Some mount options (e.g., atime) cause excessive wear on SSDs, and people who know this stuff will make recommendations.
The mount options I quoted above are from the fstab file. I understand the default is now relatime, which is supposed to be OK for SSDs. There is also the question of swap utilization, which is controlled elsewhere.
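To make the fstab discussion concrete, here is a small sketch that splits an fstab line into its six fields and reports which access-time option, if any, is set explicitly. The helper names are hypothetical. When none of the options is listed, the kernel default applies, which has been relatime since Linux 2.6.30 as far as I know.

```python
def parse_fstab_line(line):
    """Split one /etc/fstab line into its fields.

    Returns None for blank lines, comments, and lines with too few fields.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 4:
        return None
    return {
        "spec": fields[0],
        "mount_point": fields[1],
        "fstype": fields[2],
        "options": fields[3].split(","),
        "dump": int(fields[4]) if len(fields) > 4 else 0,
        "passno": int(fields[5]) if len(fields) > 5 else 0,
    }


def atime_policy(options):
    """Name the explicit access-time option, or note that none is set."""
    for opt in ("noatime", "relatime", "strictatime"):
        if opt in options:
            return opt
    return "kernel default"


entry = parse_fstab_line(
    "UUID=14a8e190-ceb4-44a0-9de4-5d65cf0fd009 / ext4 errors=remount-ro 0 1"
)
print(entry["options"], atime_policy(entry["options"]))
```

The fstab file only shows what was requested. The options actually in effect for a mounted filesystem are listed in /proc/mounts.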
I reinstalled ubuntu and everything seems to be fine. I just checked syslog to see if there was anything in there about the disk and I didn't see any disk errors. So far so good.