LinuxQuestions.org
Old 09-24-2020, 02:12 PM   #1
operater
LQ Newbie
 
Registered: Sep 2020
Posts: 4

Rep: Reputation: Disabled
Troubleshooting a bricked nvme SSD (Samsung 960EVO)


Hi all! This is my first post here, so please bear with me. I'm here to explore ways of talking to a stubborn NVMe drive using kernel-level PCI methods, such as setpci among others. My motives are pretty clear: first of all a dopamine shot from fixing this instead of RMAing my drive, and of course recovering my data.

The story is for the most part simple: I enabled Hyper-V on Windows, the system BSODed and froze due to bad NVMe drivers, and the drive no longer responds after a forced power cycle. Thanks, Samsung and Microsoft!
Cold booting causes a hang during POST, and the drive LED blinks constantly at the same pace no matter the environment. Many environments cannot boot at all while the drive is inserted.

dmesg on ubuntu live looks like this
Code:
dmesg | grep nvme
[    0.687202] nvme nvme0: pci function 0000:02:00.0
[   31.250882] nvme nvme0: Device not ready; aborting reset
[   31.250889] nvme nvme0: Removing after probe failure status: -19
lsblk reports no NVMe-related block devices at all, so there is no way to do anything using hdparm, etc. I have had some success with semi-bricked SSDs using hdparm in the past, on several occasions actually.
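
For anyone reading along: the "-19" at the end of the dmesg output is a negative Linux errno. A minimal sketch of a lookup helper follows (the function name errno_name is made up for illustration, and only a few errno values are listed). It shows that -19 is ENODEV, "No such device", which fits the driver giving up on the controller:

```shell
#!/bin/sh
# Translate a negative kernel errno, as printed in dmesg, into its symbolic name.
# Only a handful of Linux errno values are listed; extend the table as needed.
errno_name() {
    code=${1#-}          # strip the leading minus sign: "-19" becomes "19"
    case "$code" in
        5)   echo "EIO" ;;
        12)  echo "ENOMEM" ;;
        16)  echo "EBUSY" ;;
        19)  echo "ENODEV" ;;
        110) echo "ETIMEDOUT" ;;
        *)   echo "unknown($code)" ;;
    esac
}

errno_name -19   # prints ENODEV
```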

lspci reports the device's memory controller however:
Code:
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
I have tried using a Samsung bootable utility to reflash the firmware, but that environment doesn't detect the device either and just runs a reboot script.
What I've done is unpack the initramfs of this utility and extract the binaries, in order to try to reflash the drive from a more robust and feature-packed environment. I'm wondering if there could be a way (there has to be) to persuade the device, through some of its PCI capabilities or other kernel-level methods, to stop doing what it's doing and let me detect it and reflash it.
I'm not a believer in modern hardware failures, especially not when it's a 2-year-old SSD with 99% life left. I'm quite certain this is a logical issue and there has to be a way to fix it!

Your help is loads welcome, please join me in this adventure!
 
Old 09-25-2020, 12:44 PM   #2
kilgoretrout
Senior Member
 
Registered: Oct 2003
Posts: 3,006

Rep: Reputation: 395Reputation: 395Reputation: 395Reputation: 395
There's an open-source tool for dealing with NVMe drives, included in the nvme-cli package in Ubuntu. I'm not sure whether Ubuntu installs it by default, but if it isn't already installed, it can be installed with:
Code:
$ sudo apt install nvme-cli
Once installed, run:
Code:
$ sudo nvme list
and see if the drive is detected by the utility. If it is, there is a whole host of commands you can try to get the thing working. See:

https://nvmexpress.org/open-source-n...face-nvme-cli/

Good luck.
 
Old 09-25-2020, 01:33 PM   #3
operater
LQ Newbie
 
Registered: Sep 2020
Posts: 4

Original Poster
Rep: Reputation: Disabled
Thanks! But, just as I feared given the dmesg output, nvme-cli doesn't seem to detect anything. lsblk lists no block devices either.

Given the dmesg output, I am now wondering how to talk to the device at the PCI level in order to give it a jolt (using its PCI capabilities?).
I have tried power cycling the pci slot etc... to no avail.
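
For reference, one software-side way to re-enumerate a PCI function is the sysfs remove/rescan pair. The sketch below is only an illustration and not necessarily what was tried here; SYSFS_ROOT and RESCAN_DELAY are hypothetical knobs added so the function can be dry-run against a scratch directory. Note that this only detaches and re-probes the device, it does not cut power to the slot.

```shell
#!/bin/sh
# Remove a PCI function from the kernel's device tree and rescan the bus.
# SYSFS_ROOT and RESCAN_DELAY are hypothetical overrides for dry runs;
# real use needs root and writes to /sys.
pci_remove_rescan() {
    addr=$1                          # full address, e.g. 0000:03:00.0
    root=${SYSFS_ROOT:-/sys}
    dev="$root/bus/pci/devices/$addr"
    if [ ! -d "$dev" ]; then
        echo "no such PCI device: $addr" >&2
        return 1
    fi
    echo 1 > "$dev/remove"           # detach the driver and drop the device
    sleep "${RESCAN_DELAY:-2}"       # give the link a moment to settle
    echo 1 > "$root/bus/pci/rescan"  # ask the kernel to re-enumerate all buses
}

# Usage (as root): pci_remove_rescan 0000:03:00.0
```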

please haalp :'D

update: after a cold reboot, nvme list now outputs (after a 30-second hang):

identify failed
NVMe status: Unknown(0x371)
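
A side note on the 0x371 value: the status can be split into its fields by hand. The sketch below assumes the layout used in Linux-style status values (status code in bits 7:0, status code type in bits 10:8, do-not-retry in bit 14); the function name is made up for illustration. If the kernel headers are read correctly, path-related code 0x71 means the command was aborted by the host, which would suggest the status came from the driver's own timeout handling and not from the drive itself. Treat that reading as an assumption.

```shell
#!/bin/sh
# Split a Linux-style NVMe status value into its fields.
# Assumed layout: bits 7:0 status code (SC), bits 10:8 status code type (SCT),
# bit 14 do-not-retry (DNR).
nvme_status_decode() {
    status=$(( $1 ))                 # accepts hex input such as 0x371
    sc=$(( status & 0xff ))
    sct=$(( (status >> 8) & 0x7 ))
    dnr=$(( (status >> 14) & 0x1 ))
    case "$sct" in
        0) kind="generic" ;;
        1) kind="command-specific" ;;
        2) kind="media/data-integrity" ;;
        3) kind="path-related" ;;
        7) kind="vendor-specific" ;;
        *) kind="reserved" ;;
    esac
    printf 'sct=%d (%s) sc=0x%02x dnr=%d\n' "$sct" "$kind" "$sc" "$dnr"
}

nvme_status_decode 0x371   # prints: sct=3 (path-related) sc=0x71 dnr=0
```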

update:

After trying nvme list, dmesg now throws up a real treasure trove of errors:

[ 9.097300] nvme nvme0: pci function 0000:03:00.0
[ 9.097340] nvme 0000:03:00.0: enabling device (0100 -> 0102)
[ 9.222273] nvme nvme0: 7/0/0 default/read/poll queues
[ 9.234497] nvme0n1: p1 p2
[ 41.920083] nvme nvme0: I/O 961 QID 1 timeout, aborting
[ 72.612735] nvme nvme0: I/O 961 QID 1 timeout, reset controller
[ 84.130092] nvme nvme0: I/O 16 QID 0 timeout, reset controller
[ 164.607159] nvme nvme0: Device not ready; aborting reset
[ 164.627443] blk_update_request: I/O error, dev nvme0n1, sector 976773104 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 164.627511] nvme nvme0: Abort status: 0x371
[ 195.250087] nvme nvme0: Device not ready; aborting reset
[ 195.250094] nvme nvme0: Removing after probe failure status: -19
[ 225.829326] nvme nvme0: Device not ready; aborting reset
[ 225.829566] Buffer I/O error on dev nvme0n1, logical block 122096638, async page read
[ 225.834098] Buffer I/O error on dev nvme0n1p2, logical block 121947888, async page read
[ 225.834462] Buffer I/O error on dev nvme0n1p1, logical block 148208, async page read
[ 225.854260] nvme nvme0: failed to set APST feature (-19)
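
As an aside, the "(0100 -> 0102)" in the "enabling device" line is the PCI command register before and after the driver enabled the device. A small sketch to decode it follows (the helper name and the short flag labels are made up here; only the commonly used bits are covered). It shows that the only change is memory-space decoding being switched on:

```shell
#!/bin/sh
# Decode a PCI command register value (hex, no 0x prefix) into flag names.
# Covers only some of the standard bits: I/O space, memory space, bus master,
# parity error response, SERR# enable, INTx disable.
pci_command_flags() {
    val=$(( 0x$1 ))
    out=""
    for pair in 1:IO 2:MEM 4:BUSMASTER 64:PARITY 256:SERR 1024:INTX_OFF; do
        bit=${pair%%:*}              # numeric bit mask before the colon
        name=${pair#*:}              # label after the colon
        if [ $(( val & bit )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    echo "${out# }"                  # drop the leading space
}

pci_command_flags 0100   # prints SERR
pci_command_flags 0102   # prints MEM SERR
```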

Last edited by operater; 09-25-2020 at 01:48 PM.
 
Old 09-25-2020, 02:18 PM   #4
uteck
Senior Member
 
Registered: Oct 2003
Location: Elgin,IL,USA
Distribution: KDE Neon
Posts: 1,216

Rep: Reputation: 509Reputation: 509Reputation: 509Reputation: 509Reputation: 509Reputation: 509
May want to give the Init Disk tool from GRC a try. https://www.grc.com/initdisk.htm
Someone reported that it got a bricked SSD to respond, so seems to work on non-USB drives as well.
Windows only, but should work in WINE also.
 
Old 09-25-2020, 05:00 PM   #5
kilgoretrout
Senior Member
 
Registered: Oct 2003
Posts: 3,006

Rep: Reputation: 395Reputation: 395Reputation: 395Reputation: 395
Is the drive properly detected in your BIOS setup? Have you tried any options in your BIOS setup regarding NVMe drives that may help? If you're overclocking, remove the OC and see if that helps. Try reseating the drive or, if you have another M.2 slot, try the other slot. The blinking light and the error messages you posted from the nvme list command indicate repeated I/O errors, and that diagnosis is consistent with the other error messages you posted. I don't know if you've looked at any of the foregoing, but I'm just trying to give you some obvious things that are easy to overlook when you get too deep into a problem.
 
Old 09-25-2020, 05:04 PM   #6
operater
LQ Newbie
 
Registered: Sep 2020
Posts: 4

Original Poster
Rep: Reputation: Disabled
Thanks, but none of that works. There is no OC involved either. I have tried everything in the BIOS and have tried two different mobos. The BIOS does see the device, since lspci does as well.

I have managed to get dmesg to throw another error message by trying "echo 1 > enable" on the PCI device in question:

pci 0000:03:00.0: Refused to change power state, currently in D3
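
To cross-check what power state the device itself reports, the power management control/status register (PMCSR) can be read from config space; its low two bits hold the D-state. The sketch below only decodes a value; the setpci invocation in the comment assumes pciutils is installed and that the CAP_PM+4.w register syntax is available, and the helper name is made up for illustration. A read of all ones usually means config space is not answering at all.

```shell
#!/bin/sh
# Decode the power state from a PMCSR value (four hex digits, as setpci prints).
# Bits 1:0 of PMCSR encode the D-state; "ffff" means the read returned all ones.
pmcsr_power_state() {
    raw=$1
    if [ "$raw" = "ffff" ]; then
        echo "inaccessible"
        return 0
    fi
    case $(( 0x$raw & 0x3 )) in
        0) echo "D0" ;;
        1) echo "D1" ;;
        2) echo "D2" ;;
        3) echo "D3hot" ;;
    esac
}

# On real hardware (as root), assuming setpci from pciutils:
#   pmcsr_power_state "$(setpci -s 03:00.0 CAP_PM+4.w)"
pmcsr_power_state 0003   # prints D3hot
```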

edit:

same story again

[964.506324] pci 0000:03:00.0: Timeout waiting for NVMe ready status to clear after disable
[965.230322] pci 0000:03:00.0: timed out waiting for pending transaction; performing function level reset anyway

edit:

picking up some more variety in errors:

[ 72.767268] nvme nvme0: I/O 323 QID 4 timeout, reset controller
[ 84.288664] nvme nvme0: I/O 8 QID 0 timeout, reset controller

edit:

I have added some timeout as a kernel module parameter nvme_core.admin_timeout=990

and now dmesg goes like this:

[ 42.176239] nvme nvme0: I/O 966 QID 2 timeout, aborting
[ 50.727453] nvme nvme0: I/O 28 QID 0 timeout, reset controller
[ 73.061969] nvme nvme0: I/O 966 QID 2 timeout, reset controller
[ 243.054529] __nvme_disable_io_queues+0x17e/0x1d0 [nvme]
[ 243.054532] ? nvme_simple_resume+0x20/0x20 [nvme]
[ 243.054537] nvme_disable_io_queues+0x15/0x30 [nvme]
[ 243.054540] nvme_dev_disable+0x4da/0x4f0 [nvme]
[ 243.054545] nvme_timeout.cold+0xbc/0x187 [nvme]
[ 243.054749] nvme_dev_disable+0x3a/0x4f0 [nvme]
[ 243.054754] nvme_timeout.cold+0xbc/0x187 [nvme]
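
For anyone wanting to repeat the timeout experiment: the nvme_core.admin_timeout setting mentioned above can be given on the kernel command line, or, when nvme_core is built as a module, as an "options nvme_core admin_timeout=990" line in a file under /etc/modprobe.d. The values are in seconds. The sketch below just reads back what is currently in effect; SYSFS_ROOT is a hypothetical override so it can be tried against a scratch directory.

```shell
#!/bin/sh
# Print the nvme_core timeout parameters currently in effect (in seconds).
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
nvme_timeouts() {
    params=${SYSFS_ROOT:-/sys}/module/nvme_core/parameters
    for p in admin_timeout io_timeout; do
        if [ -r "$params/$p" ]; then
            printf '%s=%s\n' "$p" "$(cat "$params/$p")"
        else
            printf '%s=unavailable\n' "$p"   # module not loaded or no access
        fi
    done
}

nvme_timeouts
```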

Now that the I/O queues are disabled, maybe there is some way to talk sense into the controller. If anyone picks up any ideas along the way, please shoot.

Last edited by operater; 09-25-2020 at 07:07 PM.
 
Old 09-25-2020, 07:03 PM   #7
kilgoretrout
Senior Member
 
Registered: Oct 2003
Posts: 3,006

Rep: Reputation: 395Reputation: 395Reputation: 395Reputation: 395
If you haven't already done so, boot into windows with the drive connected and try using the Magician software mentioned here:

https://www.samsung.com/semiconducto...nsumer/960evo/

If the device isn't being detected by Samsung's own utilities, either Magician or the Samsung bootable utility mentioned in your original post, I wouldn't hold out much hope. I don't know what Samsung support is like, but you could give that a try. They may have some suggestions other than rma.
 
Old 09-25-2020, 07:17 PM   #8
operater
LQ Newbie
 
Registered: Sep 2020
Posts: 4

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by kilgoretrout View Post
If you haven't already done so, boot into windows with the drive connected and try using the Magician software mentioned here:

https://www.samsung.com/semiconducto...nsumer/960evo/

If the device isn't being detected by Samsung's own utilities, either Magician or the Samsung bootable utility mentioned in your original post, I wouldn't hold out much hope. I don't know what Samsung support is like, but you could give that a try. They may have some suggestions other than rma.
Thanks, but I have tried the Samsung utilities. No hope with Windows, as the driver is totally stuck. Samsung support is useless, apart from offering an RMA.
I have managed to disable system I/O requests, and now the device stays visible as a block device for as long as the timeout that I've set lasts.

Now, how would one proceed to jolt a stuck controller back to life using nvme-cli, given the current dmesg I/O errors?
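
As a starting point only, and with no claim that it revives this particular drive, a cautious first pass with nvme-cli could be limited to read-only admin queries while the controller node is still present. The sketch below assumes the controller shows up as /dev/nvme0; the wrapper function and its DRY_RUN switch are made up for illustration, while id-ctrl, smart-log and fw-log are standard nvme-cli subcommands. A controller reset can be requested separately with "nvme reset /dev/nvme0".

```shell
#!/bin/sh
# Run read-only nvme-cli admin queries against a controller, stopping at the
# first failure. DRY_RUN=1 only prints the commands instead of running them.
# The default device path /dev/nvme0 is an assumption.
nvme_probe() {
    dev=${1:-/dev/nvme0}
    for sub in id-ctrl smart-log fw-log; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "nvme $sub $dev"
        else
            nvme "$sub" "$dev" || { echo "nvme $sub failed" >&2; return 1; }
        fi
    done
}

# Show what would be run without touching the device:
DRY_RUN=1 nvme_probe /dev/nvme0
```

If even id-ctrl times out, the later steps are unlikely to get further, since they go through the same admin queue.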
 
  

