(yes, I agree that the boot/root SSD is mysteriously failing i.e. 'disappearing' off-line)
Can you [easily] get another SSD? (if identical, maybe connect external & simply dd copy [UUID same?])
Or *maybe* 'mirror' [raid/multipath?] the SSD to free space somewhere on other 3 hdd...
I'm wondering if there's a 'better' way to capture console messages (just SEE them; not depend on any disk logging)...
** What 'console'/display do you use, when you find the system frozen?
(Could you leave something like a serial dumb-terminal console attached, where displaying the last messages doesn't depend on the kernel video display? Or maybe watch remotely, to avoid any screensaver-blanking, needing to [load&run] login, ...)
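For the serial-console idea, here is a rough sketch of the kernel options I mean (not tested on your box). The port ttyS0 and the speed 115200n8 are my assumptions; consoleblank=0 is there to stop the text console from blanking.

```bash
# /etc/default/grub (fragment): keep whatever options are already on this line
# and append the console entries. ttyS0 and 115200n8 are assumptions.
# Kernel messages go to every console= listed; the last one becomes /dev/console.
GRUB_CMDLINE_LINUX="consoleblank=0 console=tty0 console=ttyS0,115200n8"
```

On CentOS 7 you would then regenerate the config with grub2-mkconfig -o /boot/grub2/grub.cfg and reboot, with a terminal (or another PC running a terminal program) on the other end of the serial cable.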
I found something interesting here (middle), (more info and .../power/control):
Quote:
To spin down the SDD platters and power off the individual disks, to simulate disks being taken out or prior to taking out the disks, issue the following:
echo 1 > /sys/block/sda/device/delete
...
I tried this with my VirtualBox CentOS7 netinst CD /sr0/ [only; no hdd] (Ctrl-Alt-F1 thru F5)
I got lots of kernel messages on tty4 (Alt-F4: console logs), but the `df /` loop and top [on tty2&3] kept running OK (new cmds [on tty5] from /bin failed with msgs, of course). Also, `fdisk -l` instead of df does catch that the device is gone.
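Since new commands from /bin fail once the disk is gone, a watcher built only from bash builtins might be a way to SEE the moment it drops. This is just a sketch: the function names are mine, "sda" is an assumption, and the printf time format needs bash 4.2 or newer.

```bash
# Report whether a block device is still registered with the kernel.
# Only builtins are used, so nothing has to be loaded from disk later.
disk_state() {
    local dev=$1 root=${2:-/sys}   # root is overridable only for testing
    if [ -e "$root/block/$dev" ]; then
        echo present
    else
        echo missing
    fi
}

# Print a timestamped line each time the state changes.
watch_disk() {
    local dev=$1 interval=${2:-2} last="" now
    while :; do
        now=$(disk_state "$dev")
        if [ "$now" != "$last" ]; then
            printf '%(%F %T)T %s is %s\n' -1 "$dev" "$now"
            last=$now
        fi
        # 'read -t' stands in for sleep (an external binary); it assumes
        # stdin is the terminal. Pressing Enter just polls sooner.
        read -r -s -t "$interval" _ || true
    done
}
```

Start it on a spare tty before the trouble begins, e.g. `watch_disk sda 2`, and leave it on screen.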
Of course, don't take my 'noob' ramblings as *anywhere near* expert...!
(getting inanely obscure, breakpoint/debug-log [SSD?]driver.c/.ko sending such...)
I assume that /etc and [/usr]/bin&lib are on the SSD, 'mostly' readonly [not /var on SSD]
While the system is down, maybe you could boot single [ro ok for safety] & try the 'delete',
& reset, to see whether a full power-cycle is needed to bring the SSD back online (per BIOS)
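Before reaching for the reset button, it may be worth asking each SCSI host to rescan its bus, to see if the disk reappears without a power-cycle. A minimal sketch, run as root; writing "- - -" to the scan file is the usual sysfs wildcard rescan, while the function name and the optional root argument are just mine for illustration.

```bash
# Ask every SCSI host to rescan all channels/targets/LUNs.
rescan_scsi_hosts() {
    local root=${1:-/sys} host   # root is overridable only for testing
    for host in "$root"/class/scsi_host/host*; do
        [ -e "$host/scan" ] || continue   # skip if the glob matched nothing
        echo "- - -" > "$host/scan"
        echo "rescanned ${host##*/}"
    done
}
```

After calling `rescan_scsi_hosts`, check whether /sys/block/sda is back. If it is not, that would suggest the drive really does need the power-cycle.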
I wonder if some software could be ['accidentally'] doing something like this...
(blame systemd, *LOL* ... *JUST KIDDING*, sorry)
I'm looking for some way to 'prove' that what is SEEN fully matches the SSD 'shutting off'.