[SOLVED] Diagnosing hardware problems

ghollisjr · 12-25-2021, 01:11 AM

I recently came across a possibly unique situation where one of my Linux systems suffers from random crashes with very scarce diagnostic information. E.g.,

* There are no dmesg logs stating a problem prior to the crash. There are no system logs in general that signal a problem prior to the crash.
* A Kexec crash kernel is never launched despite being configured to do so with kexec -p ...
* The crashes are seemingly random, as everything from system load to the running software can vary. The time of each crash is random, sometimes within a few minutes of boot, sometimes outside of 8 hours of uptime.
* Memtest, CPU stress tests, and GPU stress tests all succeed without warnings of temperature/stability/etc.
* lm_sensors temperatures all look OK before and up to the moment of a crash. Not just OK, but almost always at least 30-50 C below the alarm temperatures.

All of the above tells me there is most likely something wrong at the motherboard/CPU level, but since I am not experienced with trying to diagnose these kinds of issues, I was hoping others might have some advice/insight as to what steps are used in a Linux system to diagnose these problems.
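
One way to narrow down why the crash kernel never launches is to confirm that one is actually loaded and that memory was reserved for it at boot. The following is a minimal sketch, assuming the standard /sys/kernel/kexec_crash_loaded file and a crashkernel= boot parameter; the helper name has_crashkernel_param is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel command line reserves memory
# for a crash kernel via a crashkernel= parameter.
has_crashkernel_param() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the running kernel's command line, if it is readable.
if [ -r /proc/cmdline ]; then
    if has_crashkernel_param "$(cat /proc/cmdline)"; then
        echo "crashkernel= is set on the kernel command line"
    else
        echo "no crashkernel= parameter; no memory is reserved for a crash kernel"
    fi
fi

# 1 means a crash kernel is currently loaded, 0 means it is not.
if [ -r /sys/kernel/kexec_crash_loaded ]; then
    echo "kexec_crash_loaded: $(cat /sys/kernel/kexec_crash_loaded)"
fi
```

Even when both checks pass, a hard lockup that never reaches a kernel panic will not start the crash kernel, so a missing crash dump does not by itself point to a kexec misconfiguration.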

mrmazda · 12-25-2021, 01:29 AM

Systemd installations have a persistent journal option that typically is not enabled by default. When enabled, it's often possible to find logged clues to problems in the journal of the prior boot:

Code:

sudo journalctl -b -1

Enabling persistence is ostensibly via /etc/systemd/journald.conf, but I find it's more reliable just to create /var/log/journal.
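
As a rough sketch of the directory approach: with journald's default Storage=auto setting, the journal is kept across reboots only if /var/log/journal exists. The helper name journal_storage_mode is hypothetical, and the check assumes Storage= has not been overridden in journald.conf.

```shell
#!/bin/sh
# Hypothetical helper: guess whether journald keeps logs across reboots by
# checking for the journal directory (assumes the default Storage=auto).
journal_storage_mode() {
    dir="${1:-/var/log/journal}"
    if [ -d "$dir" ]; then
        echo persistent
    else
        echo volatile
    fi
}

mode="$(journal_storage_mode)"
echo "journal storage looks $mode"
if [ "$mode" = volatile ]; then
    # Printed rather than executed, since both steps need root.
    echo "to enable: sudo mkdir -p /var/log/journal && sudo systemctl restart systemd-journald"
fi
```

Once the directory exists and journald has been restarted, journalctl --list-boots should start listing more than the current boot after the next reboot.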

Yours could well be a motherboard and/or CPU problem, but it also could be an unstable or inadequate power supply.

If you have multiple RAM sticks, try doing without one at a time to see if it is nevertheless memory related, possibly not RAM itself, but RAM socket or memory controller.

How old is this PC? Your hardware itself could be a clue. Please provide output from inxi -Bayz run after you have run sudo inxi -U (-U to ensure your distro hasn't provided you a broken old inxi version).

ghollisjr · 12-28-2021, 02:11 PM

It is an old system (~10 years for CPU + mobo + psu), but the RAM, GPU, and all the drives are <1 year old.

inxi -B:

Quote:

Battery:
Message: No system battery data found. Is one present?

Makes sense, there's no battery.

inxi -ayz:

Quote:

Use of uninitialized value in string eq at /usr/bin/inxi line 17848.
Use of uninitialized value in string eq at /usr/bin/inxi line 17848.
Use of uninitialized value in string eq at /usr/bin/inxi line 17848.
Use of uninitialized value in string eq at /usr/bin/inxi line 17848.
Use of uninitialized value in string eq at /usr/bin/inxi line 17848.
CPU: quad core Intel Core i7-3820 (-MT MCP-) speed/min/max: 3204/1200/3800 MHz
Kernel: 5.15.10-arch1-1 x86_64 Up: 29m Mem: 5893.1/30010.1 MiB (19.6%)
Storage: 7.99 TiB (66.9% used) Procs: 354 Shell: Bash 5.1.12 inxi: 3.3.11

Not sure about the "uninitialized value..." messages, but the rest looks right.

ghollisjr · 12-28-2021, 02:15 PM

I did collect a few crash logs with journalctl. Here's a couple:

Quote:

Dec 21 17:00:01 ghollisjr-desktop kernel: audit: type=1106 audit(1640124001.559:101): pid=2574 uid=0 auid=0 ses=3 msg='op=PAM:session_close grantors=pam_loginuid,pam_limits,pam_unix acct="root" exe="/usr/bin/crond" hostname=? addr=? terminal=cron res=success'
Dec 21 17:01:01 ghollisjr-desktop CROND[2590]: (root) CMD (run-parts /etc/cron.hourly)
Dec 21 17:01:01 ghollisjr-desktop CROND[2589]: (root) CMDEND (run-parts /etc/cron.hourly)
Dec 21 17:04:11 ghollisjr-desktop dhcpcd[1083]: wlan0: failed to renew DHCP, rebinding
Dec 21 17:04:25 ghollisjr-desktop dhcpcd[1083]: wlan0: leased 192.168.0.100 for 3600 seconds

Quote:

Dec 25 02:22:23 ghollisjr-desktop rtkit-daemon[1420]: Successfully made thread 24232 of process 1812 owned by '1000' RT at priority 10.
Dec 25 02:22:23 ghollisjr-desktop rtkit-daemon[1420]: Supervising 4 threads of 4 processes of 1 users.
Dec 25 02:28:33 ghollisjr-desktop rtkit-daemon[1420]: Supervising 4 threads of 4 processes of 1 users.
Dec 25 02:28:33 ghollisjr-desktop rtkit-daemon[1420]: Supervising 4 threads of 4 processes of 1 users.
Dec 25 02:28:36 ghollisjr-desktop rtkit-daemon[1420]: Supervising 4 threads of 4 processes of 1 users.
Dec 25 02:28:36 ghollisjr-desktop rtkit-daemon[1420]: Supervising 4 threads of 4 processes of 1 users.

I briefly thought that the "wlan0: failed..." messages might be a hint that the WIFI drivers were causing a problem since there were a few logs where renewing the IP address from the router was the last thing in the logs, but I subsequently had crashes while using ethernet only. The rtkit-daemon messages seem to just be the watchdog system on Arch giving diagnostic messages about new processes starting.
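
For comparing the tail of several crashed boots side by side, something like the following might help. It is only a sketch: the helper name boot_offsets, the count of 3 boots, and the 20-line tail are arbitrary choices, and it assumes a persistent journal so that earlier boots are available.

```shell
#!/bin/sh
# Hypothetical helper: print boot offsets -1, -2, ... -N, one per line.
boot_offsets() {
    i=1
    while [ "$i" -le "$1" ]; do
        echo "-$i"
        i=$((i + 1))
    done
}

# Show the final 20 journal lines of each of the last 3 boots.
if command -v journalctl >/dev/null 2>&1; then
    for off in $(boot_offsets 3); do
        echo "=== boot $off ==="
        journalctl -b "$off" -n 20 --no-pager 2>/dev/null \
            || echo "(no journal for boot $off)"
    done
fi
```

If the same message keeps turning up as the last line across unrelated crashes, that is a hint worth following; if the last lines differ every time, as in the excerpts above, the logs probably are not capturing the cause.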

mrmazda · 12-28-2021, 06:18 PM

I made a mistake. inxi -Bayz should have been inxi -bayz.

a, y & z are inxi modifiers.

Quote:

Use of uninitialized value in string eq

is because when you used inxi -ayz you signaled nothing to be modified.

frankbell · 12-28-2021, 07:37 PM

It's not uncommon in a system crash for events leading to the crash not to get logged because the crash happens before the logs can be written.

Have you checked that the cooling vents and fans are free of obstruction and working properly? Overheating is the most common cause of unexpected poweroffs.
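
Because the last readings before a crash may never reach the disk, one option is to log temperatures yourself and force each sample out immediately. This is a minimal sketch, assuming the lm_sensors sensors command is installed; the helper name log_sensors_once, the log path, and the 10-second interval are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: append one timestamped sensors reading to a log file
# and flush it to disk so the last sample survives a sudden crash.
log_sensors_once() {
    {
        date '+%Y-%m-%d %H:%M:%S'
        sensors 2>/dev/null || echo "sensors unavailable"
    } >> "$1"
    sync
}

# Example loop (log path and interval are arbitrary assumptions):
# while :; do log_sensors_once "$HOME/sensors.log"; sleep 10; done
```

After the next crash, the tail of the log file shows the temperatures and voltages from at most one interval before the machine went down.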

Crippled · 12-28-2021, 07:37 PM

Apart from memory tests and drive tests, software is of little help here. Logs are useless for detecting hardware problems because the O.S. freezes, and so stops writing to the log, as soon as the offending hardware causes a problem. Only swapping out hardware will find the offending part. What make and model is your motherboard?

uteck · 12-28-2021, 08:39 PM

A system that old is a prime candidate for bad capacitors. https://en.wikipedia.org/wiki/Capacitor_plague
Open up the box and look for capacitors that are round at the top, or have crud leaking from them. The Wiki link above has some good photos.

A failing capacitor can have many adverse effects on a system that may not show up in the logs.

ondoho · 12-29-2021, 08:06 AM

Adding my 2ct's worth of a wild guess...

I recently had something similar happen. There was no overheating, but the computer's insides were very dusty. I took a vacuum cleaner to it, then compressed air (sucked out the worst, blew out the rest). Haven't had a crash since (knock on wood).

mrmazda · 12-29-2021, 09:01 AM

Quote:

Originally Posted by ondoho

Adding my 2ct's worth of a wild guess...

Nothing wild about it. Accumulated dust is a heat insulator, prevents optimal cooling, or enough cooling. If the environment is a smoker's, accumulated tar helps the dust to accumulate, insulate, and stick. I have an Asus A88X motherboard we retired at about age 5 due to tar dust preventing booting, and simple blowout with compressed air didn't help. A couple of years later I got ambitious, washed the board with soap, water and contact cleaner, heated it to 200F in the oven, and it was resurrected. Now it's back in service.

dugan · 12-29-2021, 09:51 AM

Start by unplugging or changing the USB devices. I'm serious.

It could very well be blocking on IO. If it seems to happen during disk access, then suspect the disk. If it seems to happen during network access, then suspect a network device.

I recently dealt with this; I was getting hard-locks with nothing in smartctl or journalctl. They seemed to happen during web browsing (e.g. when writing Reddit posts). I replaced the wifi dongle and the USB hub, and things have been fine since.

kilgoretrout · 12-29-2021, 10:52 AM

In my experience, that type of random issue is usually either ram or psu related. Since your ram passed memtest, your psu is the most likely candidate. A wonky psu can randomly go out of spec and cause all types of problems on a random time schedule. The only sure fire way of diagnosing is to replace the psu with a known good working psu. Reading the voltages with a meter won't work since everything can be good one moment and go out of spec the next.
As uteck mentioned, bad capacitors could be an issue as well but they tend to fail a lot sooner than 10 years. I would definitely check for it however.

h2-1 · 12-31-2021, 12:43 AM

The inxi error is actually a bug in inxi, lol, at first I thought it wasn't, but it will trigger in cases where you use -a and inxi alone, a combination that I would never have tried using, which is why it didn't ever manifest. That error is corrected and will be in inxi 3.3.12 (in pinxi now). To trip that required having both md raid present and using inxi -a, which has no actual purpose, but it is supported so the error shouldn't have happened, thanks for doing that unlikely sequence of actions. The cause was trying to use a test that had not been set for inxi short form, but should have been. Never noticed it myself since I never use inxi -a, since it has no meaning, and I don't use mdraid, but luckily I have access to a system that does, and it showed the same error so it was easy to figure out and fix.

I would suggest doing a full offline disk scan to start. I've almost never seen memory fail memtest unless it was literally blowing up and suffering catastrophic failures, but it's good to check it anyway because it's easy to do. I don't think I've ever seen a mobo test show anything meaningful. Checking mobos requires opening the system up physically and looking for failing capacitors (tops swell up forming a dome), which in some cases are easy to spot, other cases, not so much.

Capacitor failure seems to be much more a function of quality of original than time, but they do contain fluids, and those fluids do eventually go away, at which point the capacitor is no longer capable of being a capacitor, at which point unless you are very good with microsoldering, it's time to chuck the board and call it a day. I generally look for what they call 'long life capacitors', which are built differently, not those cylinders with the cross top that bursts up and swells right before the board dies on boards that used those.

The PSU advice is solid, I suspect it depends largely on the brand, I've almost never had any issues with higher end psus like antec, they seem to be largely indestructible, I have to replace them because their board connector types are obsolete long before I have to replace them due to failure. But even they had one bad production run, as suggested, bad capacitors were the culprit, which was the only time I ever saw one of those die, which it did, in a streaking line of flame running down the power cables to the board and drives, lol...

Assuming you are doing the basics, and have the computer attached to a battery backup unit, not the wall directly, which can by the way also start to roast your system after a while, this sounds like hardware failure to me. Now, if you do not have it behind a real battery backup unit that can kick in for at least a few seconds, ideally a few minutes, it's possible you have intermittent power issues, but if you are using a battery backup ups then that's not an issue, except in one very corner case, where something in the house is pulling the power to just above the trip point of the ups, but below what the board can actually deal with. I've had two client machines fail inexplicably and in very short time, and both instances were directly caused by something heavy duty pulling power out of the circuit. I noticed it in one case because the lights in the office flickered/dimmed ever so slightly when a heavy duty garbage disposal unit was used, and in the other, it turned out the client's office was next to a heavy industrial place that had some very high amperage equipment that would drop the line voltages enough to roast the board.

Best course, if you have an extra psu, try replacing it, but if you don't, don't sweat it, it's unlikely, possible, but unlikely. More likely, and this depends on what your board actually is, is premature capacitor death syndrome (PCDS). I haven't experienced that since only buying boards that use the long life capacitors though. If you don't have a ups, your hardware could be getting killed by subtle voltage surges and irregularities.

If you have disk damage, which a full disk scan will show, that will also cause these crashes. Note that it's a rare event for hardware crashes to be logged unless it's a peripheral device that is failing, those will often get logged since the core os and hardware is able to run, and thus, able to log. Disk failures can result in errors in the logs, unless those failures are causing the system to die instantly, at which point of course the logs will show nothing. But if you do a disk scan, and you get errors, and fix them, consider that disk dying but still able to be backed up, errors don't actually get 'fixed', they get bypassed on the drive, and if you have several errors, that's a bad sign and suggests the disk is dying.

USB is also a good suggestion, those types of peripherals can and have destabilized the system, I had those types of issues with failing usb hubs for example, and failing usb controller cards, both cases significantly destabilized the os.

Note that: sudo inxi -da
will show a full disk report, assuming you have smartctl installed, and of course, if you see any smart failures there, the issue is almost certainly your disks, which is good, because besides the psu and ram, those are the easiest to replace.
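
To query smartctl directly rather than through inxi, a sketch along these lines might do. The helper name smart_health_ok and the device /dev/sda are assumptions for illustration, and the patterns only cover the health lines smartctl -H prints for ATA/NVMe and SCSI drives.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -H` output on stdin and succeed only
# if the drive reports a passing overall health assessment.
smart_health_ok() {
    grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

# Usage sketch (device name is an assumption; adjust to your system):
# sudo smartctl -H /dev/sda | smart_health_ok && echo healthy || echo "check this disk"
# sudo smartctl -t long /dev/sda       # start an extended self-test
# sudo smartctl -l selftest /dev/sda   # read the result once it finishes
```

A passing health line does not rule out a dying disk, so the extended self-test and the attribute table from smartctl -a are still worth reading.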

mrmazda · 12-31-2021, 01:16 AM

Quote:

Originally Posted by h2-1

I generally look for what they call 'long life capacitors', which are built differently, not those cylinders with the cross top that bursts up and swells right before the board dies on boards that used those.

That "cross" is a safety feature found in some fashion or other on all top-brand, long life electrolytic capacitors, e.g. Panasonic, UCC, Rubycon, Nichicon, as well as typically on the counterfeits and cheapies. When swelling occurs, it can press against the weakness the cross comprises, so that pressure is released slowly at the top instead of bursting out the bottom or sides, which can more effectively contaminate the board on which it's soldered.

PC motherboards 10 or more years ago started switching the most important (mostly larger, often all) caps from electrolytics to polys, which by their more expensive construction aren't subject to leakage or explosion, have longer lives, and for any given specification set, are smaller.

h2-1 · 12-31-2021, 01:25 AM

Oh, thanks for clarifying, I knew they had switched to something, the new capacitors look very different, I hadn't looked closely enough at their tops to see, but they definitely did not have that typical cylinder with deep scribed expansion cross, but I never actually took a closer look at the new kinds. The new caps are way better, that's for sure, and if I open up a case and see a mobo with the old caps, I'm more inclined to just chuck it than spend time on it, and of course, if they are swelling, I will definitely toss it since it will be dying very soon, if it runs at all anymore. I took a look just now, the new caps don't have that cross on the top, they are flat surfaced, but look much more solid and substantial than the old kind. Might be worth the OP's time to open his case up and just look at the caps, on the old ones, if they are failing, it's very visible, since that top starts to swell up along the stress relief lines of the x on top, but the new ones look very smooth and flat. Any swollen old type caps and it's time to chuck the board.

Particularly check out the caps clustered around the cpu, if my memory serves, those are the ones that tended to die first, but check all of them since any one can be the cause.