Hardware diagnosis and common symptoms of hardware failure

TobiSGD · 05-11-2013, 09:23 PM

This article is an adaptation of the Slackware Documentation Project article on the same topic, originally written by fellow member H_TeXMeX_H.
The article is published under the CC Attribution-Share Alike 3.0 Unported license.

If you have any questions, suggestions or want to discuss the contents of this article please post in this thread.

Warning: DISCLAIMER: We are not liable for any damage caused to your system by reading this howto, running diagnostic programs, or implementing fixes. This howto is for informational purposes only.

Before working inside the case you should:

1. Power off your computer, turn off the PSU, and unplug it.
2. Wear an antistatic wrist strap.
3. Ground yourself by first touching the PSU.
4. Be careful not to damage motherboard components with sharp, hard tools.
5. Don't use force to remove components and check the manual for how to remove them.
6. Don't do anything you are not comfortable with.
7. If you don't know what you are doing, get an expert to do it.
8. Don't have water or other conductive liquids near the computer or work area.
9. Don't leave any metallic or conducting objects inside the case as they may short-circuit components.

Common Symptoms

These are only the common symptoms of each component failure, and they are rarely clear enough to diagnose a hardware error right away. Run the diagnostic software to confirm your suspicions. Sometimes more than one component fails at once, which is harder to diagnose; in that case you may need to take the machine to a shop.

Power-on self-test

Quote:

Power-On Self-Test (POST) refers to routines which run immediately after many digital electronic devices are powered on. Perhaps the most widely known usage pertains to computing devices (personal computers, PDAs, networking devices such as routers, switches, intrusion detection systems and other monitoring devices). Other devices include kitchen appliances, avionics, medical equipment, laboratory test equipment—all embedded devices. The routines are part of a device's pre-boot sequence. Once POST completes successfully, bootstrap loader code is invoked.

POST includes routines to set an initial value for internal and output signals and to execute internal tests, as determined by the device manufacturer. These initial conditions are also referred to as the device's state. They may be stored in firmware or included as hardware, either as part of the design itself, or they may be part of semiconductor substrate either by virtue of being part of a device mask, or after being burned into a device such as a programmable logic array (PLA).

Test results may either be displayed on a panel that is part of the device, or output via bus to an external device. They may also be stored internally, or may exist only until the next power-down. In some cases, such as in aircraft and automobiles, only the fact that a failure occurred may be displayed (either visibly or to an on-board computer) but may also upload detail about the failure(s) when a diagnostic tool is connected.

POST protects the bootstrapped code from being interrupted by faulty hardware. Diagnostic information provided by a device, for example when connected to an engine analyzer, depends on the proper function of the device's internal components. In these cases, if the device is not capable of providing accurate information—which ensures that the device is safe to run—subsequent code (such as bootstrapping code) may not be permitted to run.

Beep Codes

Progress and error reporting

Quote:

The original IBM BIOS made POST diagnostic information available by outputting a number to I/O port 80 (a screen display was not possible with some failure modes). Both progress indication and error codes were generated; in the case of a failure which did not generate a code, the code of the last successful operation was available to aid in diagnosing the problem. Using a logic analyzer or a dedicated POST card, an interface card that shows port 80 output on a small display, a technician could determine the origin of the problem. Once an operating system is running on the computer the code displayed by such a board may become meaningless, since some OSes, e.g. Linux, use port 80 for I/O timing operations. The actual numeric codes for the possible stages and error conditions differ from one BIOS supplier to another. Codes for different BIOS versions from a single supplier may also vary, although many codes remain unchanged in different versions.

Later BIOSes used a sequence of beeps from the motherboard-attached loudspeaker (if present and working) to signal error codes. Some vendors developed proprietary variants or enhancements, such as MSI's D-Bracket. POST beep codes vary from manufacturer to manufacturer.

Information on numeric and beep codes is available from manufacturers of BIOSes and motherboards. There are websites which collect codes for many BIOSes.[1]

Original IBM POST beep codes

POST AMI BIOS beep codes

IBM POST diagnostic code descriptions

Central BIOS Beep codes for most manufactures & BIOS upgrade downloads

RAM

The computer typically does not POST or boot properly, but it will try.
Each boot typically causes different symptoms, i.e. the boot will halt in a different place each time.
The fans are typically running, often at 100%.
There may be beep codes, so check your motherboard manual or the beep code references listed above to see what they mean.
Randomly occurring kernel panics and segmentation faults in running programs.

PSU

Sudden shutdown or reboot or hang without warning or logs of what went wrong.
Sudden shutdown or reboot or hang may occur during high system load or power usage or even when idle.
Sudden system hangs that cannot be recovered with the SysRq REISUB key sequence, and during which you cannot SSH into the system. Audio that is playing may loop continuously during the hang.
You may notice a strange smell coming directly from the PSU as it overheats.
The system may not boot after pressing the power button, and you may need to press it more than once.
Rarely, it dies completely, for example after a power surge: the fans will not run, the system will not POST or boot, and there may be motherboard damage.

CMOS battery

The time is reset on each boot.
The BIOS settings reset on boot.

HDD

I/O errors in the logs.
Filesystem corruption.
May POST but may not finish booting properly.
Disk access slows down right before failure.
Strange noises such as clicks, grinding, and spinning up noises.
The BIOS will not detect a completely dead disk.
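
The I/O errors mentioned in the list above can be tallied per device from the kernel log. This is only a minimal sketch: it assumes kernel messages of the form "I/O error, dev sda, sector N" (the exact wording differs between kernel versions), and count_io_errors is a made-up helper name, not part of any tool.

Code:
```shell
# count_io_errors: read kernel log text on stdin and print
# "<device> <count>" for every device that logged an I/O error.
# Assumes messages containing "I/O error, dev <name>,".
count_io_errors() {
    awk '
        /I\/O error, dev / {
            line = $0
            sub(/.*I\/O error, dev /, "", line)   # drop everything before the device name
            sub(/[, ].*/, "", line)               # drop everything after it
            count[line]++
        }
        END { for (dev in count) print dev, count[dev] }
    ' | sort
}

# Typical use (reading the kernel log usually needs root):
#   dmesg | count_io_errors
```

A device that shows up here repeatedly is a good candidate for the smartctl tests described in the Diagnostics section.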

GPU

Graphical glitches on screen.
If it is really dead the screen will be black.
May cause system hangs, and audio may loop during the hang.
There may be messages in the logs relating to the video drivers or an Xorg crash or nothing at all.

CPU

May or may not POST.
Kernel panics are possible with multi-core machines, when only one core is affected.
If beep codes are heard, check your motherboard manual or the beep code references listed above.
The fans are typically running at 100%.
If it is overheating, it may trigger an MCE (machine-check exception), which causes a kernel panic and forced shutdown, or it may throttle itself down so that the system seems slower.

Motherboard

Check for swollen capacitors.
A dead motherboard does not POST.
The fans connected to the motherboard may not be running, while the ones connected directly to the PSU may be running at 100%.
You may notice a high-pitched (squealing/screeching) noise coming from it, due to failing capacitors.

CD/DVD drives

Won't read or write disks properly.
May keep spinning up and down forever while trying to read a disk.
May not open when you press the button, but instead make some clunking sounds and open only after many button presses.

Diagnostics

Check the cables, connectors, and pins to make sure they are not damaged or corroded. Oxidation may build up on the DIMM contacts and may need to be removed. Do this carefully, using only 99% isopropyl alcohol and a soft cotton swab. Avoid hard objects inside the case, as they may break off pins or damage components. If you don't feel comfortable doing this, get a professional to do it.

RAM

Run one of the following programs. Errors are typically found during the first pass, but run more passes in proportion to your suspicion.
- memtest86+
- memtester

These RAM testing programs also exercise the CPU, so if the DIMMs are known to be good in other systems, the CPU may be the cause.

PSU

There is no specific test for the PSU, but you should monitor its voltages. You can usually do this in the BIOS, which tends to give the most accurate readings. Make sure every rail is within tolerance: the ATX specification allows ±5% on the +3.3V, +5V, and +12V rails (3.135 to 3.465V, 4.75 to 5.25V, and 11.4 to 12.6V respectively). A healthy PSU will usually read at or slightly above the nominal value, and the +12V rail matters most. Note that this does NOT mean that you should raise a voltage that reads low; regulation is done automatically by the PSU.

Normally, each voltage is stable around a certain value, comfortably clear of the lower limit. If the PSU is failing, the voltages fluctuate quickly and sit at or below the nominal value. For example, a good PSU might show a stable 3.35V on the +3.3V line, while a bad PSU shows a reading that jumps between 3.30V and 3.32V. Planning ahead, when you first get a new PSU, write down the voltages, keep them for reference, and monitor them over time.

If there is no option in the BIOS to monitor the voltages, you can use a voltmeter or multimeter to measure them directly on the connectors. The pinout for the connectors can be found here: ATX 20-pin, ATX 24-pin. Alternatively, dedicated PSU testers are available in electronics shops.
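
Checking a reading against its rail can be scripted. The following is a minimal sketch, assuming the ±5% limits that the ATX specification gives for the +3.3V, +5V, and +12V rails; rail_status is a hypothetical helper name, and the readings passed to it are only examples.

Code:
```shell
# rail_status NOMINAL MEASURED
# Prints "ok" if MEASURED is within +/-5% of NOMINAL, otherwise "low" or "high".
# The 5% figure is the assumed ATX tolerance for the +3.3V, +5V and +12V rails.
rail_status() {
    awk -v nom="$1" -v got="$2" 'BEGIN {
        lo = nom * 0.95; hi = nom * 1.05
        if (got < lo)      print "low"
        else if (got > hi) print "high"
        else               print "ok"
    }'
}

rail_status 3.3 3.35    # a healthy +3.3V reading
rail_status 12 11.2     # below the 11.4V lower limit of the +12V rail
```

A rail that is still reported as ok but has drifted noticeably from the values you wrote down when the PSU was new deserves the same attention as one that is out of tolerance.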
In theory, a failing PSU shuts down, reboots, or hangs under increased power usage because it cannot provide the needed current and overheats as a result. The best way to diagnose it is therefore probably to monitor the voltages under load: install lm_sensors, configure it with sensors-detect, and use a monitoring program to watch the voltages and warn when they drop below a given limit. Then load the system with a CPU- or GPU-intensive application and wait for the warnings or for a sudden shutdown, reboot, or hang. Bear in mind that a shutdown, reboot, or hang can also occur when idle, and that the voltage drop may be too fast for any alarm to catch.
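
If you log the readings of one rail while the system is under load, the fluctuation described above is easy to summarize. This is a rough sketch, assuming you already have one reading per line (how you collect them depends on your sensor chip and monitoring program); rail_spread is a hypothetical helper name.

Code:
```shell
# rail_spread: read one voltage reading per line on stdin and print
# "min max spread", so a jittery rail stands out from a stable one.
rail_spread() {
    awk '
        NF == 0 { next }                 # skip blank lines
        n == 0  { min = max = $1 + 0 }   # first reading initializes both
        { v = $1 + 0; if (v < min) min = v; if (v > max) max = v; n++ }
        END { if (n) printf "%.2f %.2f %.2f\n", min, max, max - min }
    '
}

# Readings like the failing +3.3V example in the text:
printf '%s\n' 3.30 3.32 3.31 3.30 | rail_spread
```

Comparing the spread at idle with the spread under load, and with the reference values noted when the PSU was new, gives a more useful picture than a single reading.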

You should ALWAYS use a surge protector on ALL electronic devices ALL the time. This prevents damage to the PSU, to the motherboard, and it saves you lots of time and money wasted on replacing electronics damaged by power surges. Many surge protectors come with a warranty that will refund a certain amount of money if your equipment is damaged while using the surge protector properly. The surge protector is cheap, the equipment is expensive, and the refund usually more than covers it.

If you value your investment in sensitive electronic equipment, use some form of power line protection, filtering, or conditioning. An inline surge protector is a cheap option, but most cheap units are single-use. Please look at 'Surge Protector' for helpful information:

Quote:

A surge protector (or surge suppressor) is an appliance designed to protect electrical devices from voltage spikes. A surge protector attempts to limit the voltage supplied to an electric device by either blocking or by shorting to ground any unwanted voltages above a safe threshold. This article primarily discusses specifications and components relevant to the type of protector that diverts (shorts) a voltage spike to ground; however, there is some coverage of other methods.

The terms surge protection device (SPD), or the obsolescent term transient voltage surge suppressor (TVSS), are used to describe electrical devices typically installed in power distribution panels, process control systems, communications systems, and other heavy-duty industrial systems, for the purpose of protecting against electrical surges and spikes, including those caused by lightning. Scaled-down versions of these devices are sometimes installed in residential service entrance electrical panels, to protect equipment in a household from similar hazards.[1]

Many power strips have basic surge protection built in; these are typically clearly labeled as such. However, power strips that do not provide surge protection are sometimes erroneously referred to as "surge protectors".

Another option is an uninterruptible power supply (UPS), which provides filtering and line conditioning for sensitive electronic equipment. Cost is usually a factor when choosing a UPS. Please consider looking at UPS for some useful information:

Quote:

An uninterruptible power supply, also uninterruptible power source, UPS or battery/flywheel backup, is an electrical apparatus that provides emergency power to a load when the input power source, typically mains power, fails. A UPS differs from an auxiliary or emergency power system or standby generator in that it will provide near-instantaneous protection from input power interruptions, by supplying energy stored in batteries or a flywheel. The on-battery run time of most uninterruptible power sources is relatively short (only a few minutes) but sufficient to start a standby power source or properly shut down the protected equipment.

A UPS is typically used to protect computers, data centers, telecommunication equipment or other electrical equipment where an unexpected power disruption could cause injuries, fatalities, serious business disruption or data loss. UPS units range in size from units designed to protect a single computer without a video monitor (around 200 VA rating) to large units powering entire data centers or buildings. The world's largest UPS, the 46-megawatt, Battery Electric Storage System (BESS), in Fairbanks, AK, powers the entire city and nearby rural communities during outages.[1]

CMOS battery

You could take the battery out, making sure to use the release tab rather than trying to pry it out, and measure its voltage. Alternatively, just replace it to be sure and dispose of the old one in a proper battery disposal container. This is often the more practical choice, because you may have had to remove a graphics card or PCI/PCIe card to reach the battery, and you may not have a battery tester for this type of battery.

HDD

You can either run a smartctl long test, which tests the entire disk surface for errors and updates the offline attributes, or run the manufacturer's own proprietary utility. smartctl will also show the HDD temperature and airflow temperature. Make sure the temperatures are below 60 °C.
- smartmontools including smartctl (many distributions have this installed by default)
- SystemRescueCD including smartctl, ddrescue, TestDisk and foremost.
- Ultimate Boot CD including lots of FLOSS diagnostics and recovery utilities as well as the proprietary manufacturers' utilities.
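
The 60 °C check can be automated against the attribute table that smartctl prints. This is a sketch under assumptions: it expects the usual ten-column table from smartctl -A, treats attributes 190 and 194 as the airflow and drive temperatures, and assumes the raw value starts with the temperature in degrees Celsius, which is common but vendor-specific. hot_attrs is a hypothetical helper name.

Code:
```shell
# hot_attrs LIMIT: read `smartctl -A` output on stdin and print
# "<attribute name> <temperature>" for every temperature attribute
# (IDs 190 and 194) whose raw value is above LIMIT degrees Celsius.
# Assumes the raw value (column 10) begins with the temperature.
hot_attrs() {
    awk -v limit="$1" '
        ($1 == 190 || $1 == 194) && NF >= 10 && $10 + 0 > limit + 0 {
            print $2, $10 + 0
        }
    '
}

# Typical use (needs root and a real device name):
#   smartctl -A /dev/sda | hot_attrs 60
```

No output means that no temperature attribute exceeded the limit, or that the drive does not report temperatures in this form.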

ALWAYS back up your data regularly, and run smartctl regularly or use smartd. If you feel that the HDD is dying, don't bother running the utilities first; back up your important data immediately. If the entire disk is full of important data, consider imaging the entire drive to another HDD in case it fails before you can get your data off it. You can then run data carving utilities to carve your data out of the image. Again, there is no substitute for backing up your data: your HDD can FAIL at ANY time WITHOUT WARNING from SMART or any diagnostics program.

You can run a SMART long test by running the following, replacing /dev/sd? with the device name of your disk (for example /dev/sda):

Code:

smartctl -t long /dev/sd?

You then have to wait for the estimated time, plus a few more minutes, for the result, which you can check with

Code:

smartctl -a /dev/sd?

The attributes are listed, but you can check them separately with

Code:

smartctl -A /dev/sd?

Here is an important note on attributes from man smartctl

Code:

Each Attribute also has a Threshold value (whose range is 0 to 255) which is printed under the heading "THRESH". If the Normalized value is less than or equal to the Threshold value, then the Attribute is said to have failed. If the Attribute is a pre-failure Attribute, then disk failure is imminent.

The Attribute table printed out by smartctl also shows the "TYPE" of the Attribute. Attributes are one of two possible types: Pre-failure or Old age. Pre-failure Attributes are ones which, if less than or equal to their threshold values, indicate pending disk failure. Old age, or usage Attributes, are ones which indicate end-of-product life from old-age or normal aging and wearout, if the Attribute value is less than or equal to the threshold. Please note: the fact that an Attribute is of type 'Pre-fail' does not mean that your disk is about to fail! It only has this meaning if the Attribute's current Normalized value is less than or equal to the threshold value.
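
The rule quoted above (a normalized value less than or equal to the threshold means the attribute has failed) can be applied mechanically to the attribute table. This is a minimal sketch, assuming the usual column order of smartctl -A output (ID, name, flag, VALUE, WORST, THRESH, TYPE, ...); failed_attrs is a hypothetical helper name.

Code:
```shell
# failed_attrs: read `smartctl -A` output on stdin and print
# "<attribute name> <type>" for each attribute whose normalized VALUE
# (column 4) is less than or equal to its THRESH (column 6).
failed_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $4 + 0 <= $6 + 0 { print $2, $7 }'
}

# Typical use (needs root and a real device name):
#   smartctl -A /dev/sda | failed_attrs
```

Any line of type Pre-fail in the output means the drive itself considers failure imminent, so back up before doing anything else. An empty result is not a guarantee of health, for the reasons given earlier.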

If you have a laptop/netbook and you hear clicks from the HDD once in a while, the power saving feature may be spinning down the drive. This saves power, but can quickly wear down the drive. You can turn it off by running the following on every boot, for example by adding it to /etc/rc.d/rc.local:
Code:
hdparm -B 254 /dev/sd?

GPU

Video Memory Stress Test is available on SourceForge and on the UBCD. Its limitation is that it cannot always detect the amount of video RAM properly, and the DOS version cannot recognize more than 512 MB of video RAM. From experience, it works well with integrated Intel graphics, and not very well with Nvidia or ATI cards.
CUDA GPU memtest requires either a video card that supports CUDA, such as an Nvidia card with the Nvidia development drivers and CUDA installed, or a video card that supports OpenCL, which can be an Nvidia or ATI card with OpenCL installed and supported by the drivers. The test is comprehensive and the authors claim it can detect soft and hard errors. From experience, it may not detect hard errors.
If neither of the above works, you can run a video game in benchmark mode and watch for graphical glitches or system crashes. The problem with this method is that there is no way to tell whether the drivers or the card are at fault, unless you have prior experience with the game and the glitches or crashes are new.

Sadly, none of the GPU tests I have tried are reliable in detecting hardware errors.

CPU

Although memtest86+ tests the CPU as well as RAM, there is a more specific test:
- The Great Internet Mersenne Prime Search client (Prime95, or mprime on Linux) is very sensitive to CPU errors in torture test mode 1 (Small FFTs) and a bit less so in mode 2 (Large FFTs), and it will report any errors that occur. It is a great way to differentiate between a RAM error and a CPU error. Let it run until the CPU temperature is stable, and then for as long as you like in proportion to your suspicion. Errors are typically found rather quickly, so you don't have to wait too long.

To check the CPU temperature, make sure you have lm_sensors installed. Configure it by running
Code:
sensors-detect
and then copy the modprobe commands it suggests to /etc/rc.d/rc.local, make that file executable, and run it. To check temperatures, run
Code:
sensors
or use a monitor of your choice. sensors will also list the critical temperatures, but the most accurate temperatures are shown in the BIOS, so compare the two.
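
To keep an eye on temperatures while a stress test runs, the machine-readable output of sensors -u is easier to filter than the normal display. This is a sketch that assumes the usual -u layout, with an unindented label line such as "Core 0:" followed by indented "tempN_input: value" lines. hot_sensors is a hypothetical helper name, and the limit of 80 in the usage line is only an illustration; use the critical temperature that sensors reports for your CPU.

Code:
```shell
# hot_sensors LIMIT: read `sensors -u` output on stdin and print
# "<label>: <temperature>" for each temperature input above LIMIT (deg C).
hot_sensors() {
    awk -v limit="$1" '
        # Unindented lines ending in ":" are sensor labels, e.g. "Core 0:"
        /^[^ ].*:$/ { label = $0; sub(/:$/, "", label); next }
        $1 ~ /^temp[0-9]+_input:$/ && $2 + 0 > limit + 0 {
            printf "%s: %.1f\n", label, $2
        }
    '
}

# Typical use:
#   sensors -u | hot_sensors 80
```

Running this every few seconds during the stress test shows whether the CPU is heading toward the range where it throttles or triggers an MCE.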

Motherboard

Sadly, there is no reliable software test for motherboard errors. The diagnosis is mostly a process of elimination. You can also take it to a shop and have them test the motherboard.
A high-pitched squealing or screeching noise that you can pinpoint to the motherboard is a reliable sign of bad capacitors, a common problem with motherboards.

CD/DVD drive

Burn an ISO image to a disc and run:
Code:
```
cmp input.iso /dev/sr0
```
It should say something like this (newer versions of cmp also append the byte position):
Code:
```
cmp: EOF on input.iso
```
If it says anything else, the CD/DVD was not burned properly. This is not necessarily the drive's fault; it may also be caused by bad media or too high a burn speed. If the drive keeps spinning up and down while reading a disk, it could be failing, or you could be trying to play a commercial DVD whose region code is not supported by the drive, which limits it to 1x read speed.
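
The EOF message appears because the disc is usually read back with padding after the end of the image. A variant that compares only as many bytes as the image contains gives a plain yes/no answer through its exit status. This is a minimal sketch; verify_burn is a hypothetical helper name, and it relies on head -c, which is widely available but not required by POSIX.

Code:
```shell
# verify_burn IMAGE DEVICE
# Compare only the first <size of IMAGE> bytes of DEVICE against IMAGE,
# so trailing padding on the disc does not count as a difference.
# Returns 0 on a match, non-zero otherwise.
verify_burn() {
    size=$(wc -c < "$1") || return 2
    size=${size##* }                 # strip padding some wc versions add
    head -c "$size" "$2" | cmp -s - "$1"
}

# Typical use:
#   verify_burn input.iso /dev/sr0 && echo "burn OK"
```

As with the plain cmp test, a mismatch can mean a bad burn, bad media, or a failing drive, so repeat it with different media before blaming the drive.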

Solutions

An overheating system can cause instability and may mimic failing hardware. Before doing anything else it may be worthwhile to get a compressed air canister and clean the dust out from all the fans, heat sinks, and all the hard to reach places where dust accumulates inside the case. Using a brush is not as effective and may damage components, so it should only be used outside the case. It is also important to make sure there is proper airflow inside the case:

At the bare minimum you should have a large 120mm output fan in the rear of the case for an ATX motherboard. For smaller systems it varies, and only a smaller fan may fit.
Make sure that cables inside the case do not obstruct airflow, and if they do then use plastic cable ties to fix it if possible. Do not leave any stray metal inside the case that could short-circuit components.
Make sure fans are placed so that they cause air movement across hot components, or at the very least evacuate hot air from the case.

RAM
Remove the DIMMs/RAM sticks one at a time and rerun memtest86+ after each removal. If the errors disappear, the stick you just removed is the faulty one and should be replaced. If they persist, put it back, remove another, and run the test again. Rarely, more than one DIMM may be failing, so take that into account. It could also be the CPU if all your DIMMs fail the test.

If you cannot replace the RAM soon, for example because you have an ancient computer and can't find RAM for it, you can use the mem= kernel option to force the kernel to use only good RAM. Say you have an error at 129.0 MB, like I did recently; you could use this kernel boot option:

Code:

mem=128M

and it will only use the first 128 MB of RAM, omitting the bad part. There is also an option to exclude a section of RAM, but it is less tested and may not work.
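
The arithmetic behind the option can be written down as follows. This is a sketch under assumptions: it takes the failing address in MB as reported by memtest86+ and leaves a margin of about 1 MB below it, which is an arbitrary choice. The second helper prints the kernel's memmap= option in its size$start form, which marks a region as reserved; that is probably the "exclude a section" option mentioned above, but this is not confirmed by the original author. Both helper names are made up.

Code:
```shell
# mem_limit ERROR_MB: print a mem= option that stops about 1 MB
# (rounded down to a whole MB) below the failing address.
mem_limit() {
    awk -v a="$1" 'BEGIN {
        m = int(a) - 1
        if (m < 1) exit 1            # address too low for this approach
        printf "mem=%dM\n", m
    }'
}

# memmap_exclude ERROR_MB: print a memmap= option that reserves the
# 2 MB around the failing address instead of dropping all RAM above it.
memmap_exclude() {
    awk -v a="$1" 'BEGIN {
        s = int(a) - 1
        if (s < 1) exit 1
        printf "memmap=2M$%dM\n", s
    }'
}

mem_limit 129.0        # prints mem=128M
memmap_exclude 129.0   # prints memmap=2M$128M
```

Note that the $ in memmap= usually has to be escaped in boot loader configuration files, and that a single reported address may not be the only bad one, so rerun memtest86+ after applying either option.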

PSU
Replace the PSU with a new one. If the symptoms disappear, you can be sure it was the PSU. Note, however, that a bad PSU may have damaged the motherboard or other components, or a power surge may have damaged everything at once.

CMOS battery
Replace the battery carefully using the special tab. Do NOT pry it off using a screwdriver because it may break and it won't go back in.

HDD
You should first get your important data off of it. If you feel it is failing fast, use ddrescue to image the drive to another drive. Once you have the image, be it complete or incomplete, you can use TestDisk and/or foremost to carve data out of the image. Carved files will not have their original file names, but at least you will get the data. You can find all these utilities and more on the SystemRescueCD. Now just replace the drive.
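
A common way to use ddrescue on a failing drive is in two passes: a quick one that grabs everything that reads easily, then one that retries the damaged areas. The sketch below only prints the commands so that you can check the device names before running anything. rescue_plan is a hypothetical helper name, and /dev/sdX, disk.img, and disk.map are placeholders. It uses ddrescue's -n option (skip the scraping phase) and -r3 (three retry passes).

Code:
```shell
# rescue_plan SOURCE IMAGE MAPFILE
# Print (not run) a two-pass ddrescue plan. The mapfile lets the second
# pass resume where the first one stopped instead of starting over.
rescue_plan() {
    printf 'ddrescue -n %s %s %s\n'  "$1" "$2" "$3"   # pass 1: easy areas only
    printf 'ddrescue -r3 %s %s %s\n' "$1" "$2" "$3"   # pass 2: retry bad sectors
}

rescue_plan /dev/sdX disk.img disk.map
```

Write the image to a different, healthy disk with enough free space, and point the carving tools at the image rather than at the failing drive.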

Warning: There are programs that claim to correct bad blocks. What they actually do is mark the bad blocks so that the drive doesn't use them. The problem is that bad blocks are an indicator of imminent drive failure. You may think that these programs are fixing the problem and put off backing up your data, and then the drive fails and you lose your data.

GPU
Replace the video card. If it is too expensive and you aren't sure it is at fault, which is very likely given how unreliable the GPU tests are, try testing it in another machine, or swap in a spare known-good video card and see if the symptoms persist.

CPU
If you suspect the CPU is overheating, then you can try to remove the heatsink-fan block according to your CPU or heatsink manual, clean off the old thermal paste, apply new thermal paste, or let an expert do it. Otherwise, replace the CPU. Make sure it is actually the CPU and not the RAM that is the problem, as a CPU is very expensive compared to RAM.

Motherboard
Replace the motherboard, and make sure the PSU is not damaged, as a faulty PSU can damage your new motherboard.

CD/DVD drive
Replace the drive.