Hardware Issue (Not Slack)

BAcidEvil · 12-21-2014, 12:11 PM

I have this PC I bought about a year ago (I normally build my own, but whatever, this is what I get) and it had Windows 8 installed by default... At various times (usually during intense gaming, but not always) I would get a "RED" Screen of Death and it would auto-reboot. Sure, I would get various errors and I would research them, but to no avail.

Now, I did swap out the video card, swapped memory sticks/slots and disabled the onboard LAN/wireless. I am fairly confident it is either the MB or something related to it.

I even reinstalled a fresh Win 8 that I had purchased and it still does the same thing.

My original intent was for Linux anyway, so I put my Slack on it. It runs fine, but at random (sometimes after 3 days, sometimes 5 mins after a reboot) the screen and everything else freezes and I have to do a hard reboot. It mostly happens in X, but it has indeed happened in the shell itself without X running at all.
Long story short: does Linux have better resources/tools/scans/logs to tell me what is causing my system (hardware) to fail?

sycamorex · 12-21-2014, 12:24 PM

Try memtest, get the latest BIOS firmware.

I had a similar (?) problem with my laptop (have a read, perhaps you'll find something useful)

http://www.linuxquestions.org/questi...ponent-922215/

BAcidEvil · 12-21-2014, 01:37 PM

Quote:

Originally Posted by sycamorex

Try memtest, get the latest BIOS firmware.

I had a similar (?) problem with my laptop (have a read, perhaps you'll find something useful)

http://www.linuxquestions.org/questi...ponent-922215/

I downloaded Memtest86 v4.3.7, booted it off USB, and it ran completely through the 12 gigs. It came up fine. I assume it is a 99%-accurate type of thing, but I would still have to agree that the memory is fine.

But, that was 1 pass. It is still running; would you let it continue or should I spend time looking at other hardware issues?

BAcidEvil · 12-21-2014, 01:39 PM

BTW, as I said, this is a new video card. The prior one was an ATI and this one is nVidia, so I think it is safe to assume it is not the GPU, BUT could it be the PCIe slot?
Something just tells me it is the MB... Also, I swapped out the HD as well.

sycamorex · 12-21-2014, 01:57 PM

I guess 1 pass *should* be enough. What about the firmware for BIOS?

You can check your mobo with the following command:

Code:

# dmidecode |grep -B 2 Stat
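
For what it's worth, a slightly more targeted variant of the command above might look like the sketch below. `dmidecode -s bios-version` and `dmidecode -s bios-release-date` print the firmware strings you can compare against the vendor's download page. The helper name `flag_bad_states` is made up for illustration, and the assumption that only Warning/Critical/Non-recoverable values are worth flagging is mine. Many OEM boards report fixed values in these fields, so a clean result doesn't prove much.

```shell
# Hypothetical helper: read dmidecode-style text on stdin and print only the
# "... State:" lines whose value is Warning, Critical or Non-recoverable.
# Which values count as "bad" is an assumption made for this sketch.
flag_bad_states() {
    # "|| true" keeps the exit status at 0 when nothing matches
    grep -E 'State:[[:space:]]*(Warning|Critical|Non-recoverable)' || true
}

# Typical use, as root (not executed here):
#   dmidecode -s bios-version
#   dmidecode -s bios-release-date
#   dmidecode -t chassis | flag_bad_states
```

No output from the last command just means the chassis section reports nothing alarming.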

astrogeek · 12-21-2014, 02:03 PM

Quote:

Originally Posted by BAcidEvil

I downloaded Memtest86 v4.3.7, booted it off USB, and it ran completely through the 12 gigs. It came up fine. I assume it is a 99%-accurate type of thing, but I would still have to agree that the memory is fine.

But, that was 1 pass. It is still running; would you let it continue or should I spend time looking at other hardware issues?

Well, since your system fails anywhere from 5 mins to 3 days, memtest should run for a similar period. A single pass only rules out hard failures.

My experience with memtest is that it is 100% on detecting failures, less for implied success. By that I mean that when it detects a failure, there really is a problem, but when it detects no error during an extended run it means only that there were no errors during that run (just like the real world!). Typically I will let it run for at least 24 hours (full local ambient thermal conditions) and when trying to spot highly transient problems as you describe, 2-3 days before I feel more confident.

The best test for suspect memory is always to swap it out if possible.

BAcidEvil · 12-21-2014, 02:19 PM

Well as much as I want to play on my box, I will let it run at least until Monday after work so that will be like 30-40 more hours...
As far as swapping memory, the only thing I did was take 1 stick out at a time, which left 3 others, and ran until it crashed, then swapped one stick and moved to a different slot. I had some sort of method (I was drunk), but with each stick having had its own time out, it still crashed. That is not to say two sticks are corrupt.

After I run Memtest I will verify my BIOS, but I am pretty sure I did indeed update that.

astrogeek · 12-21-2014, 02:32 PM

Quote:

Originally Posted by BAcidEvil

As far as swapping memory, the only thing I did was take 1 stick out at a time, which left 3 others, and ran until it crashed, then swapped one stick and moved to a different slot. I had some sort of method (I was drunk), but with each stick having had its own time out, it still crashed. That is not to say two sticks are corrupt.

I just returned to this thread to make that very suggestion, although I would recommend doing it sober!

If you rotated them through with one always out of the machine and it still followed the same crash pattern, then that is a strong indication that it may not be memory. You might try it with different sets of two cards to make it an even better indicator.

linuxtinker · 12-21-2014, 10:33 PM

Might want to take a look at the PSU as well. If ya have a spare one around, swap it out.

ReaperX7 · 12-22-2014, 02:33 AM

Motherboards can usually be tested by a PCI/PCIe diagnostic card. It's a good idea to have one of these if you experience problems often.

You should also check whether the power supply is sufficient for the hardware in use. PCs often ship with a bare-minimum power supply sized for the OEM specification. If you've added any hardware, it could be a problem if not enough power is reaching the system components.

pchristy · 12-22-2014, 04:45 AM

I had something similar happening with an old laptop not long ago. It wasn't the memory as such, but WAS poor connections in the memory slots. Thoroughly cleaning the edge connectors on the memory sticks provided a complete cure!

I think the reason for the intermittent nature of the problem was due to heat build up, causing expansion in the edge connectors and leading to intermittent contact.

--
Pete

BAcidEvil · 12-22-2014, 10:22 AM

All very good suggestions;

To add to that, I did buy this PC (HP) from Best Buy last year and it began doing it with the stock components not 2 weeks after I bought it... I know, I know, why on earth did I not return it? If I had the answer I would be rich... Unfortunately, I did not return it and am not rich.
That is not to say it is not a faulty PSU or PCI slot or loose-fitting memory slots... It just adds to the complexity.
Initially my thought was whether Linux had a crash dump file showing what errored out when the PC freezes.

I have a lot of troubleshooting that is for sure.
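
On the crash dump question: a hard freeze often leaves nothing behind, but it can still be worth grepping the system logs for kernel hardware-error messages written before the lock-up. The sketch below is only a starting point. The function name `scan_hw_errors` is made up, and the search patterns are an assumption about what a failing board might log, not a definitive list.

```shell
# Hypothetical helper: print kernel hardware-error lines from a syslog-style file.
# The patterns below are assumptions; adjust them to whatever your kernel logs.
scan_hw_errors() {
    logfile=$1
    [ -r "$logfile" ] || { echo "cannot read $logfile" >&2; return 1; }
    # case-insensitive match; "|| true" keeps exit status 0 when nothing is found
    grep -iE 'machine check|hardware error|mce:|EDAC' "$logfile" || true
}

# Typical use on Slackware, as root (not executed here):
#   scan_hw_errors /var/log/messages
#   scan_hw_errors /var/log/syslog
```

An empty result does not rule hardware out; it only means nothing was written to disk before the freeze.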

enorbet · 12-23-2014, 11:50 PM

Hello
It is my understanding that lm-sensors is capable of logging variations in voltage, temperature, fan speed, etc. over a predetermined block of time, in graphical form, so that you can spot trends, whether as the norm or accompanying issues leading up to a failure. There are some examples here - http://askubuntu.com/questions/41794...peratures-load
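
As a plain-text stand-in for the graphing setups linked above, something like the following could record readings right up to the freeze. It is a sketch only: the function name `log_snapshot`, the log path and the 10-second interval are assumptions, and it presumes the lm-sensors `sensors` command is installed and already configured.

```shell
# Hypothetical helper: append one timestamped snapshot of a command's output
# to a log file, then flush to disk so the last entry survives a hard freeze.
# Usage: log_snapshot LOGFILE COMMAND [ARGS...]
log_snapshot() {
    logfile=$1
    shift
    {
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        "$@"
    } >> "$logfile" 2>&1
    sync
}

# Example loop (not executed here): one `sensors` snapshot every 10 seconds.
#   while true; do log_snapshot /root/sensors.log sensors; sleep 10; done
```

After the next freeze, the tail of the log shows the last temperatures, voltages and fan speeds the machine reported.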

onebuck · 12-26-2014, 10:41 AM

Hi,

You should close the box up since that will/should be the situation for a failure. Your internal temps are different with a side panel removed. Put the system in the same state as when the failure occurs. PSU temps will be different with a side off. Do you have a DVM to check the voltage rails for the system? PSU fan operating? Do you have adequate case ventilation?
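
For anyone taking DVM readings, the usual rule of thumb is that the ATX positive rails (+3.3 V, +5 V, +12 V) should stay within about ±5% of nominal. The check below is only a sketch of that arithmetic. The function name `rail_ok` and the default 5% figure are assumptions for illustration, so verify the tolerance against your PSU's documentation.

```shell
# Hypothetical helper: succeed if MEASURED is within TOLERANCE percent of NOMINAL.
# Usage: rail_ok NOMINAL MEASURED [TOLERANCE_PERCENT]   (default tolerance: 5)
rail_ok() {
    awk -v n="$1" -v m="$2" -v t="${3:-5}" 'BEGIN {
        d = m - n; if (d < 0) d = -d              # absolute deviation
        lim = n * t / 100; if (lim < 0) lim = -lim # allowed deviation
        exit (d <= lim) ? 0 : 1
    }'
}
```

For example, `rail_ok 12 11.8 && echo OK` passes, while a +12 V rail sagging to 11.2 V under load would not.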

If you do show a memtest86+ error, then I would power down and do an edge clean on the sticks and also clean the connectors. Do a search here on LQ, since I have responded with proper techniques for cleaning edges & connectors.
Hope this helps.
Have fun & enjoy!