LinuxQuestions.org
Old 08-13-2016, 12:31 PM   #16
zombieno7
Member
 
Registered: Aug 2012
Posts: 49

Original Poster
Rep: Reputation: Disabled

Oh. Sorry, I was thinking in terms of running a test. I just don't think I can get by on a live distro... I guess I'll just keep testing the HDDs and the RAM. I don't have a voltmeter, so I'm going to have to wait to borrow one. I guess I can also build a new kernel in between too. Hopefully all of that either turns up the problem or resolves it.
 
Old 08-13-2016, 12:39 PM   #17
273
LQ Addict
 
Registered: Dec 2011
Location: UK
Distribution: Debian Sid AMD64, Raspbian Wheezy, various VMs
Posts: 7,585

Rep: Reputation: 2351
Have you at least tried another kernel?
 
Old 08-13-2016, 12:46 PM   #18
zombieno7
Member
 
Registered: Aug 2012
Posts: 49

Original Poster
Rep: Reputation: Disabled
No. Since it happens so irregularly and infrequently, I was hoping that I could figure out the problem through testing, rather than just hoping it didn't happen again.
 
Old 08-13-2016, 12:55 PM   #19
Emerson
LQ Sage
 
Registered: Nov 2004
Location: Saint Amant, Acadiana
Distribution: Gentoo ~arch
Posts: 7,231

Rep: Reputation: Disabled
I'd say it is unlikely a kernel problem hits suddenly after 2 days.
 
Old 08-13-2016, 12:58 PM   #20
273
LQ Addict
 
Registered: Dec 2011
Location: UK
Distribution: Debian Sid AMD64, Raspbian Wheezy, various VMs
Posts: 7,585

Rep: Reputation: 2351
You use a live distro to rule out software. If the live distro has similar, seemingly inexplicable, errors you know it's hardware...
 
Old 08-13-2016, 01:31 PM   #21
zombieno7
Member
 
Registered: Aug 2012
Posts: 49

Original Poster
Rep: Reputation: Disabled
I would use a live distro, but I can't be without this computer for that long. Even if I mount all of the documents that I need, I can't deal with the general slowness of a live CD and lack of programs. I'm going to do the reverse, basically, and test all the hardware first.
 
Old 08-14-2016, 03:09 AM   #22
mrmazda
Senior Member
 
Registered: Aug 2016
Location: USA
Distribution: openSUSE, Debian, Knoppix, Mageia, Fedora, others
Posts: 3,535
Blog Entries: 1

Rep: Reputation: 1182
Quote:
Originally Posted by zombieno7
I guess I'll just keep testing the HDDs and the RAM.
Get in touch with Corsair support and do whatever it recommends to confirm the PS is in fact OK. DOA is not the only way for a PS to be dead. Some component defects only manifest after more normal use than a factory burn-in provides. Why did you find it necessary to replace the PS in the first place? Could the old one dying have weakened some motherboard component or caused the old HD to fail? It happens.

If the PS is OK, then do the same with the motherboard maker if the board is still under warranty. If it's not still under warranty, exactly how old is it, and what make and model? Does support for it note an available BIOS update or something else relevant to your problem? Could your experience be a common problem with that model, or maybe with its CPU?

If you didn't use ddrescue or an equivalent to clone the "dying" HD, odds are the clone has at least one error lurking, waiting to bite when least expected, if it's not happening already.
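
For reference, here is a rough sketch of the kind of two-pass ddrescue clone being suggested. The function only prints the commands instead of running them, because they overwrite the destination. The device names and map file name are placeholders (assumptions, not taken from this thread), and the flags are the ones GNU ddrescue documents; check `man ddrescue` for your version before running anything.

```shell
#!/bin/sh
# Print (do not run) a two-pass GNU ddrescue clone of SRC onto DST.
# SRC, DST and MAP are placeholders supplied by the caller.
ddrescue_plan() {
    src="$1"; dst="$2"; map="$3"
    # Pass 1: copy everything that reads cleanly and skip the slow
    # scraping of bad areas (-n); -f is needed to write to a device.
    echo "ddrescue -f -n $src $dst $map"
    # Pass 2: go back to the bad areas, retrying up to 3 times (-r3)
    # with direct disk access (-d), resuming from the same map file.
    echo "ddrescue -d -f -r3 $src $dst $map"
}
```

Calling `ddrescue_plan /dev/sdX /dev/sdY rescue.map` prints the two commands for review. The map file is what lets the second pass pick up where the first left off, and it also records which areas could never be read.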

Having your only work computer malfunction like this is serious. Having something to fall back on while troubleshooting is a good idea. A 6-year-old off-lease SFF refurb, or an off-pallet PC for $40 or less, to stick your existing HD (or a clone of it) in wouldn't be a bad idea. It would be worth the cost to prevent a spontaneous shutdown at a disastrously bad time, and it would give you an unpressured opportunity to isolate the problem.
 
1 members found this post helpful.
Old 08-14-2016, 06:36 AM   #23
Shadow_7
Senior Member
 
Registered: Feb 2003
Distribution: debian
Posts: 4,137
Blog Entries: 1

Rep: Reputation: 873
There are kernel modules that can leak RAM, especially those not from the kernel sources (wifi drivers / github). And sometimes network tools like network-manager or wpa_supplicant can be unstable. Monitor your RAM usage; if it's high at the time of failure, it could be a contributing factor. A more vetted distro could have versions of things that don't have these issues.
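
One way to do that monitoring, as a minimal sketch: the function below works out the percentage of RAM in use from the MemTotal and MemAvailable fields of /proc/meminfo. The function name and the optional file argument are made up for illustration, and MemAvailable is assumed to be present (it is on reasonably recent kernels).

```shell
#!/bin/sh
# Print the percentage of RAM in use, based on MemTotal and MemAvailable.
# $1 is an optional meminfo-style file (defaults to /proc/meminfo).
mem_used_percent() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }
    ' "${1:-/proc/meminfo}"
}
```

Appending `$(date '+%F %T') $(mem_used_percent)` to a file once a minute, for example from cron, leaves a record of how full memory was shortly before the next reset.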

It could also be that your install got corrupted by a bad storage device and doing a fresh install would give you what you already have without the issues. This is why things like ZFS exist. Not because of paranoia, but because things just aren't trustworthy anymore, including your storage devices.

You might also check that you have swap and have it enabled.

# swapon -s
$ cat /proc/sys/vm/swappiness

60 tends to be the default. 20 is more reasonable IMO if you have slow storage devices or want to hit them less. If it's 0, the kernel avoids swapping until memory is nearly exhausted, which is close to not having swap at all. That is probably what you want if you actually don't have swap, although only if you're sporting 8GB or more of RAM IMO.
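
The two checks above can be rolled into one small helper, sketched here. It counts the active swap areas listed in /proc/swaps (one header line, then one line per area) and reads the current swappiness value. The function name and the optional file arguments are only for illustration.

```shell
#!/bin/sh
# Summarise swap state: number of active swap areas plus vm.swappiness.
# $1 and $2 optionally override the two /proc files that are read.
report_swap() {
    swaps_file="${1:-/proc/swaps}"
    swappiness_file="${2:-/proc/sys/vm/swappiness}"
    # Skip the header line and count the remaining non-empty lines.
    areas=$(awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$swaps_file")
    swappiness=$(cat "$swappiness_file")
    echo "active swap areas: $areas, vm.swappiness: $swappiness"
}
```

To try a lower value until the next reboot, `sysctl vm.swappiness=20` as root should do it.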
 
Old 08-14-2016, 07:10 AM   #24
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: Fedora 34
Posts: 3,664

Rep: Reputation: 1059
The suggestion of running a live distro isn't bad to rule out software. Once hardware is confirmed as the cause, the only way to solve this may be to start swapping parts. Since the only common failure you reported was the failure of an SSD and an HDD, I would start with the SATA cables, since they're cheap. Are you using built-in graphics or a card? Any other add-ons that you could disable or remove for a test?
 
Old 08-14-2016, 07:55 AM   #25
273
LQ Addict
 
Registered: Dec 2011
Location: UK
Distribution: Debian Sid AMD64, Raspbian Wheezy, various VMs
Posts: 7,585

Rep: Reputation: 2351
I'd say if you're not able to run a live distro (or memtest) either over the weekend if it's a work PC or during the week if it's a home PC, then you need a new PC now. If your PC is so important that it can't be down for 48 hours, then you need at least 2 working PCs.
It is possible to troubleshoot most issues but if you can't devote any time to troubleshooting then buy another PC -- well, buy another 2 PCs so that when something like this happens again you can carry on working.
 
Old 08-14-2016, 11:28 AM   #26
zombieno7
Member
 
Registered: Aug 2012
Posts: 49

Original Poster
Rep: Reputation: Disabled
Okay, so I figured I should post an update. First, I guess I should explain the situation. I am a freelance writer. Downtime directly equals lost pay. This computer is both my home and work computer, and the only backup I have is a hacked chromebook running Gentoo, but that doesn't have any of my files, since the first computer is also my home server. I know all of this is a terribly bad idea, and I should have multiple machines, but I simply don't have the money at this point to do that.

Yesterday, I ran memtest for around 12 hours off of a live CD with SMP enabled. It made 7 passes and reported no errors. I should also include that there are 16GB of RAM with NO swap. I don't particularly like using swap on desktops, so I usually don't.

Today, I'm running fsck with badblocks to test the hard drives. I'll post an update to that if anything turns up.

The PSU that was replaced was old. It was an Ultra x3 1600W, and it was causing voltage issues, and it smelled like hot (almost burning) electronics. I replaced it with this Corsair, which ran well for weeks before this problem started. Since voltage was a problem with the previous one, the first things that I did with this one were benchmarks and torture tests. I ran a lot of Unigine Heaven and Valley as well as MPrime to test it and the other components. Everything performed well.

The motherboard is an ASRock Fatal1ty 970, and I bought it in January of this year. I looked at it closely for blown capacitors or anything that looked suspicious, but it looks brand new.

I have 3 drives in the system: 2 SSDs and one HDD. The OS is installed on a new Samsung Evo 850. The /home folder is on an older Crucial M500. Then, there is a 2TB storage HDD which I recently replaced because some files on the drive refused to play. I just used dd to copy the old drive to the new one, and it appears to work, but I don't know if something might have gotten messed up in the process.

On the software side, I bought an RX 480 when it was launched and went through the process of bringing the system up to the point where it could run with open source drivers. At the time, it meant running a version of kernel 4.7 from git along with mesa and llvm from git. When kernel 4.7 was released, I just copied the old config from the git version. I don't know if that may have had a negative impact. I am also still running the git versions of llvm and mesa.

That's everything. Thank you all again for your help.

UPDATE: fsck and badblocks finished. No errors reported on either drive.

Last edited by zombieno7; 08-14-2016 at 05:38 PM. Reason: Update
 
Old 08-14-2016, 07:33 PM   #27
computersavvy
Senior Member
 
Registered: Aug 2016
Posts: 1,879

Rep: Reputation: 714
Quote:
Originally Posted by zombieno7
It's definitely not overheating. It's water cooled. The PSU is new but refurbished. The wattage is good, though. I just don't understand why the RAM would randomly become a problem, especially since it's less than a year old. Could it be a bad kernel? I would try to test it, but I can't figure out what's triggering the resets.
I recently had a machine die. I found it shut down when it normally ran 24 hrs a day, and it would not power on. That is, it would not stay powered on long enough to finish booting. After sitting overnight and coming to room temp it would power on, but before the boot finished it would die again. Trying a second or third time gave the same result, but the time before it died dropped to seconds, then even less with each try, until it would no longer even attempt to power on.
As you said, I did not suspect cooling, because I was also using water cooling and the cooler LED indicated proper operation.
After trying everything else I could think of, I powered it on, went into the BIOS (UEFI), and watched the CPU temp climb like a rocket until it shut down.
Since I only had the one water cooler, I installed the factory air cooler, and the system is now working (at greatly reduced activity) while I wait for the replacement water cooler to arrive.

There was nothing in the logs, nor any other indication of why, but it appears the pump on the cooler quit and the sudden increase in CPU temp caused a safety shutdown. My cooler was only a year old. I would suggest you not rule out the cooler until you have checked everything possible.
Oh, and because of the sudden death I had to run an fsck on each hard drive partition before I could get it to boot properly once I had installed the substitute cooler.
 
1 members found this post helpful.
Old 08-15-2016, 11:43 AM   #28
zombieno7
Member
 
Registered: Aug 2012
Posts: 49

Original Poster
Rep: Reputation: Disabled
Well, it just happened again last night (5AM EST). I checked the logs this morning, and there was nothing. I did notice a weird 9 minute gap in the log before the restart.

If it is the cooling, why can I run MPrime for hours and not get any reading above 52C through lm_sensors? Or are you suggesting that lm_sensors is not accurate? I did notice that it usually happens after the machine has been running for a long time (24h+), though.
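
A gap like that 9 minute one can be picked out mechanically rather than by eye. The sketch below assumes each log line starts with a Unix timestamp in seconds, which is what `journalctl -o short-unix` prints on a systemd machine. A plain syslog file would need its timestamps converted first. The function name and the 300 second default threshold are arbitrary choices for illustration.

```shell
#!/bin/sh
# Read log lines whose first field is a Unix timestamp (seconds) from
# stdin and report every silence longer than $1 seconds (default 300).
find_gaps() {
    awk -v min="${1:-300}" '
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ {
            t = $1 + 0
            if (seen && t - prev > min)
                printf "gap of %d s before: %s\n", t - prev, $0
            prev = t
            seen = 1
        }'
}
```

For example, `journalctl -b -1 -o short-unix | find_gaps 300` would list the quiet stretches in the previous boot's log, assuming the journal is kept across reboots.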
 
Old 08-15-2016, 02:10 PM   #29
computersavvy
Senior Member
 
Registered: Aug 2016
Posts: 1,879

Rep: Reputation: 714
Quote:
Originally Posted by zombieno7
Okay, so I figured I should post an update. First, I guess I should explain the situation. I am a freelance writer. Downtime directly equals lost pay. This computer is both my home and work computer, and the only backup I have is a hacked chromebook running Gentoo, but that doesn't have any of my files, since the first computer is also my home server. I know all of this is a terribly bad idea, and I should have multiple machines, but I simply don't have the money at this point to do that.
~~~
UPDATE: fsck and badblocks finished. No errors reported on either drive.
Glad that fsck and badblocks finished with no errors.
The only thing I would be worried about on the new hard drive is that you used dd to copy from the old drive with bad blocks. I am not 100% sure how dd works when data has been relocated because of bad blocks. Does dd copy the bad blocks? Does it just stream the data from the relocated blocks? Something else?
Since dd is a device dump and copies byte for byte (sequentially) from the source, it raises questions in my mind. I know that if you use dd to copy a 500GB drive to a 2000GB one, the result is a new drive that looks like 500GB, with the partition tables copied and everything. I assume from your description that you copied between drives of equal size, and since it is working I would not be worried about the new HDD on that score. What I am not sure of is the result of using dd to copy from a failing drive to a new one. How does it handle the failed/relocated blocks? Are the bad blocks on the old drive also marked bad on the new drive? I just don't know.
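
A hedged note on those questions, with a small demonstration. dd only sees the logical blocks the drive presents. Sectors the drive firmware has already remapped look like ordinary sectors to it, and the drive's own bad-sector list is not something dd can read or copy. A sector that cannot be read at all produces a read error, and by default dd stops there. With `conv=noerror,sync` it carries on and pads the unreadable block with zeros. The sketch below shows the byte-for-byte behaviour on two throwaway image files rather than real disks. The file names and sizes are invented for the demo.

```shell
#!/bin/sh
# Show on ordinary files (no real disks touched) that dd copies bytes in
# order and leaves the rest of a larger destination as it was.
dd_demo() {
    work=$(mktemp -d) || return 1
    # Stand-ins for a 1 MiB "old drive" and a 4 MiB "new drive".
    dd if=/dev/urandom of="$work/old.img" bs=1024 count=1024 2>/dev/null
    dd if=/dev/zero    of="$work/new.img" bs=1024 count=4096 2>/dev/null
    # conv=notrunc keeps the destination file at full size, the way a
    # real disk keeps its size however little is written to it.
    dd if="$work/old.img" of="$work/new.img" bs=1024 conv=notrunc 2>/dev/null
    # Compare the first 1 MiB of the destination with the source.
    if head -c 1048576 "$work/new.img" | cmp -s - "$work/old.img"; then
        same=yes
    else
        same=no
    fi
    size=$(( $(wc -c < "$work/new.img") ))
    echo "first 1 MiB identical: $same, destination size: $size bytes"
    rm -rf "$work"
}
```

Running `dd_demo` shows the destination keeping its full size while its first 1 MiB matches the source exactly. That is also why filesystem damage present on the old drive travels to the new one and has to be repaired there with fsck.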

Something to consider, since you only have one machine with HDD storage, might be to get a NAS that both your PC and the chromebook could access. It would be cheaper than a second PC, would provide data protection with RAID, and would still function even if the PC itself fails, thus allowing you to continue with the chromebook while troubleshooting the PC. Just a thought.
Another thought on the data storage accessibility issue: you might consider a USB enclosure for the HDD. I assume the chromebook has USB ports, so putting the drive in a USB enclosure would make it usable by the chromebook even when the PC is down. A USB enclosure for an HDD is listed on newegg for about $30.

About my earlier post regarding cooling: it was just a thought based on my recent experience. Unfortunately I don't know of any way to monitor and log CPU temperatures in real time other than lm_sensors and gkrellm. Possibly a script that logs the temps from sensors every few seconds could give you an idea whether temperatures are at fault. It could also log voltages and fan speeds, in case instability there is causing the problem. Such a script could increase system load a little while running, but it may be worth considering and could monitor several areas.
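
A minimal sketch of such a logging script follows. Instead of parsing the `sensors` output it reads the kernel's hwmon files under /sys/class/hwmon directly, which is where lm_sensors gets its numbers and where temperatures are reported in millidegrees Celsius. The function names, the log path and the 5 second interval are illustrative choices, not anything from this thread.

```shell
#!/bin/sh
# Print one "chip/sensor=TEMP" pair (degrees C) for every temperature
# input found under $1 (defaults to /sys/class/hwmon).
read_temps() {
    base="${1:-/sys/class/hwmon}"
    for f in "$base"/hwmon*/temp*_input; do
        [ -r "$f" ] || continue
        chip=$(cat "${f%/*}/name" 2>/dev/null || echo unknown)
        sensor=$(basename "$f" _input)
        milli=$(cat "$f" 2>/dev/null) || continue
        # hwmon reports millidegrees; convert to degrees with one decimal.
        temp=$(awk -v m="$milli" 'BEGIN { printf "%.1f", m / 1000 }')
        echo "$chip/$sensor=$temp"
    done
}

# Append a timestamped line of all temperatures to $1 every $2 seconds.
log_temps() {
    logfile="$1"
    interval="${2:-5}"
    while :; do
        printf '%s %s\n' "$(date '+%F %T')" "$(read_temps | tr '\n' ' ')" >> "$logfile"
        sync    # flush to disk so the last lines survive a hard reset
        sleep "$interval"
    done
}
```

Started as `log_temps "$HOME/temps.log" 5 &`, it leaves the last readings before a reset at the end of the file. Voltages and fan speeds live in the same directories as `in*_input` and `fan*_input` and could be logged the same way.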
 
Old 08-15-2016, 02:42 PM   #30
zombieno7
Member
 
Registered: Aug 2012
Posts: 49

Original Poster
Rep: Reputation: Disabled
As for the temperatures, I run i3 with lm_sensors output in the status bar. Interestingly, after your post, I decided to allow the computer to idle in the BIOS. The temperatures have been climbing VERY slowly since. It's taken since then to go from around 30C to 44.5C now. It might be possible, but I don't have enough info as of yet.

dd does copy exactly, and that was my concern. I did get the same errors from the drive on boot that I did from the old one, but after running fsck, they stopped. I'm guessing it did copy with the problems, but since the drive itself is good, fsck was just able to fix them.

Overall, now, I'm beginning to wonder if the pump I'm using isn't failing and the lm_sensors output isn't just wrong.
 
  


All times are GMT -5.