Old 09-07-2016, 02:57 AM   #1
vladguan
Member
 
Registered: Jun 2014
Posts: 39

Rep: Reputation: Disabled
CentOS 7 hangs after a period of idle


Hi All,

I have a fairly new PC with the following specs:
1. Gigabyte Z170-D3H Intel Socket 1151 6th Gen Skylake motherboard
2. Intel Core i7 6700 3.4GHz Socket 1151 Skylake CPU
3. 4x 16GB DDR4 2400MHz Corsair Vengeance LPX Red RAM
4. Intel 540 Series M.2 240GB SSD
5. Pioneer Blu Ray writer
6. Corsair 860W AX 80+ Platinum Full Modular 120mm Fan ATX PSU
7. 3x 1TB Seagate HDDs
8. Broadcom NIC (onboard NIC is too new for CentOS)

CentOS 7 is installed and fully updated. VMware is also installed on the host, running two Windows 7 VMs and two CentOS 7 VMs. Jenkins runs on the host, and the VMs perform integration testing. Everything works fine, but recently the host has started hanging over the weekend: I come in on Monday and the server is unresponsive, so I have to force it to shut down.

Kernel version is 3.10.0-327.28.3.el7.x86_64

I have since updated the motherboard's BIOS firmware as well as the SSD's firmware. BIOS is set to UEFI. While recovering from a hang, I have noticed that it won't reboot when I press the reset button; it complains that there is no bootable media. If I power it off and then on again, it boots.

Each time I log back in and check the message log, nothing in it hints at what caused the hang. Nothing is logged from the time it hung until it is restarted. I have also noticed that each time it is restarted, the time and date are wrong from when the journal starts until it stops. For example (the first line is the last successful log entry before a forced restart):
Sep 7 04:01:01 TestSrv systemd: Starting Session 1167 of user root.
Sep 7 16:30:19 TestSrv rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="845" x-info="http://www.rsyslog.com"] start
Sep 8 02:00:16 TestSrv journal: Runtime journal is using 8.0M (max allowed 3.0G, trying to leave 4.0G free of 30.5G available → current limit 3.0G).
Sep 8 02:00:18 TestSrv journal: Journal stopped
Sep 7 16:30:18 TestSrv journal: Runtime journal is using 8.0M (max allowed 3.0G, trying to leave 4.0G free of 30.5G available → current limit 3.0G).
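
The two "Runtime journal" lines above look like the same start-up event stamped by two different clocks. A minimal sketch of the arithmetic, assuming GNU date is available and taking the year 2016 from the post date (the log lines themselves carry no year); the function name offset_seconds is hypothetical:

```shell
# Difference in seconds between two timestamps. Both are parsed as UTC so
# that local daylight-saving rules on the machine running this cannot skew
# the result. Requires GNU date (the -d option).
offset_seconds() {
    later=$(date -u -d "$1" +%s)
    earlier=$(date -u -d "$2" +%s)
    echo $((later - earlier))
}

# Timestamps copied from the log excerpt; the year is an assumption.
secs=$(offset_seconds "2016-09-08 02:00:16" "2016-09-07 16:30:18")
printf '%d s = %dh %dm %ds\n' "$secs" $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
# prints: 34198 s = 9h 29m 58s
```

A gap of roughly nine and a half hours is the size of a time-zone offset rather than clock drift, which would fit a clock being read as UTC at one stage of boot and as local time at another. That reading is a guess from the numbers, not something the log states.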

Also, I had Sophos Endpoint Protection installed and thought that it was causing the hang during an auto update, so I uninstalled it, but the host still hung after a period of idle.
I have searched the net to no avail for clues as to what would cause it to hang. Does anyone have any ideas on where to look for clues?

TIA,
Vlad
 
Old 09-07-2016, 08:01 PM   #2
Emerson
LQ Sage
 
Registered: Nov 2004
Location: Saint Amant, Acadiana
Distribution: Gentoo ~amd64
Posts: 7,661

Rep: Reputation: Disabled
No bootable media makes me think access to the drive is lost for whatever reason. Bad drive, bad cable, whatnot.
 
Old 09-07-2016, 08:12 PM   #3
vladguan
Member
 
Registered: Jun 2014
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Emerson
No bootable media makes me think access to the drive is lost for whatever reason. Bad drive, bad cable, whatnot.
It only does that on a reboot (shutdown -r or pressing the reset button). If I power it off then power it on, it is fine. CentOS is on the M.2 SSD so no cable. It is slotted into the motherboard.
 
Old 09-07-2016, 08:34 PM   #4
Emerson
LQ Sage
 
Registered: Nov 2004
Location: Saint Amant, Acadiana
Distribution: Gentoo ~amd64
Posts: 7,661

Rep: Reputation: Disabled
Quote:
Originally Posted by vladguan
It only does that on a reboot (shutdown -r or pressing the reset button). If I power it off then power it on, it is fine. CentOS is on the M.2 SSD so no cable. It is slotted into the motherboard.
This is exactly what made me think access to the drive is lost. With a power-off you reset everything; with a warm reboot the hardware keeps its state. Try reseating the M.2 for starters.
 
Old 09-07-2016, 08:38 PM   #5
vladguan
Member
 
Registered: Jun 2014
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Emerson
This is exactly what made me think access to the drive is lost. With power off you reset everything, with hot reboot you keep the condition of hardware. Try reseating the M.2 for starters.
Cheers. I did that already: unscrewed one end, removed it, checked the condition of the contacts, reinserted it and re-screwed the loose end. I took yesterday's downtime to update the firmware of the SSD. Will see if it still has issues with booting. Any ideas on where to look for why it hangs after a period of idleness? Could the two be related?

Cheers,
Vlad
 
Old 09-07-2016, 09:31 PM   #6
vladguan
Member
 
Registered: Jun 2014
Posts: 39

Original Poster
Rep: Reputation: Disabled
Hmm, it crashed again even with the SSD firmware update, and it still won't boot from a reset.

I have fixed the time issue with the journal by running timedatectl to set the timezone.

Last edited by vladguan; 09-07-2016 at 09:35 PM.
 
Old 09-08-2016, 12:43 AM   #7
Jjanel
Member
 
Registered: Jun 2016
Distribution: any&all, in VBox; Ol'UnixCLI; NO GUI resources
Posts: 999
Blog Entries: 12

Rep: Reputation: 364
Quote:
it won't reboot with the press of the restart button. It complains there is no bootable media. Powering it off and then on again and it will boot.
Quote:
does that on a reboot (shutdown -r...still wont boot from a reset. Could the two be related?
Does it always say no boot drive when you simply do any&EVERY manual reboot (shutdown -r)?
Does the reset button work ok after a halt (shutdown -h)?
If it only says no boot drive after a freeze, I might guess the SSD boot drive or its controller has a hardware problem.
If the SSD *only* works after a fresh power-on (that is, it disappears after ANY kind of reboot), that's a config issue, separate from the OS hang.

Also, I'd wonder what exactly would happen if you pulled this boot drive (to simulate a hardware glitch) while it's running (just after a reboot during maintenance).

One simple debugging idea might be to leave a looping script always running that logs the state of the disappearing SSD boot drive, like: while true; do df / > ~/file; sync; sleep 2; done
(assuming / is the SSD, but don't put this file on the disappearing SSD though!)
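
A slightly fuller version of that loop might look like the sketch below. It appends a timestamped record instead of overwriting the file, so the last entry shows when the root filesystem stopped answering. The log path /mnt/hdd/ssd-watch.log is a placeholder for a location on one of the other disks, and the function names are hypothetical.

```shell
# Print one timestamped record describing the filesystem that holds $1.
# If df fails (for example because the device has gone away), record that.
snapshot() {
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    if out=$(df -P "$1" 2>&1); then
        # Keep only the data row, not the df header line.
        echo "$ts OK $(printf '%s\n' "$out" | tail -n 1)"
    else
        echo "$ts FAIL $out"
    fi
}

# Append a snapshot of path $1 to log file $2 every $3 seconds.
watch_fs() {
    while true; do
        snapshot "$1" >> "$2"
        sync
        sleep "$3"
    done
}

# Example invocation, left commented out because it never returns:
# watch_fs / /mnt/hdd/ssd-watch.log 2
```

If the SSD really does vanish, df or sync may themselves block rather than fail cleanly, so the most useful piece of evidence is likely to be the timestamp of the last line that made it to the log.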

Quote:
...the server is unresponsive. Have to force it to shut down.
How do you 'force' it? Power-off? (since it is 'unresponsive')
 
Old 09-08-2016, 01:11 AM   #8
vladguan
Member
 
Registered: Jun 2014
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Jjanel
Does it always say no boot drive when you simply do any&EVERY manual reboot (shutdown -r)?
It does come up after a shutdown -r

Quote:
Originally Posted by Jjanel
Does the reset button work ok after a halt (shutdown -h)?
Yes

Quote:
Originally Posted by Jjanel
If it only says no boot drive after a freeze, I might guess the SSD boot drive or its controller has a hwd problem.
Hmmm... most likely scenario


Quote:
Originally Posted by Jjanel
If the SSD *only* works for a fresh power-on, (if it disappears after ANY kind of reboot), that's a config issue, separate from the OS hang.
See above

Quote:
Originally Posted by Jjanel
Also, I'd wonder what exactly happens if you pulled this boot drive (to simulate a hwd glitch) while it's running (just rebooted during maint.).
Not game enough to pull out a tight fitting M.2 SSD with everything running and spinning

Quote:
Originally Posted by Jjanel
One simple debugging idea might be to leave a looping script always running that logs the state of the disappearing SSD boot drive, like: while true; do df / > ~/file; sync; sleep 2; done
(assuming / is the SSD, but don't put this file on the disappearing SSD though!)
I have done it so will see if anything happens.... Thanks for the tip.

Quote:
Originally Posted by Jjanel
How do you 'force' it? Power-off? (since it is 'unresponsive')
Yes, either by pressing the power button (once off, pressing the power button boots OK) or the reset button (won't boot).
 
Old 09-08-2016, 02:51 AM   #9
Jjanel
Member
 
Registered: Jun 2016
Distribution: any&all, in VBox; Ol'UnixCLI; NO GUI resources
Posts: 999
Blog Entries: 12

Rep: Reputation: 364
I'd *guess* that some messages would appear / get logged
if a CentOS7 boot/root disk 'disappeared' offline. (with syslog going to OTHER drive)
And ping [etc] might still work.
(Maybe someone [with more 'yank-able'/expendable hw] could try/confirm this)

So, it's quite a mystery that the system freezes solid AND the boot drive disappears offline.
Esp. without any 'related' changes, after working ok.
I tried to come up with decent web-search for this, but the wording is soooo variable, that I couldn't. (Again, other LQers' advice welcome.)

Also note the time it stops responding to ping from a remote host,
to correlate with the time of the [debug] > ~/file
Maybe add some 'health' info too, like top, ...
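
One way to capture that timing from a second machine is sketched below. It assumes ping from iputils and an OpenSSH client with key-based login already set up; the host name testsrv and the log file name are placeholders, and the function names are hypothetical.

```shell
# Turn two exit codes (ping, then ssh) into one timestamped log line.
status_line() {
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    if [ "$1" -eq 0 ]; then p=up; else p=down; fi
    if [ "$2" -eq 0 ]; then s=up; else s=down; fi
    echo "$ts ping=$p ssh=$s"
}

# Probe host $1 every $3 seconds, appending results to file $2.
probe_loop() {
    while true; do
        ping_rc=0
        ping -c 1 -W 2 "$1" >/dev/null 2>&1 || ping_rc=$?
        ssh_rc=0
        # BatchMode stops ssh from prompting; ConnectTimeout bounds the wait.
        ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1 || ssh_rc=$?
        status_line "$ping_rc" "$ssh_rc" >> "$2"
        sleep "$3"
    done
}

# Example invocation, left commented out because it never returns:
# probe_loop testsrv probe.log 30
```

The first line where either value flips to down gives a timestamp to compare with the last entry in the local debug file. A run of ping=up ssh=down lines would suggest the kernel is still answering ICMP while userspace has stalled.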

While the BIOS [EFI?] is saying 'no bootable drives' (after reset button after freeze),
can you query the BIOS hw/disk config info? Every 'clue' helps!

Maybe move the SSD to a different controller if possible...
Or replace motherboard (if still hangs, return new mb within 30days =$0?) ...

Last edited by Jjanel; 09-08-2016 at 03:03 AM.
 
Old 09-08-2016, 04:55 AM   #10
vladguan
Member
 
Registered: Jun 2014
Posts: 39

Original Poster
Rep: Reputation: Disabled
Ok, some additional info.
1. I noticed that once it froze, pressing reset would not boot it, so I pressed reset again to go into the BIOS, and the SSD was not listed. It only came back after I powered the machine off and on again and went into the BIOS again.

2. Each time it froze, I was still able to ping it but could not SSH to it. Its login screen at the PC was also frozen, and the Jenkins web interface was unreachable.
 
Old 09-08-2016, 05:38 AM   #11
Jjanel
Member
 
Registered: Jun 2016
Distribution: any&all, in VBox; Ol'UnixCLI; NO GUI resources
Posts: 999
Blog Entries: 12

Rep: Reputation: 364
I just remembered SysRq! See sysrq.txt in the kernel documentation (I searched: kernel intitle:hang|freeze sysrq|kdump) for some info on it.
My CentOS 7 had /proc/sys/kernel/sysrq set to 16 (sync only), so echo 1 > /proc/sys/kernel/sysrq is needed (once booted)!
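
For reference, the value in /proc/sys/kernel/sysrq is a bitmask, with 0 disabling SysRq and 1 enabling every function. The sketch below decodes it using the bit meanings listed in the kernel's sysrq documentation; the function name describe_sysrq and the short labels it prints are hypothetical, and the commands in the trailing comments need root.

```shell
# Print the SysRq functions enabled by the bitmask value given as $1.
describe_sysrq() {
    v=$1
    if [ "$v" -eq 0 ]; then echo "disabled"; return; fi
    if [ "$v" -eq 1 ]; then echo "all functions"; return; fi
    out=""
    for pair in "2:console-loglevel" "4:keyboard" "8:debug-dumps" "16:sync" \
                "32:remount-ro" "64:signal-processes" "128:reboot-poweroff" \
                "256:nice-rt-tasks"; do
        bit=${pair%%:*}     # number before the colon
        name=${pair#*:}     # label after the colon
        if [ $((v & bit)) -ne 0 ]; then out="$out $name"; fi
    done
    echo "${out# }"
}

describe_sysrq 16     # prints: sync
describe_sysrq 176    # prints: sync remount-ro reboot-poweroff

# To inspect and change the live setting (as root):
#   cat /proc/sys/kernel/sysrq
#   echo 1 > /proc/sys/kernel/sysrq
# To keep it across reboots, a line such as "kernel.sysrq = 1" can go in
# a file under /etc/sysctl.d/ (the file name is up to you).
```

With SysRq fully enabled, Alt+SysRq followed by s, u and b asks the kernel to sync, remount read-only and reboot, provided the kernel still responds to the keyboard; writing the same letters to /proc/sysrq-trigger does the equivalent from a shell.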

Last edited by Jjanel; 09-08-2016 at 05:48 AM.
 
Old 09-08-2016, 05:49 AM   #12
vladguan
Member
 
Registered: Jun 2014
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Jjanel
I just remembered SysRq! See sysrq.txt (searched: kernel intitle:hang|freeze sysrq|kdump )
some info on it My CentOS7 had 16 (only sync), so the echo 1 needed (when booted)!
Cheers. Will try that tomorrow.
 
Old 09-08-2016, 10:20 PM   #13
vladguan
Member
 
Registered: Jun 2014
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Jjanel
I'd *guess* that some messages would appear / get logged
if a CentOS7 boot/root disk 'disappeared' offline. (with syslog going to OTHER drive)
And ping [etc] might still work.
(Maybe someone [with more 'yank-able'/expendable hw] could try/confirm this)

So, it's quite a mystery that the system freezes solid AND the boot drive disappears offline.
Esp. without any 'related' changes, after working ok.
I tried to come up with decent web-search for this, but the wording is soooo variable, that I couldn't. (Again, other LQers' advice welcome.)

Also note the time it stops responding to ping from a remote host,
to correlate with the time of the [debug] > ~/file
Maybe add some 'health' info too, like top, ...

While the BIOS [EFI?] is saying 'no bootable drives' (after reset button after freeze),
can you query the BIOS hw/disk config info? Every 'clue' helps!

Maybe move the SSD to a different controller if possible...
Or replace motherboard (if still hangs, return new mb within 30days =$0?) ...

Hi Jjanel,

When it freezes and I press the reset button and then go into BIOS, the SSD is missing. Unfortunately, I cannot move the SSD to another slot as there is only one M.2 slot on the motherboard.

I have modified the sysrq file and will use it when it next freezes.
 
Old 09-08-2016, 10:29 PM   #14
Emerson
LQ Sage
 
Registered: Nov 2004
Location: Saint Amant, Acadiana
Distribution: Gentoo ~amd64
Posts: 7,661

Rep: Reputation: Disabled
I fail to see how SysRq can help to diagnose this problem further. It might be there is some power save function in BIOS/EFI or in kernel that is causing this, possible firmware bug being the underlying problem.
 
Old 09-08-2016, 10:39 PM   #15
vladguan
Member
 
Registered: Jun 2014
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Emerson
I fail to see how SysRq can help to diagnose this problem further. It might be there is some power save function in BIOS/EFI or in kernel that is causing this, possible firmware bug being the underlying problem.
Will be using it to do a clean shutdown. It has crashed with BIOS power saving both on and off. With it off, I do see an error message in dmesg for pcie_aspm.
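
To see what the kernel is doing with PCIe Active State Power Management, the current policy can be read from sysfs, where the active choice is shown in square brackets. A minimal sketch, assuming the usual pcie_aspm sysfs path is present on this kernel; the helper name active_policy is hypothetical.

```shell
# Extract the bracketed (active) entry from a policy string such as
# "[default] performance powersave".
active_policy() {
    echo "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

policy_file=/sys/module/pcie_aspm/parameters/policy
if [ -r "$policy_file" ]; then
    echo "ASPM policy: $(active_policy "$(cat "$policy_file")")"
else
    echo "ASPM policy file is not readable on this system"
fi

# Related checks:
#   dmesg | grep -i aspm     # show the ASPM messages mentioned above
# Booting with the kernel parameter pcie_aspm=off turns ASPM off, which is
# one way to test whether it is involved in the hang.
```

This only reports the setting; whether ASPM has anything to do with the SSD dropping out is still an open question in this thread.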
 
  

