LinuxQuestions.org


vladguan 09-07-2016 02:57 AM

CentOS 7 hangs after a period of idle
 
Hi All,

I have a fairly new PC with the following specs:
1. Gigabyte Z170-D3H Intel Socket 1151 6th Gen Skylake motherboard
2. Intel Core i7 6700 3.4GHz Socket 1151 Skylake CPU
3. 4x 16GB DDR4 2400MHz Corsair Vengeance LPX Red RAM
4. Intel 540 Series M.2 240GB SSD
5. Pioneer Blu Ray writer
6. Corsair 860W AX 80+ Platinum Full Modular 120mm Fan ATX PSU
7. 3x 1TB Seagate HDDs
8. Broadcom NIC (onboard NIC is too new for CentOS)

CentOS 7 is installed and updated. I also have VMware installed, running 2 Windows 7 VMs and 2 CentOS 7 VMs. Jenkins is also installed on the host and the VMs perform integration testing. Everything works fine, but recently the host has been hanging over the weekend. I come in on Monday, the server is unresponsive, and I have to force it to shut down.

Kernel version is 3.10.0-327.28.3.el7.x86_64

I have since updated the motherboard's BIOS firmware as well as the SSD's firmware. BIOS is set to UEFI. During the recovery procedure after it hangs, I have noticed that it won't reboot with the press of the restart button. It complains there is no bootable media. Powering it off and then on again and it will boot.

Each time I log back in and check the message log, nothing in there hints at what caused the hang. Nothing gets logged from the time it hung until it is restarted. I also notice that each time it is restarted, the time and date are wrong from the time the journal starts until it ends. For example (the first line is the last successful log entry before a forced restart):
Sep 7 04:01:01 TestSrv systemd: Starting Session 1167 of user root.
Sep 7 16:30:19 TestSrv rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="845" x-info="http://www.rsyslog.com"] start
Sep 8 02:00:16 TestSrv journal: Runtime journal is using 8.0M (max allowed 3.0G, trying to leave 4.0G free of 30.5G available → current limit 3.0G).
Sep 8 02:00:18 TestSrv journal: Journal stopped
Sep 7 16:30:18 TestSrv journal: Runtime journal is using 8.0M (max allowed 3.0G, trying to leave 4.0G free of 30.5G available → current limit 3.0G).

Also, I had Sophos Endpoint Protection installed and thought that it was causing the hang during an auto update, so I uninstalled it, but it still hung after a period of idle.
I have searched the net to no avail for clues as to what would cause it to hang. Does anyone have any ideas on where to look for clues?

TIA,
Vlad

Emerson 09-07-2016 08:01 PM

No bootable media makes me think access to the drive is lost for whatever reason. Bad drive, bad cable, whatnot.

vladguan 09-07-2016 08:12 PM

Quote:

Originally Posted by Emerson (Post 5602197)
No bootable media makes me think access to the drive is lost for whatever reason. Bad drive, bad cable, whatnot.

It only does that on a reboot (shutdown -r or pressing the reset button). If I power it off then power it on, it is fine. CentOS is on the M.2 SSD so no cable. It is slotted into the motherboard.

Emerson 09-07-2016 08:34 PM

Quote:

Originally Posted by vladguan (Post 5602203)
It only does that on a reboot (shutdown -r or pressing the reset button). If I power it off then power it on, it is fine. CentOS is on the M.2 SSD so no cable. It is slotted into the motherboard.

This is exactly what made me think access to the drive is lost. With power off you reset everything, with hot reboot you keep the condition of hardware. Try reseating the M.2 for starters.

vladguan 09-07-2016 08:38 PM

Quote:

Originally Posted by Emerson (Post 5602214)
This is exactly what made me think access to the drive is lost. With power off you reset everything, with hot reboot you keep the condition of hardware. Try reseating the M.2 for starters.

Cheers. I did that already. Unscrewed one end, removed it, checked the condition of the contacts, reinserted it and re-screwed the loose end. I took yesterday's downtime to update the firmware of the SSD. Will see if it still has issues with booting. Any ideas on where to look for why it hangs after a period of idleness? Could the two be related?

Cheers,
Vlad

vladguan 09-07-2016 09:31 PM

Hmm, it crashed again even with the SSD firmware update, and it still won't boot from a reset.

I have fixed the time issue with the journal by running timedatectl to set the timezone.
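
For anyone checking the same symptom: the size of the jump in the log excerpt from the first post can be worked out with a little arithmetic. The sketch below is only an illustration. The function names are made up for this example, the year 2016 is assumed from the post dates (syslog stamps carry no year), and `%b` parsing assumes an English locale.

```python
from datetime import datetime

# Classic syslog stamps look like "Sep 7 16:30:18" and carry no year.
SYSLOG_FORMAT = "%Y %b %d %H:%M:%S"


def parse_syslog_stamp(stamp, year=2016):
    """Parse a syslog-style stamp; the year is an assumption supplied by the caller."""
    return datetime.strptime(f"{year} {stamp}", SYSLOG_FORMAT)


def clock_jump(earlier, later, year=2016):
    """Return the difference between two syslog stamps as a timedelta."""
    return parse_syslog_stamp(later, year) - parse_syslog_stamp(earlier, year)
```

Calling `clock_jump("Sep 7 16:30:18", "Sep 8 02:00:16")` on the two "Runtime journal is using 8.0M" lines gives 9:29:58. A fixed offset of roughly nine and a half hours looks more like a timezone or hardware-clock setting than a drifting clock, which would fit with timedatectl fixing it.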

Jjanel 09-08-2016 12:43 AM

Quote:

it won't reboot with the press of the restart button. It complains there is no bootable media. Powering it off and then on again and it will boot.
Quote:

does that on a reboot (shutdown -r...still wont boot from a reset. Could the two be related?
Does it always say no boot drive when you simply do any&EVERY manual reboot (shutdown -r)?
Does the reset button work ok after a halt (shutdown -h)?
If it only says no boot drive after a freeze, I might guess the SSD boot drive or its controller has a hwd problem.
If the SSD *only* works for a fresh power-on, (if it disappears after ANY kind of reboot), that's a config issue, separate from the OS hang.

Also, I'd wonder what exactly happens if you pulled this boot drive (to simulate a hwd glitch) while it's running (just rebooted during maint.).

One simple debugging idea might be to leave a looping script always running which logs the state of the disappearing SSD boot drive, like: while true; do df / > ~/file; sync; sleep 2; done
(assuming / is the SSD, but don't put this file onto the disappearing-SSD tho!)
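
A slightly longer version of the same idea, as a rough Python sketch rather than a tested tool. The function names and the log line layout are invented for this example. It appends timestamped lines instead of overwriting the file, and it uses `os.fsync` on the log file in place of a global `sync`. One caveat: a filesystem usage query may be answered from cached data, so it might keep reporting "ok" for a while after the drive has gone.

```python
import os
import shutil
import time
from datetime import datetime


def log_disk_state(watch_path, log_path):
    """Append one timestamped line describing watch_path to log_path."""
    stamp = datetime.now().isoformat(timespec="seconds")
    try:
        usage = shutil.disk_usage(watch_path)
        state = f"ok total={usage.total} used={usage.used} free={usage.free}"
    except OSError as exc:
        # The filesystem could not be queried; record the error instead.
        state = f"ERROR {exc!r}"
    line = f"{stamp} {watch_path} {state}\n"
    with open(log_path, "a") as log:
        log.write(line)
        log.flush()
        os.fsync(log.fileno())  # push the line to disk before a possible freeze
    return line


def watch(watch_path, log_path, interval=2.0, iterations=None):
    """Log every `interval` seconds; iterations=None means loop until killed."""
    count = 0
    while iterations is None or count < iterations:
        log_disk_state(watch_path, log_path)
        count += 1
        time.sleep(interval)
```

Usage would be something like `watch("/", "/mnt/other/ssd-watch.log")`, where the log path is a made-up location on one of the other drives, not on the SSD being watched.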

Quote:

...the server is unresponsive. Have to force it to shut down.
How do you 'force' it? Power-off? (since it is 'unresponsive')

vladguan 09-08-2016 01:11 AM

Quote:

Originally Posted by Jjanel (Post 5602284)
Does it always say no boot drive when you simply do any&EVERY manual reboot (shutdown -r)?

It does come up after a shutdown -r

Quote:

Originally Posted by Jjanel (Post 5602284)
Does the reset button work ok after a halt (shutdown -h)?

Yes

Quote:

Originally Posted by Jjanel (Post 5602284)
If it only says no boot drive after a freeze, I might guess the SSD boot drive or its controller has a hwd problem.

Hmmm... most likely scenario


Quote:

Originally Posted by Jjanel (Post 5602284)
If the SSD *only* works for a fresh power-on, (if it disappears after ANY kind of reboot), that's a config issue, separate from the OS hang.

See above

Quote:

Originally Posted by Jjanel (Post 5602284)
Also, I'd wonder what exactly happens if you pulled this boot drive (to simulate a hwd glitch) while it's running (just rebooted during maint.).

Not game enough to pull out a tight fitting M.2 SSD with everything running and spinning :)

Quote:

Originally Posted by Jjanel (Post 5602284)
One simple debugging idea might be to leave a looping script always running which logs the state of the disappearing SSD boot drive, like: while true; do df / > ~/file; sync; sleep 2; done
(assuming / is the SSD, but don't put this file onto the disappearing-SSD tho!)

I have done it so will see if anything happens.... Thanks for the tip.

Quote:

Originally Posted by Jjanel (Post 5602284)
How do you 'force' it? Power-off? (since it is 'unresponsive')

Yes, either pressing the power button (once off, pressing the power button boots OK) or the reset button (won't boot).

Jjanel 09-08-2016 02:51 AM

I'd *guess* that some messages would appear / get logged
if a CentOS7 boot/root disk 'disappeared' offline. (with syslog going to OTHER drive)
And ping [etc] might still work.
(Maybe someone [with more 'yank-able'/expendable hw] could try/confirm this)

So, it's quite a mystery that the system freezes solid AND the boot drive disappears offline.
Esp. without any 'related' changes, after working ok.
I tried to come up with decent web-search for this, but the wording is soooo variable, that I couldn't. (Again, other LQers' advice welcome.)

Also note the time it stops responding to ping from a remote host,
to correlate with the time of the [debug] > ~/file
Maybe add some 'health' info too, like top, ...
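
To make that correlation concrete, here is a minimal sketch. It assumes the debug file holds one entry per line and that each line starts with an ISO-8601 timestamp; both the layout and the function names are assumptions for this example, not something the tools produce by themselves.

```python
from datetime import datetime


def last_heartbeat(lines):
    """Return the newest timestamp found at the start of any line, or None."""
    newest = None
    for line in lines:
        parts = line.split(maxsplit=1)
        if not parts:
            continue  # blank line
        try:
            stamp = datetime.fromisoformat(parts[0])
        except ValueError:
            continue  # line does not start with a timestamp
        if newest is None or stamp > newest:
            newest = stamp
    return newest


def seconds_between(last_log, ping_lost):
    """Positive result: the log went quiet before ping stopped answering."""
    return (ping_lost - last_log).total_seconds()
```

If the last logged line is minutes or hours older than the moment ping stopped, that would suggest the storage side died first and the rest of the system followed later.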

While the BIOS [EFI?] is saying 'no bootable drives' (after reset button after freeze),
can you query the BIOS hw/disk config info? Every 'clue' helps!

Maybe move the SSD to a different controller if possible...
Or replace motherboard (if still hangs, return new mb within 30days =$0?) ...

vladguan 09-08-2016 04:55 AM

Ok, some additional info.
1. I noticed that once it froze, pressing reset would not boot it, so I pressed reset again to go into the BIOS, and the SSD was not listed. It only came back after I powered the machine off and on again and went into the BIOS once more.

2. Each time it froze, I was still able to ping it but could not SSH to it. Its login screen at the PC is also frozen. The Jenkins web interface is also unreachable.

Jjanel 09-08-2016 05:38 AM

I just remembered SysRq! See sysrq.txt (searched: kernel intitle:hang|freeze sysrq|kdump )
There is some info on it there. My CentOS 7 had /proc/sys/kernel/sysrq set to 16 (sync only), so an echo 1 into that file is needed (once booted)!
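
For reference, the number in /proc/sys/kernel/sysrq is a bitmask, with 0 meaning disabled and 1 meaning everything enabled. A small sketch to decode it, with the bit meanings taken from the kernel's sysrq documentation; the function names are just for this example.

```python
# Bit values as listed in the kernel's sysrq documentation.
SYSRQ_BITS = {
    2: "console log level control",
    4: "keyboard control (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def decode_sysrq(value):
    """Return the list of SysRq function groups enabled by `value`."""
    if value == 0:
        return []  # SysRq disabled completely
    if value == 1:
        return list(SYSRQ_BITS.values())  # all functions enabled
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current setting from procfs (Linux only)."""
    with open(path) as handle:
        return int(handle.read().strip())
```

`decode_sysrq(16)` returns only the sync entry, which matches the "only sync" setting mentioned above.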

vladguan 09-08-2016 05:49 AM

Quote:

Originally Posted by Jjanel (Post 5602355)
I just remembered SysRq! See sysrq.txt (searched: kernel intitle:hang|freeze sysrq|kdump )
There is some info on it there. My CentOS 7 had /proc/sys/kernel/sysrq set to 16 (sync only), so an echo 1 into that file is needed (once booted)!

Cheers. Will try that tomorrow.

vladguan 09-08-2016 10:20 PM

Quote:

Originally Posted by Jjanel (Post 5602321)
I'd *guess* that some messages would appear / get logged
if a CentOS7 boot/root disk 'disappeared' offline. (with syslog going to OTHER drive)
And ping [etc] might still work.
(Maybe someone [with more 'yank-able'/expendable hw] could try/confirm this)

So, it's quite a mystery that the system freezes solid AND the boot drive disappears offline.
Esp. without any 'related' changes, after working ok.
I tried to come up with decent web-search for this, but the wording is soooo variable, that I couldn't. (Again, other LQers' advice welcome.)

Also note the time it stops responding to ping from a remote host,
to correlate with the time of the [debug] > ~/file
Maybe add some 'health' info too, like top, ...

While the BIOS [EFI?] is saying 'no bootable drives' (after reset button after freeze),
can you query the BIOS hw/disk config info? Every 'clue' helps!

Maybe move the SSD to a different controller if possible...
Or replace motherboard (if still hangs, return new mb within 30days =$0?) ...


Hi Jjanel,

When it freezes and I press the reset button and then go into BIOS, the SSD is missing. Unfortunately, I cannot move the SSD to another slot as there is only one M.2 slot on the motherboard.

I have modified the sysrq file and will use it when it next freezes.

Emerson 09-08-2016 10:29 PM

I fail to see how SysRq can help to diagnose this problem further. It might be there is some power save function in BIOS/EFI or in kernel that is causing this, possible firmware bug being the underlying problem.

vladguan 09-08-2016 10:39 PM

Quote:

Originally Posted by Emerson (Post 5602746)
I fail to see how SysRq can help to diagnose this problem further. It might be there is some power save function in BIOS/EFI or in kernel that is causing this, possible firmware bug being the underlying problem.

I will be using it to do a clean shutdown. It has crashed with BIOS power saving both on and off. With it off, I do see an error message in dmesg for pcie_aspm.
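
If it helps anyone chasing the ASPM angle, the kernel's current ASPM policy can usually be read from sysfs, where the active policy is shown in square brackets. The sketch below only parses that text. The function names are invented here, and the sysfs file is assumed to exist, which may not be the case if ASPM support is not built into the kernel.

```python
import re


def parse_aspm_policy(text):
    """Split a policy line such as 'default [powersave]' into (current, all_names)."""
    current = None
    names = []
    for option in text.split():
        match = re.fullmatch(r"\[(.+)\]", option)
        if match:
            current = match.group(1)  # bracketed entry is the active policy
            names.append(current)
        else:
            names.append(option)
    return current, names


def read_aspm_policy(path="/sys/module/pcie_aspm/parameters/policy"):
    """Read and parse the policy file (Linux only; path assumed to be present)."""
    with open(path) as handle:
        return parse_aspm_policy(handle.read())
```

Comparing the reported policy with the BIOS power-save setting on and off might show whether the firmware and the kernel disagree about who controls ASPM.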


All times are GMT -5.