LinuxQuestions.org
Linux - Server This forum is for the discussion of Linux Software used in a server related context.

Old 04-16-2024, 09:40 AM   #1
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Rep: Reputation: 0
Determine what freezes a server


Hi all

We have an old server that freezes about once a week, each time requiring a forced restart. So far we have not found the cause. The system runs Debian 9 with kernel 4.9.0-19-amd64.

Here are the errors logged on one of the days when we had to reboot the machine:

Code:
root@server:~# journalctl -p err
...
-- Reboot --
avril 11 11:40:55 server.local kernel: ACPI Error: [CAPB] Namespace lookup failure, AE_ALREADY_EXISTS (20160831/dsfield-211)
avril 11 11:40:55 server.local kernel: ACPI Error: Method parse/execution failed [\_SB.PCI0._OSC] (Node ffff957b499bbaa0), AE_ALREADY_EXISTS (20160831/psparse-543)
avril 11 11:40:55 server.local kernel: platform INT0800:00: failed to claim resource 0
avril 11 11:40:55 server.local kernel: acpi INT0800:00: platform device creation failed: -16
avril 11 11:41:15 server.local systemd[1]: Failed to start Network UPS Tools - power device driver controller.
avril 11 11:41:18 server.local ntpd[710]: inappropriate address 127.0.0.1 for the fudge command, line ignored
...
Based on this, we temporarily disabled ACPI at boot (it is indicated here that ACPI is problematic with Linux) and the UPS tools, to see whether the trouble continues.

We wonder whether we should look at something else, as we suspect that disabling these will change nothing.

To rule out most hardware causes, we moved the disk into another, perfectly working machine of the same model, but saw no difference.

Do you have any advice for determining the exact problem? A method we have not applied? A deeper debugging process?

Thank you in advance for your help.
With adelphity,
lnj

Last edited by lenainjaune; 04-16-2024 at 10:01 AM.
 
Old 04-16-2024, 09:54 AM   #2
lvm_
Member
 
Registered: Jul 2020
Posts: 932

Rep: Reputation: 337
ACPI errors are mostly harmless. You should concentrate on what's happening immediately before the freeze, not at reboot. What does it write to console? Connect a monitor, if it doesn't have one.
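To look at what was logged immediately before the freeze rather than at reboot, one option is to print the tail of the previous boot's journal. This is a minimal sketch, assuming the journal is persistent (the "-- Reboot --" marker in the output above suggests it is). The function name prev_boot_tail and the default of 200 lines are only illustrative.

```shell
#!/bin/sh
# Sketch: print the last N journal lines of the boot that preceded the current one.
# Assumption: persistent journal storage, so the previous boot is still on disk.
prev_boot_tail() {
    lines="${1:-200}"    # hypothetical default of 200 lines
    journalctl -b -1 -n "$lines" --no-pager
}

# Usage (as root, after the forced restart):
#   journalctl --list-boots    # confirm that boot -1 exists
#   prev_boot_tail 300
```

After a hard freeze the very last messages may never have reached the disk, so an empty or unremarkable tail does not prove that nothing was printed to the console.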
 
Old 04-16-2024, 10:21 AM   #3
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Original Poster
Rep: Reputation: 0
Hi, and thank you very much for the quick reply!

Quote:
Originally Posted by lvm_ View Post
ACPI errors are mostly harmless. You should concentrate on what's happening immediately before the freeze, not at reboot. What does it write to console? Connect a monitor, if it doesn't have one.
As the problem is not periodic, we cannot anticipate the moment it occurs, so apart from staring at the screen for a week there is nothing we can do.

Also, the server is physically accessed through a KVM switch connected to a monitor, and when it is in trouble we can neither switch to it nor access any other server. It blocks everything.

But as you suggest, we will connect a dedicated monitor to it until the next freeze, so that we can photograph what is displayed before forcing the reboot.

Is there nothing else to dig into apart from that? System monitoring in addition to the logs?
 
Old 04-16-2024, 10:31 AM   #4
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,662
Blog Entries: 4

Rep: Reputation: 3942
Does the server support any services which are still active and responding? It may well be that it is the user interface, whatever it is, which is "not responding." Can you "telnet" or "ssh" into it in command-line mode? (You should prepare to be able to do so ...)
 
Old 04-16-2024, 10:42 AM   #5
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,637

Rep: Reputation: 7965
Quote:
Originally Posted by lenainjaune View Post
Hi, and thank you very much for the quick reply!
As the problem is not periodic, we cannot anticipate the moment it occurs, so apart from staring at the screen for a week there is nothing we can do. Also, the server is physically accessed through a KVM switch connected to a monitor, and when it is in trouble we can neither switch to it nor access any other server. It blocks everything. But as you suggest, we will connect a dedicated monitor to it until the next freeze, so that we can photograph what is displayed before forcing the reboot.

Is there nothing else to dig into apart from that? System monitoring in addition to the logs?
You say it's an old server....how old?? What kind of hardware? Could very well be your server is just OLD, and your hardware is getting flaky. And you posted what happens after the server reboots....how about looking at what's in the logs BEFORE it was rebooted??
 
Old 04-16-2024, 10:55 AM   #6
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS,Manjaro
Posts: 5,640

Rep: Reputation: 2697
Quote:
Originally Posted by TB0ne View Post
You say it's an old server....how old?? What kind of hardware? Could very well be your server is just OLD, and your hardware is getting flaky. And you posted what happens after the server reboots....how about looking at what's in the logs BEFORE it was rebooted??
I second this. IF the cause or evidence of the cause is logged at all, it will be in the minutes or seconds just BEFORE the freeze, so that is where you need to look. You might also monitor activity levels, temperatures, and other physical and operational statistics from a remote node (or monitoring server: see Nagios etc.) with logging so you can check from a non-frozen device what clues might exist.
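Short of a full monitoring server, watching from a non-frozen device can be as small as a ping loop on another LAN host that records when the server stops answering. This is a minimal sketch, not a replacement for Nagios: the host name default, the log path, the 10 second interval and the function names are all assumptions.

```shell
#!/bin/sh
# Sketch: run on ANOTHER host in the LAN; logs when the server goes up or down.
# HOST, LOG and the 10 s interval are hypothetical values.
HOST="${HOST:-server.local}"
LOG="${LOG:-./server-watch.log}"

log_state() {
    # $1 = state (up/down); appends "timestamp host state" to $LOG
    printf '%s %s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$HOST" "$1" >> "$LOG"
}

watch_host() {
    last=""
    while :; do
        if ping -c 1 -W 2 "$HOST" >/dev/null 2>&1; then
            state=up
        else
            state=down
        fi
        # Log only transitions so the file stays small.
        if [ "$state" != "$last" ]; then
            log_state "$state"
        fi
        last="$state"
        sleep 10
    done
}

# Start the loop only when the script is invoked as: sh watch.sh run
if [ "${1:-}" = "run" ]; then
    watch_host
fi
```

The timestamp of the first "down" line narrows the window to search in the server's own logs, and can be compared with other events on site (backups, cron jobs, equipment being switched on).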

IF your hardware has event logging and error detection, do not forget to also check that. I have had HP and IBM hardware (and Dell I believe) in the server and enterprise level equipment that had better hardware fault detection than the OS ever had.

Last edited by wpeckham; 04-16-2024 at 10:58 AM.
 
Old 04-16-2024, 10:57 AM   #7
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Original Poster
Rep: Reputation: 0
The dedicated monitor is in place now ...

Quote:
Originally Posted by sundialsvcs View Post
Does the server support any services which are still active and responding? It may well be that it is the user interface, whatever it is, which is "not responding." Can you "telnet" or "ssh" into it in command-line mode? (You should prepare to be able to do so ...)
We forgot to say that from the LAN, ping and ssh both fail. The machine appears to be shut down.

We have not yet tested telnet, since SSH fails, so we will test it.

Is this command sufficient to test what you propose, or do we really need telnet access?

Code:
root@host-in-lan:~# nping -p 23 --tcp tel
The VTYs are not accessible either.
 
Old 04-16-2024, 11:07 AM   #8
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS,Manjaro
Posts: 5,640

Rep: Reputation: 2697
Quote:
Originally Posted by lenainjaune View Post
The dedicated monitor is in place now ...
Good, it may display something useful at freeze.
Quote:
We forgot to say that from the LAN, ping and ssh both fail. The machine appears to be shut down.
That might indicate that the entire node is shut down, or that the services were stopped, or that the NIC is no longer talking. Good to know.
Quote:
We have not yet tested telnet, since SSH fails, so we will test it.

...
The VTYs are not accessible either.
If you get that result with ssh and ping, I see no additional value that could be provided by telnet.
 
Old 04-16-2024, 11:39 AM   #9
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by TB0ne View Post
You say it's an old server....how old?? What kind of hardware? Could very well be your server is just OLD, and your hardware is getting flaky.
It is a home PC, a Lenovo ThinkCentre M58; dmidecode indicates a BIOS release date of 27/11/11. BUT, as we stated before, we tried swapping the machine for one that ran perfectly ... and it showed the same symptoms: it froze.

So, on the hardware side, we deduced that the only remaining suspect was the hard disk. We then checked it with smartctl (SMART) and badblocks, and neither returned an error. This suggests that the problem is not hardware related.
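For reference, disk checks of this kind might look like the sketch below. The device path /dev/sda and the function name smart_report are assumptions, not the values used on this server.

```shell
#!/bin/sh
# Sketch: collect SMART evidence for one disk. The device path is an assumption.
smart_report() {
    dev="${1:-/dev/sda}"           # hypothetical device; check with lsblk first
    smartctl -H "$dev"             # overall health verdict
    smartctl -A "$dev"             # attribute table (reallocated/pending sectors, temperature)
    smartctl -l error "$dev"       # drive error log
    smartctl -l selftest "$dev"    # results of earlier self-tests
}

# Usage (as root):
#   smartctl -t long /dev/sda      # start an extended self-test (runs in the background)
#   smart_report /dev/sda          # read the results once the test has finished
#   badblocks -sv /dev/sda         # read-only surface scan, slow on large disks
```

These tools only examine the disk itself, so a clean result says nothing either way about memory, power supply or cabling.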

Quote:
Originally Posted by TB0ne View Post
And you posted what happens after the server reboots....how about looking at what's in the logs BEFORE it was rebooted??
Quote:
Originally Posted by wpeckham View Post
I second this. IF the cause or evidence of the cause is logged at all, it will be in the minutes or seconds just BEFORE the freeze, so that is where you need to look.
We looked at what was in the log at the time the machine crashed (11 April in the original post). What is the difference between BEFORE and AFTER the event? It is logged either way, isn't it?


Quote:
Originally Posted by wpeckham View Post
You might also monitor activity levels, temperatures, and other physical and operational statistics from a remote node (or monitoring server: see Nagios etc.) with logging so you can check from a non-frozen device what clues might exist.
No Nagios for our fleet (it would be like using a tank to kill a fly), but we could consider installing a standalone netdata monitor. Is there no simpler solution than deploying such an infrastructure?

Also, as previously said, we already tried moving the disk into a working computer, with the same results.

Quote:
Originally Posted by wpeckham View Post
IF your hardware has event logging and error detection, do not forget to also check that. I have had HP and IBM hardware (and Dell I believe) in the server and enterprise level equipment that had better hardware fault detection than the OS ever had.
This is not enterprise hardware but a home PC, so no luck there.
 
Old 04-16-2024, 12:05 PM   #10
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Original Poster
Rep: Reputation: 0
We tried to install netdata, but without success so far because of a missing dependency (zlib1g-dev, whose installation fails with "E: Unable to correct problems, you have held broken packages"). As this server is critical for us, we hesitate to tinker with it further.

In any case, we will come back the next time the system gets stuck to give you more information.
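On a critical machine, the "held broken packages" error can be inspected without changing anything. This is a minimal sketch using read-only apt commands; the function name apt_diagnose is illustrative, and what the output will show on this server is not known.

```shell
#!/bin/sh
# Sketch: inspect why apt refuses a package, without modifying the system.
apt_diagnose() {
    pkg="${1:-zlib1g-dev}"
    apt-mark showhold            # packages explicitly put on hold
    apt-cache policy "$pkg"      # installed/candidate versions and their sources
    apt-get -s install "$pkg"    # -s simulates: nothing is installed or removed
}

# Usage (as root): apt_diagnose zlib1g-dev
```

The simulated install usually names the package whose version conflicts, which is a safer starting point than forcing the installation.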
 
Old 04-16-2024, 12:18 PM   #11
michaelk
Moderator
 
Registered: Aug 2002
Posts: 25,714

Rep: Reputation: 5899
There are several factors to look at.
It still could be a faulty hard drive i.e. power or temperature problems.
The slight possibility the KVM switch might be causing a hang up or something else attached to the computer.
The wall power as in dropouts, brown outs or spikes. Similar PSUs might exhibit the same symptoms. Laser printers or other devices that might cause noise, dropouts or spikes on the power lines when turned on or in use.

Might be time to upgrade the hardware.

Last edited by michaelk; 04-16-2024 at 12:20 PM.
 
Old 04-16-2024, 12:52 PM   #12
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,637

Rep: Reputation: 7965
Quote:
Originally Posted by lenainjaune View Post
It is a home PC, a Lenovo ThinkCentre M58; dmidecode indicates a BIOS release date of 27/11/11. BUT, as we stated before, we tried swapping the machine for one that ran perfectly ... and it showed the same symptoms: it froze. So, on the hardware side, we deduced that the only remaining suspect was the hard disk. We then checked it with smartctl (SMART) and badblocks, and neither returned an error. This suggests that the problem is not hardware related.
How, exactly, did you 'swap' things to a new machine, and what kind of machine did you swap TO? And it 100% could be hardware related, since (if you're just moving the hard drive) IT could be the problem.
Quote:
We looked at what was in the log at the time the machine crashed (11 April in the original post). What is the difference between BEFORE and AFTER the event? It is logged either way, isn't it?
Errors will obviously show up BEFORE the crash...afterwards, it's showing normal boot/warning messages.
Quote:
No Nagios for our fleet (it would be like using a tank to kill a fly), but we could consider installing a standalone netdata monitor. Is there no simpler solution than deploying such an infrastructure?
If you have several servers, why is it a bad idea to put in a monitoring solution that can watch whatever you have now and whatever you ADD??
Quote:
Also, as previously said, we already tried moving the disk into a working computer, with the same results. This is not enterprise hardware but a home PC, so no luck there.
You're omitting a good bit:
  • What kind of hardware you're moving this hard drive to
  • What services are running on this server
  • Has anything changed/been modified/added to this server before this problem started?
  • How many users?
  • How much storage?
  • How much memory? (and have you tested THAT as well??)
There are loads of factors that can cause this, but you've not given us any error messages to work with.
 
Old 04-16-2024, 01:03 PM   #13
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by michaelk View Post
There are several factors to look at.
It still could be a faulty hard drive i.e. power or temperature problems.
Its power supply is protected by a UPS, but in case the hard drive is damaged, we decided to run the smartctl and badblocks checks again.

Quote:
Originally Posted by michaelk View Post
The slight possibility the KVM switch might be causing a hang up or something else attached to the computer.
Oh! We had overlooked this possibility ... But as the machine is now connected directly to a monitor, we will see whether the problem still arises.

Quote:
Originally Posted by michaelk View Post
The wall power as in dropouts, brown outs or spikes. Similar PSUs might exhibit the same symptoms. Laser printers or other devices that might cause noise, dropouts or spikes on the power lines when turned on or in use.
As said above, the machine is connected to a UPS.

Quote:
Originally Posted by michaelk View Post
Might be time to upgrade the hardware.
Better: we are migrating the service, but until the new one is operational, the old one must remain.
 
Old 04-16-2024, 01:15 PM   #14
michaelk
Moderator
 
Registered: Aug 2002
Posts: 25,714

Rep: Reputation: 5899
Have you verified the UPS is working?
Is the battery good?
 
Old 04-16-2024, 01:25 PM   #15
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by michaelk View Post
Have you verified the UPS is working?
Is the battery good?
Yes! Several machines are connected to the UPS and only one has a problem. But I do not know whether the problem could be located at one particular outlet.
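Since Network UPS Tools appears in the boot log, the UPS and battery state can in principle be read with its upsc client. This is a minimal sketch: the UPS name myups@localhost and the function name ups_summary are hypothetical, and it assumes the NUT driver is running, which the "Failed to start" line in the first post puts in doubt.

```shell
#!/bin/sh
# Sketch: query a UPS through Network UPS Tools. The UPS name is hypothetical.
ups_summary() {
    ups="${1:-myups@localhost}"    # hypothetical name; list real ones with: upsc -l
    for var in ups.status battery.charge battery.runtime; do
        # Print "unavailable" when the UPS does not report a variable or upsc fails.
        printf '%s: %s\n' "$var" "$(upsc "$ups" "$var" 2>/dev/null || echo unavailable)"
    done
}

# Usage:
#   systemctl status nut-driver    # check the driver that failed at boot
#   upsc -l                        # list the UPS names known to the local NUT server
#   ups_summary myups@localhost
```

This reports the state of the UPS as a whole and does not test an individual outlet; moving the server to an outlet used by a healthy machine would be the simpler check for that.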
 
  

