LinuxQuestions.org
Linux - Server This forum is for the discussion of Linux Software used in a server related context.

Old 04-16-2024, 05:24 PM   #16
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Original Poster
Rep: Reputation: 0

Sorry we missed your answer ...

Quote:
Originally Posted by TB0ne View Post
1 - How, exactly, did you 'swap' things to a new machine, and what kind of machine did you swap TO? And it 100% could be hardware related, since (if you're just moving the hard drive), that IT is the problem.

2 - Errors will obviously show up BEFORE the crash...afterwards, it's showing normal boot/warning messages.

3 - If you have several servers, why is it a bad idea to put in a monitoring solution that can watch whatever you have now and whatever you ADD??

You're omitting a good bit:
  • 4 - What kind of hardware you're moving this hard drive to
  • 5 - What services are running on this server
  • 6 - Has anything changed/been modified/added to this server before this problem started?
  • 7 - How many users?
  • 8 - How much storage?
  • 9 - How much memory? (and have you tested THAT as well??)
There are loads of factors that can cause this, but you've not given us any error messages to work with.
1 - We have 2 identical machines. We simply replaced the disk of the good machine with the one from the supposedly bad machine, and since we observed the same behaviour we deduced that the problem is not hardware.
2 - Are you saying that some error messages are lost and never added to the logs afterwards? In any case, for the moment we will at least be able to see what is displayed on the screen before forcing the restart.
3 - It is a good idea, but it is not the priority right now, although it is planned. Also, we prefer not to install Nagios, as it is too big for our needs (we tested netdata, which is lighter).
4 - See 1
5 - A degraded proprietary telephony IPBX server
6 - No, not that we know of
7 - About 60 people
8 - Less than 100 GB
9 - 4 GB RAM

What we are looking for is to understand what is wrong. We lack a method.

Last edited by lenainjaune; 04-16-2024 at 05:50 PM.
 
Old 04-16-2024, 05:32 PM   #17
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS,Manjaro
Posts: 5,640

Rep: Reputation: 2697
Instead of moving that drive to a different identical server, try taking a new drive of the same capacity and type, CLONE or image the old one onto the new one, and put THAT in the other machine. If it still freezes the same it is either a software issue or the new drive is faulty. (Which is very unlikely.)

IF it is software, and those logs are not giving you useful data, you might need to turn on better logging. Be warned, additional logging may degrade performance, but is the option most likely to give you useful "root cause" information.
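One possible way to get that "better logging", assuming the server logs through systemd-journald (an assumption, it is not confirmed at this point in the thread), is to make the journal persistent so that messages from the boot that froze survive the forced restart. A minimal sketch; the helper name and the drop-in file name are made up for illustration:

```shell
# Sketch: write a journald drop-in that keeps logs across reboots.
# write_persistent_journal_conf is a hypothetical helper name; the target
# directory is an argument so the snippet can be tried on a scratch path.
write_persistent_journal_conf() {
    conf_dir="$1"    # on a real system: /etc/systemd/journald.conf.d
    mkdir -p "$conf_dir" || return 1
    # Storage=persistent makes journald keep its logs under /var/log/journal.
    # SyncIntervalSec shortens the delay before messages are flushed to disk.
    printf '%s\n' '[Journal]' 'Storage=persistent' 'SyncIntervalSec=10s' \
        > "$conf_dir/persistent.conf"
}

# Dry run against a scratch directory; nothing on the system is changed.
scratch=$(mktemp -d)
write_persistent_journal_conf "$scratch/journald.conf.d"
cat "$scratch/journald.conf.d/persistent.conf"
```

On the real server this would be run as root with /etc/systemd/journald.conf.d as the argument, followed by systemctl restart systemd-journald. After the next freeze and reboot, journalctl -b -1 -e shows the end of the previous boot's log.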
 
Old 04-19-2024, 07:54 AM   #18
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by wpeckham View Post
Instead of moving that drive to a different identical server, try taking a new drive of the same capacity and type, CLONE or image the old one onto the new one, and put THAT in the other machine. If it still freezes the same it is either a software issue or the new drive is faulty. (Which is very unlikely.)
Oh yes! We completely missed this point, thank you for reminding us. We will clone it and test it!

Quote:
Originally Posted by wpeckham View Post
IF it is software, and those logs are not giving you useful data, you might need to turn on better logging. Be warned, additional logging may degrade performance, but is the option most likely to give you useful "root cause" information.
Yes, that is exactly what we want, but we do not know how to activate it. We are aware that this will degrade performance, but we will enable it temporarily to see what happens.
 
Old 04-19-2024, 11:13 AM   #19
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by lenainjaune View Post
Oh yes! We completely missed this point, thank you for reminding us. We will clone it and test it!
Cloned successfully with Clonezilla! The system booted and the telephony works. Now we are scanning with badblocks (smartctl was already run successfully while the system was running).

Also, since we must photograph the screen when the system freezes, do you have a solution to prevent the monitor from going to sleep?

We heard about setterm, but it does not seem to work (or we are not applying it correctly). Adding apm=off to the grub config did not help either, nor did the related BIOS setting.

Last edited by lenainjaune; 04-19-2024 at 11:21 AM.
 
Old 04-19-2024, 12:12 PM   #20
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS,Manjaro
Posts: 5,640

Rep: Reputation: 2697
Quote:
Originally Posted by lenainjaune View Post
Cloned successfully with Clonezilla! The system booted and the telephony works. Now we are scanning with badblocks (smartctl was already run successfully while the system was running).

Also, since we must photograph the screen when the system freezes, do you have a solution to prevent the monitor from going to sleep?

We heard about setterm, but it does not seem to work (or we are not applying it correctly). Adding apm=off to the grub config did not help either, nor did the related BIOS setting.
Every time I have had a problem like this, there was a code displayed on the monitor. If the monitor was asleep, waking it up displayed the code.

That said, there were times when that code was NOT the one I needed, and I had to redirect the std or kernel messages elsewhere so I could analyze them later.
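One way to send kernel messages "elsewhere" is the kernel's netconsole module, which emits them over UDP to a second machine that can keep listening after the first one freezes. The sketch below only builds the module parameter string; the helper name is hypothetical and every address in it is a placeholder, not a value from this thread:

```shell
# Sketch: build the parameter string for the netconsole kernel module.
# Format: <src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
# netconsole_param is a hypothetical helper name.
netconsole_param() {
    src_ip="$1"; dev="$2"; dst_ip="$3"; dst_mac="$4"
    port="${5:-6666}"    # same UDP port on both ends, 6666 unless overridden
    printf '%s@%s/%s,%s@%s/%s\n' \
        "$port" "$src_ip" "$dev" "$port" "$dst_ip" "$dst_mac"
}

# Placeholder addresses (documentation range), printed for inspection only.
netconsole_param 192.0.2.10 eth0 192.0.2.20 00:11:22:33:44:55
```

On the server, the string would be passed as root with modprobe netconsole netconsole="...", while the receiving machine listens on the chosen UDP port (for example with netcat, whose options vary by variant) and redirects what it receives to a file. Because netconsole is a console, messages below the console log level are not sent, so the quiet boot option may need to be removed for it to be useful.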
 
Old 04-19-2024, 12:15 PM   #21
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS,Manjaro
Posts: 5,640

Rep: Reputation: 2697
If the machine with the cloned drive freezes, I would look at rebuilding the OS and software.
IF the cloned machine does not freeze, the hardware that DOES freeze is clearly suspect. Perhaps the old hard drive? One might hope, because replacing that old hard drive is the fastest and easiest fix.
 
Old 04-22-2024, 06:58 AM   #22
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by wpeckham View Post
Every time I have had a problem like this, there was a code displayed on the monitor. If the monitor was asleep, waking it up displayed the code.
In our case the keyboard had no effect. It did not wake the monitor.

Quote:
Originally Posted by wpeckham View Post
That said, there were times when that code was NOT the one I needed, and I had to redirect the std or kernel messages elsewhere so I could analyze them later.
You mean sending the output somewhere other than stdout?

Do you know how to do this globally, to a file?
 
Old 04-22-2024, 07:05 AM   #23
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by wpeckham View Post
If the machine with the cloned drive freezes, I would look at rebuilding the OS and software.
IF the cloned machine does not freeze, the hardware that DOES freeze is clearly suspect. Perhaps the old hard drive? One might hope, because replacing that old hard drive is the fastest and easiest fix.
The old hard disk has just been checked (SMART + badblocks) ... without errors!

So the problem is elsewhere ... or more subtle. We will observe the server for 2 weeks; if it does not freeze once, the hard drive will still be the likely cause.
 
Old 04-22-2024, 08:08 AM   #24
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,798

Rep: Reputation: 1201
Perhaps the NIC gets hung when it wakes up from power-save.
Quote:
adding the apm=off to the grub config
This is not used!
Instead add acpi=off

For the old grub:
Quote:
Append acpi=off to the kernel boot command line in /boot/grub/grub.conf
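For the new grub (grub2):
Quote:
Append acpi=off to the kernel command line variable (GRUB_CMDLINE_LINUX) in /etc/default/grub
and regenerate the configuration with the command
grub2-mkconfig -o /boot/grub2/grub.cfg
(on Debian/Ubuntu: update-grub, which writes /boot/grub/grub.cfg).
Ensure the option ends up on the running/active kernel line, not only on alternative kernel lines (like old kernel or failsafe).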
 
Old 04-22-2024, 09:14 AM   #25
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by MadeInGermany View Post
Perhaps the NIC gets hung when it wakes up from power-safe.

This is not used!
Instead add acpi=off

For the old grub:

For the new grub (grub2):


Ensure you do this for the running/active kernel line, not for alternative kernel lines (like old kernel or failsafe).
We configured grub like this:

Code:
root@host:~# cat /etc/default/grub
...
# https://www.linuxquestions.org/questions/linux-hardware-18/acpi-errors-4175648794/#post5965271
GRUB_CMDLINE_LINUX_DEFAULT="quiet acpi=off apm=off"
...
Even so, after the update the monitor still went into sleep mode.

We will try removing the "apm" parameter and leaving only "acpi", but we doubt it will change the result ...
 
Old 04-22-2024, 11:01 AM   #26
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,798

Rep: Reputation: 1201
An extra boot argument should do no harm; in the worst case it gives an "unknown, ignored" message.
Ensure that the generated /boot/grub/grub.cfg has the desired option!
Run grub2-mkconfig or grub-mkconfig to update it!

The Xserver (or Wayland) can switch the display to dark without acpi; some monitors go to power-save soon after being dark.
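Besides grepping grub.cfg, it can help to confirm what the running kernel was actually booted with, which is exposed in /proc/cmdline. A small sketch; has_boot_param is a hypothetical helper and the sample line is made up for the demo:

```shell
# Sketch: check whether a boot parameter is on a kernel command line.
# has_boot_param is a hypothetical helper; without a second argument it
# reads the running kernel's command line from /proc/cmdline.
has_boot_param() {
    param="$1"
    cmdline="${2:-$(cat /proc/cmdline)}"
    # Compare whole words so that "acpi" does not match "acpi=off".
    for word in $cmdline; do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

# Made-up sample line; on the server, drop the second argument.
sample='root=/dev/sda1 ro quiet acpi=off apm=off'
for p in acpi=off apm=off consoleblank=0; do
    if has_boot_param "$p" "$sample"; then
        echo "$p: present"
    else
        echo "$p: missing"
    fi
done
```

The consoleblank=0 check is included because, on a plain text console without X or Wayland, screen blanking is driven by the kernel's console blanking timer rather than by ACPI or APM, so that parameter may be the more relevant one here.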
 
Old 04-22-2024, 11:35 AM   #27
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by MadeInGermany View Post
An extra boot argument should not harm, in the worst case gives a "unknown, ignored" message.
Ensure that the generated /boot/grub/grub.cfg has the desired option!
Run grub2-mkconfig or grub-mkconfig to update it!
It has them:
Code:
root@host:~# grep acpi /boot/grub/grub.cfg 
	linux	/vmlinuz-4.9.0-19-amd64 root=/dev/mapper/ipbx--vg-root ro quiet acpi=off apm=off
		linux	/vmlinuz-4.9.0-19-amd64 root=/dev/mapper/ipbx--vg-root ro quiet acpi=off apm=off
		linux	/vmlinuz-4.9.0-6-amd64 root=/dev/mapper/ipbx--vg-root ro quiet acpi=off apm=off
root@host:~# uname -r
4.9.0-19-amd64
Quote:
Originally Posted by MadeInGermany View Post
The Xserver (or Wayland) can switch the display to dark without acpi; some monitors go to power-safe soon after being dark.
Our system has no graphical server (neither X server nor Wayland). It is a bare server system without a desktop environment.

Furthermore, we did not find a monitor setting dedicated to power-save, unless it is a BIOS parameter.

Last edited by lenainjaune; 04-22-2024 at 11:40 AM.
 
Old 04-22-2024, 12:05 PM   #28
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS,Manjaro
Posts: 5,640

Rep: Reputation: 2697
Quote:
Originally Posted by lenainjaune View Post
The old hard disk has just been checked (SMART + badblocks) ... without errors!

So the problem is elsewhere ... or more subtle. We will observe the server for 2 weeks; if it does not freeze once, the hard drive will still be the likely cause.
#1 a drive problem may not be the MEDIA, or detected by SMART, if it is in the ELECTRONICS! If you have ever had one apart, there is a circuit board in there that interfaces the media heads, motor, and interface circuits to the power and interface to the computer. On those electronics is where SMART lives.
I am looking forward to seeing how it does in the next two weeks!

#2 check your TTY options (Ctrl-Alt-F1 through Ctrl-Alt-F6 or higher) and see if one is echoing the kernel journal messages. That would be the one to check...

As much as I hold out hope that the problem is the storage device, it is not easy to imagine what it could do that would lock the machine so bad that it would go totally unresponsive on network, keyboard, and display. That sounds more like power or thermal/CPU issues, or a failed capacitor on the MB.

I am glad for you that the machine is being replaced anyway! IT sounds like it is past due.
I also recommend that after this is all over you revisit your DR plan for this operation. IF this is a critical resource, as it sounds, then you might need a better fail-over/replacement and backup/restore plan for recovery. IF not to modify that plan, at least to verify that it is adequate for your needs.
 
Old 04-25-2024, 09:47 AM   #29
lenainjaune
LQ Newbie
 
Registered: Apr 2024
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by wpeckham View Post
#1 a drive problem may not be the MEDIA, or detected by SMART, if it is in the ELECTRONICS! If you have ever had one apart, there is a circuit board in there that interfaces the media heads, motor, and interface circuits to the power and interface to the computer. On those electronics is where SMART lives.
I am looking forward to seeing how it does in the next two weeks!
There was a freeze on 23/04, but unfortunately someone rebooted before a photo was taken. We have just told the users to take a photo before restarting.

At least, since the problem is still there with the cloned system disk, this demonstrates that the problem is not related to the disk.

Quote:
Originally Posted by wpeckham View Post
#2 check your TTY options (Ctrl-Alt-F1 through Ctrl-Alt-F6 or higher) and see if one is echoing the kernel journal messages. That would be the one to check...

As much as I hold out hope that the problem is the storage device, it is not easy to imagine what it could do that would lock the machine so bad that it would go totally unresponsive on network, keyboard, and display. That sounds more like power or thermal/CPU issues, or a failed capacitor on the MB.

I am glad for you that the machine is being replaced anyway! IT sounds like it is past due.
Quote:
Originally Posted by wpeckham View Post
I also recommend that after this is all over you revisit your DR plan for this operation. IF this is a critical resource, as it sounds, then you might need a better fail-over/replacement and backup/restore plan for recovery. IF not to modify that plan, at least to verify that it is adequate for your needs.
Yes, we will! We are considering running 2 redundant systems, so that when the first becomes unresponsive the second takes over ... and, more importantly, we will manage our telephony ourselves with our own Asterisk system.

---

Now that we have managed to keep the screen always on (parameter consoleblank=0 in the grub configuration), we noticed that a simulated crash with a kernel panic (we followed this method to trigger it) is also displayed on the login screen.

Is that sufficient to get information from before the freeze?

We also experimented with running journalctl at boot (so before login) by modifying /etc/rc.local to run the detached command journalctl --follow &. In this case the screen is flooded continuously without pause (however, we discovered that Ctrl + s stops it and that another tty remains reachable with Ctrl + Alt + F2 or similar). This flood is strange, because when run after logging in the command does not flood. We suppose this is not the right way to do it.
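A possible alternative to starting journalctl from /etc/rc.local is to let journald itself mirror messages to a spare virtual console, using its ForwardToConsole, TTYPath and MaxLevelConsole settings. This is only a sketch: the helper name is hypothetical, and tty12 and the notice level are arbitrary example choices.

```shell
# Sketch: print a journald drop-in that mirrors log messages to a spare
# virtual console. console_forward_conf is a hypothetical helper name.
console_forward_conf() {
    tty="${1:-/dev/tty12}"    # console that will show the messages
    level="${2:-notice}"      # least severe level that is still forwarded
    cat <<EOF
[Journal]
ForwardToConsole=yes
TTYPath=$tty
MaxLevelConsole=$level
EOF
}

# Print the fragment with the example defaults.
console_forward_conf
```

The output could be saved as a file under /etc/systemd/journald.conf.d/ and activated with systemctl restart systemd-journald. Leaving the monitor on that console (with consoleblank=0 already set) would keep the last messages before a freeze visible for the photo.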
 
Old 04-25-2024, 10:22 AM   #30
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS,Manjaro
Posts: 5,640

Rep: Reputation: 2697
Information from before the freeze, in particular JUST before the freeze so it is likely to capture the cause, is the ONLY information that might be seriously helpful. AT the freeze logging will stop and you will get no information, and AFTER the freeze is also after the reboot and the cause information may be gone for good.

If I understand correctly:
1. If you move the drive to a different identical machine, that one does freeze.
That would eliminate the original machine hardware EXCEPT the drive.
2. IF cloned to a new drive, it will still freeze. That eliminates the drive itself.

If those are both true, we have eliminated all of the hardware and only a software issue can be left.

What has changed about the software or configuration in the few weeks just before this started?
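If the server is Debian-based (an assumption, suggested by the kernel package names earlier in the thread), the dpkg log gives a quick first answer to that question. A minimal sketch that filters dpkg-style log lines by date; the helper name and the demo log lines are made up:

```shell
# Sketch: list package installs, upgrades and removals on or after a date,
# reading dpkg-style lines ("YYYY-MM-DD HH:MM:SS action package ...").
# changes_since is a hypothetical helper name.
changes_since() {
    since="$1"                       # e.g. 2024-03-01
    log="${2:-/var/log/dpkg.log}"
    # ISO dates sort correctly as plain strings, so a string compare works.
    awk -v since="$since" \
        '$1 >= since && ($3 == "install" || $3 == "upgrade" || $3 == "remove")' \
        "$log"
}

# Demo on made-up log lines so the snippet runs anywhere.
demo=$(mktemp)
cat > "$demo" <<'EOF'
2024-02-10 09:00:01 upgrade libexample:amd64 1.0-1 1.0-2
2024-03-05 14:22:10 status installed libexample:amd64 1.0-2
2024-03-05 14:22:11 install exampletool:amd64 <none> 2.3-1
EOF
changes_since 2024-03-01 "$demo"
rm -f "$demo"
```

On the server, changes_since with only a date argument reads /var/log/dpkg.log. Older entries may have been rotated into dpkg.log.1 and compressed files, and configuration edits made by hand will not appear there at all.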
 
  

