LinuxQuestions.org - NTP sometimes stops syncing

NTP sometimes stops syncing

Hello,
We have a dozen CentOS 7.7 virtual machines on VMware ESXi 6.x hosts (with vCenter). They run open-vm-tools.
All servers run the ntpd service, which synchronizes with a Windows domain controller (PDC).
I noticed that some of them do not have their time synchronized.
Windows, Ubuntu, and SuSE Linux machines do not experience this problem.
The out-of-sync servers have the ntpd service active (and the process is also running).
I also notice that in the system log file (i.e. dmesg) the last ntp message was recorded several weeks ago. On working systems, ntpd logs messages daily.

ntp runs as /usr/sbin/ntpd -u ntp:ntp -g

/etc/ntp.conf contains:

include /etc/ntp/crypto/pw
restrict 127.0.0.1
keys /etc/ntp/keys
disable monitor
server pdc.bludomain.local iburst

pdc.bludomain.local is the Windows domain controller.

Any ideas on how to make ntpd work?

Quote:

Originally Posted by HTop (Post 6212575)

If you manually restart the ntpd process, can you then successfully issue "ntpd -q"? Does the ntpd process disappear immediately? After a time? Is there anything ntp-related in the log files that might indicate that it's exiting or being killed? I'd open a couple of terminal windows. In one, tail the log file where ntp is logging and grep for ntp. In your case: "tail -f /var/log/dmesg | grep -i ntp". In the second window, check that ntp is running ("ps -ef | grep -i ntp") and, if it's not, restart it. Watch for activity in the window that is tailing the log file.
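The two-window check above could be sketched roughly like this. The log path is an assumption: on a stock CentOS 7 install ntpd normally logs through syslog to /var/log/messages (or the journal, via "journalctl -u ntpd -f"), while /var/log/dmesg usually holds only kernel boot messages, so adjust the path to wherever ntpd really logs on your systems. The function names are made up for illustration.

```shell
# Filter a log stream on stdin down to ntp-related lines (case-insensitive).
filter_ntp() {
    grep -i 'ntp'
}

# Report whether an ntpd process exists (pgrep -x matches the exact name).
ntpd_status() {
    if pgrep -x ntpd >/dev/null 2>&1; then
        echo "ntpd is running"
    else
        echo "ntpd is NOT running"
    fi
}

# Terminal 1 (assumed log path, see above):
#   tail -F /var/log/messages | filter_ntp
# Terminal 2:
#   ntpd_status
```

Using "tail -F" rather than "tail -f" keeps following the file across log rotation, which matters if you leave the window open for days.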

Consider tweaking the ntpd command line to include at least one "-d" (debug) switch; you can include more than one for increasing amounts of debugging information. Then restart the service. (Remember, you're tailing the log file, right?)

I typically use only the NTP server's IP address in the config file, rarely the FQDN. If all of the ntp.conf files are set up the same, though, using the FQDN is likely not the problem. But... if changing the server record in "ntp.conf" to the IP address of "pdc.bludomain.local" on the troublesome servers corrects the problem, I'd check whether there is something "different" about the DNS settings on those systems compared with the servers that are syncing. Do the non-syncing servers get the same IP address when they look up the PDC server as, say, the Ubuntu server(s)? Try pinging and running nmap from one of the non-syncing systems, targeting both the FQDN and the IP address of the PDC. (NTP uses UDP port 123, so run a UDP scan, e.g. "nmap -sU -p 123", and look for that port in both sets of results.) Perhaps the non-syncing systems are looking for an NTP connection where none is available.
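A minimal sketch of the DNS comparison described above, to be run on both a syncing and a non-syncing machine. The helper names and the address 192.0.2.10 are hypothetical placeholders; substitute the PDC's real IP address.

```shell
# Print the first address a name resolves to through the system resolver.
resolve_host() {
    getent hosts "$1" | awk '{ print $1; exit }'
}

# Compare a resolved address against the one you expect.
# $1 = name, $2 = address actually resolved, $3 = address expected
compare_addr() {
    if [ -n "$2" ] && [ "$2" = "$3" ]; then
        echo "OK: $1 -> $2"
    else
        echo "MISMATCH: $1 -> ${2:-no answer} (expected $3)"
    fi
}

# Usage (192.0.2.10 is a placeholder for the PDC's address):
#   compare_addr pdc.bludomain.local "$(resolve_host pdc.bludomain.local)" 192.0.2.10
# Then probe the NTP port on both the name and the IP (UDP scans need root):
#   nmap -sU -p 123 pdc.bludomain.local
#   nmap -sU -p 123 192.0.2.10
```

"getent hosts" goes through the same resolver configuration (nsswitch.conf, /etc/hosts, DNS) that ntpd itself uses, so it is a closer match to what the daemon sees than a direct DNS query would be.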

There could also be something "off" in the VMs' network configuration. If, though, you are able to reach other areas of the network from the CentOS systems, then this isn't your problem. (Caveat: I'm not a VM "expert", so I could be wrong about that.)

HTH...

Check time sources and status with

Code:

ntpq -np

Ensure that the ESXi host has the correct time (it might be used as a hidden time source that ntpq does not show).
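When reading the "ntpq -np" output, the first character of each peer line is the tally code: "*" marks the peer ntpd is currently synchronized to, and a "reach" value of 0 means none of the recent polls were answered. A small sketch that pulls out just that information (the function name is made up, and the field positions assume the usual ten-column peers listing):

```shell
# Read `ntpq -np` output on stdin; print the selected peer with its offset
# and reach value, or a warning if no line carries the '*' tally code.
summarize_peers() {
    awk '
        NR <= 2 { next }    # skip the two header lines
        /^\*/   {
            printf "synced to %s, offset %s ms, reach %s\n", substr($1, 2), $9, $7
            found = 1
        }
        END     { if (!found) print "no system peer selected" }
    '
}

# Usage:
#   ntpq -np | summarize_peers
```

Running this from cron on the affected machines and logging the result would show when a server loses its system peer, rather than finding out weeks later.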

The time source is reachable. However, I have added another NTP server (not Windows) as a time source; I will see over the next few days whether the problem is related to a bad Windows time server or not.

The problem virtually solved itself "automagically".
I think the servers hosting the virtual machines had high processor usage because they were overloaded. After reducing the number of virtual machines, and therefore the load on the virtualization hosts, the problem has practically disappeared.

I have seen this behavior before on VMs. The issue is, basically, that the machine isn't really "ON" or "RUNNING" (whichever term you prefer) 100% of the time. Occasionally, the VM is "SLEEPING" (or whatever term you prefer). In other words, it is not getting any bare-metal CPU cycles. Then, it comes back on. This is normal behavior for a VM. The bare-metal CPU has to divide its time among all of the virtual CPUs. Not all of them can be active at the same time.

The clock on the VM can only update when it is "RUNNING." When the VM is "SLEEPING," its clock stops. It does not update its time.

When the VM transitions from "SLEEPING" to "RUNNING," its clock is behind actual time. NTP then updates the time to keep it current.
By default, if the time is "too far" off (beyond ntpd's panic threshold, which defaults to 1000 seconds), the daemon will stop. This is the scenario you are encountering.

In this case, I tell NTP to ignore large clock differences. To do that, I change the ntp.conf file to contain the line:
tinker panic 0

For reference, see this Red Hat article:
Avoiding clock drift on VMs
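To check which machines already carry the "tinker panic 0" setting, something like the following could be used. It tests for the directive as the first effective line of the file because, as far as I recall, VMware's timekeeping guidance for Linux guests asks for it to be placed at the top of ntp.conf; treat that placement requirement as an assumption and check it against the article. The function name is made up.

```shell
# Succeed if the first non-blank, non-comment line of the given
# ntp.conf file is "tinker panic 0".
panic_disabled_first() {
    awk '
        /^[ \t]*(#|$)/ { next }    # skip comments and blank lines
        { ok = ($1 == "tinker" && $2 == "panic" && $3 == "0"); exit }
        END { exit !ok }
    ' "$1"
}

# Usage:
#   panic_disabled_first /etc/ntp.conf && echo "panic check disabled"
# After editing the file, restart the daemon so it re-reads the config:
#   systemctl restart ntpd
```

Note that with the panic check disabled, ntpd will accept any offset from its sources, so a badly wrong upstream server is no longer caught by that safeguard.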