NTP offset

tony359 · 05-08-2013, 04:24 AM

Hello everybody!

My first message here!
I am dealing with an NTP issue. I have searched the web for a couple of days, got a basic knowledge of how the NTP protocol works, but still I am a bit puzzled and I have a few questions.

Everything started when some of our servers could not keep time, ending up with awful OFFSET values and drifting.

I understand that NTP is not a simple synchronisation task that steps the time every time it runs: it's an algorithm that polls the time from a number of servers and assesses the accuracy of the system clock, coming up with a way to slew it so the user will never see the time changing.
Only when the time is wildly out will NTP step the time as a one-off to restore it.

I also understand that NTP will step the time when the offset is over 128 ms, and that it will refuse to operate when the clock is more than 1000 s off. Also, that NTP has a slew limit of, if memory serves, about 43 s per day (500 ppm).
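As a rough illustration of those thresholds, here is a minimal Python sketch. It assumes ntpd's default values (128 ms step threshold, 1000 s panic threshold, 500 ppm maximum slew rate); the function names are invented for this example, and the real daemon also waits for a while before it actually steps.

```python
STEP_THRESHOLD_S = 0.128     # default step threshold (assumed, not read from ntpd)
PANIC_THRESHOLD_S = 1000.0   # default panic threshold (assumed)
MAX_SLEW_PPM = 500.0         # maximum slew rate in parts per million (assumed)


def classify_offset(offset_s):
    """Return how an offset of this size is normally handled."""
    magnitude = abs(offset_s)
    if magnitude > PANIC_THRESHOLD_S:
        return "panic"   # ntpd gives up unless started with -g
    if magnitude > STEP_THRESHOLD_S:
        return "step"    # too large to slew; the clock is eventually set anew
    return "slew"        # corrected gradually


def max_slew_per_day_s():
    """Largest correction that slewing alone can apply in one day."""
    return MAX_SLEW_PPM * 1e-6 * 86400


def min_slew_time_s(offset_s):
    """Shortest time needed to slew away an offset at the maximum rate."""
    return abs(offset_s) / (MAX_SLEW_PPM * 1e-6)


if __name__ == "__main__":
    for offset in (0.005, 0.600, 1.1, 2000.0):
        print(f"{offset:9.3f} s -> {classify_offset(offset)}")
    print(f"slew limit: {max_slew_per_day_s():.1f} s/day")
```

With these defaults the slew limit works out to 43.2 s per day, and slewing away 0.1 s takes at least 200 s.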

I have set my NTP on my server using 4 external NTP servers, stratum 1, 2 and 3. It looks like the offset varies from 0 to something as big as 600, and I do not know why.

Yesterday I first set my clock 'one off' manually (using NTPDATE, with the NTPD off), then I amended my configuration file and started the NTP daemon.
After a while the NTP.DRIFT file was populated.

I have been monitoring the NTP using NTPQ and I could not find anything obvious (to me).

My problem is that the OFFSET value randomly jumps from 0-ish to 300/500 and I am not sure whether that behaviour is normal. I will keep monitoring with my latest configuration (previously I was only using one NTP server). My other servers eventually drift to 30000/50000 until the NTP comes up with a "frequency error".

Here is my current configuration (while I was writing, the offset drifted to 1000!)

Code:

server 130.88.200.4 prefer iburst true
server 158.43.128.66 prefer iburst true
server 81.168.77.149 prefer iburst true
server 192.93.2.20 prefer iburst true
server 212.82.32.15 prefer iburst true
    restrict 127.0.0.1
    restrict 169.254.1.1 mask 255.255.0.0 nomodify
    tos orphan 4
    driftfile /status/etc/ntp/ntp.drift
    logfile /var/log/ntp.log
    multicastclient
    broadcastdelay 0.008
    enable monitor



     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+130.88.200.4    194.66.31.14     2 u    3   64  377   16.051  1115.48   0.977
+158.43.128.66   193.67.79.202    2 u    1   64  377   10.894  1109.79   4.932
*81.168.77.149   82.219.4.30      3 u   20   64  377   39.825  1109.98   0.977
+192.93.2.20     .GPSi.           1 u   17   64  377   28.828  1114.06   0.977
+212.82.32.15    .PPS.            1 u   38   64  377   29.949  1106.91  12.227
ntpq> as

ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 10225  9414   yes   yes  none  candidat   reachable  1
  2 10226  9414   yes   yes  none  candidat   reachable  1
  3 10227  9614   yes   yes  none  sys.peer   reachable  1
  4 10228  9414   yes   yes  none  candidat   reachable  1
  5 10229  9414   yes   yes  none  candidat   reachable  1


ntpq> rv 10227
assID=10227 status=9614 reach, conf, sel_sys.peer, 1 event, event_reach,
srcadr=81.168.77.149, srcport=123, dstadr=10.1.1.30, dstport=123,
leap=00, stratum=3, precision=-16, rootdelay=45.349,
rootdispersion=53.513, refid=82.219.4.30, reach=377, unreach=0, hmode=3,
pmode=4, hpoll=6, ppoll=6, flash=00 ok, keyid=0, ttl=0, offset=1109.982,
delay=39.825, dispersion=1.047, jitter=0.977,
reftime=d53493dd.1e7dc56a  Wed, May  8 2013 10:05:33.119,
org=d534943a.850c23ca  Wed, May  8 2013 10:07:06.519,
rec=d5349439.6e904d67  Wed, May  8 2013 10:07:05.431,
xmt=d5349439.63b47ef1  Wed, May  8 2013 10:07:05.389,
filtdelay=    42.33   39.83   39.83   39.98   41.09   39.93   41.04   40.87,
filtoffset= 1108.99 1109.98 1109.48 1109.23 1109.50 1108.64 1108.67 1109.08,
filtdisp=      0.99    1.02    1.05    1.08    1.11    1.14    1.17    1.20


ntpq> rv
assID=0 status=0624 leap_none, sync_ntp, 2 events, event_peer/strat_chg,
version="ntpd 4.2.4p4@1.1520-o Sun Nov 22 17:34:54 UTC 2009 (1)",
processor="i686", system="Linux/2.6.35.13", leap=00, stratum=4,
precision=-10, rootdelay=39.825, rootdispersion=1112.531, peer=10227,
refid=81.168.77.149,
reftime=d5349439.6e904d67  Wed, May  8 2013 10:07:05.431, poll=6,
clock=d534945c.d3d2d14b  Wed, May  8 2013 10:07:40.827, state=4,
offset=1109.982, frequency=52.120, jitter=0.977, noise=10.164,
stability=3.914, tai=0
ntpq>

From the RV command, I can see that my clock IS adjusted continuously, so why is the OFFSET getting bigger and bigger?

I understand that this could be caused by a hardware clock that drifts too much, but I am still puzzled.
Also, over 128 ms the NTP daemon should step the time, so why does that not happen?

What I did notice is that the value in the DRIFT file is changing. It was 49, then -5.99, now it's 24.032.
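For context, the drift file holds ntpd's current estimate of the clock's frequency error in parts per million (ppm). A quick conversion gives a feel for what such values mean in seconds per day; the helper name below is made up for the example.

```python
SECONDS_PER_DAY = 86400


def ppm_to_seconds_per_day(ppm):
    """Convert a frequency error in ppm to clock drift in seconds per day."""
    return ppm * 1e-6 * SECONDS_PER_DAY


if __name__ == "__main__":
    # The three drift-file values mentioned above.
    for ppm in (49.0, -5.99, 24.032):
        print(f"{ppm:8.3f} ppm -> {ppm_to_seconds_per_day(ppm):+.2f} s/day")
```

So a drift value of 49 corresponds to a clock running fast by a little over 4 s per day if left uncorrected, well inside what ntpd can compensate for.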

My logs do not show anything strange

Code:

 7 May 18:27:32 ntpd[4093]: synchronized to 130.88.200.4, stratum 2
 7 May 18:30:53 ntpd[4093]: ntpd exiting on signal 15
 7 May 18:31:12 ntpd[13566]: synchronized to 130.88.200.4, stratum 2
 7 May 21:01:54 ntpd[13566]: ntpd exiting on signal 15
 7 May 21:03:16 ntpd[10666]: invalid flags (9088) in file /tmp/ntpDMGX5S
 7 May 21:03:33 ntpd[7313]: synchronized to 130.88.200.4, stratum 2
 7 May 21:41:44 ntpd[7313]: ntpd exiting on signal 15
 7 May 21:43:00 ntpd[13653]: synchronized to 81.168.77.149, stratum 3
 7 May 21:57:49 ntpd[13653]: ntpd exiting on signal 15
 7 May 21:58:08 ntpd[19317]: synchronized to 81.168.77.149, stratum 3
 7 May 22:00:14 ntpd[19317]: ntpd exiting on signal 15
 7 May 22:00:33 ntpd[25420]: synchronized to 81.168.77.149, stratum 3

The only exception is an "invalid flags" error, which I have never seen before.

Can anybody help me? Please let me know if you need further details.

Thanks!
Tony

tronayne · 05-08-2013, 07:02 AM

You're correct that the offset values look like a problem, and that kind of problem is almost always network related.

You don't say where in the world you are (I'll assume North America); here's a suggestion that you try the pool servers plus the "local clock" as here:

Code:

server  127.127.1.0     # local clock
fudge   127.127.1.0 stratum 10
#server  pool.ntp.org
server  0.us.pool.ntp.org
server  1.us.pool.ntp.org
server  2.us.pool.ntp.org

With these settings, on HughesNet satellite service (which has delays for the round trip to the satellite) a "normal" display should look like this:

Code:

ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 127.127.1.0     .LOCL.          10 l  15h   64    0    0.000    0.000   0.000
*50.116.55.161   192.5.41.40      2 u  730 1024  377  1280.03  -22.906  32.752
+65.23.154.62    149.20.64.28     2 u  325 1024  377  1341.86  -67.265  73.929
+66.162.15.65    64.236.96.53     2 u  274 1024  377  1414.38  -93.636  95.378

You only need three and you should not be using a stratum 1 server unless you've asked for permission to do so (it's considered impolite, more or less).

The inclusion of

Code:

server  127.127.1.0     # local clock
fudge   127.127.1.0 stratum 10

is so that NTP will fall back on the system clock when no external source is available; like when your network connection goes away for some reason.

What NTP does is evaluate the time servers to select the "best" one of them to synchronize to. That will change from time to time -- it will replace servers that are slow, noisy, or just plain gone out of service periodically which is why it's a good idea to use the pool servers rather than specifying specific addresses.

The drift file, generally /etc/ntp/drift, will change over time as NTP evaluates your system clock versus a time standard. That value should not swing wildly, but it will change for a while then settle down over time -- it can take a few days for that to happen.

When you first implement NTP your system clock may be in never-never land somewhere and it's a good idea to initially set the system clock with ntpdate. Once NTP synchronizes, though, you should not need to do that. The system clock is a "software clock" run by the kernel via interrupts. On boot, it is initially set from the hardware clock, then NTP keeps it on time once synchronized to an external time source. My systems are currently synchronized to

Code:

*50.116.55.161   192.5.41.40      2 u  730 1024  377  1280.03  -22.906  32.752

the one that has the asterisk is the one you're synchronized to, the others are candidates for synchronization if that one goes away.
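If you want to pick that information out of the display with a script, something along these lines would do. It is only a sketch: it assumes the ten-column layout shown above with a one-character tally code in front of each peer line, and the function names are invented for the example.

```python
def parse_peer_line(line):
    """Split one peer line of `ntpq -p` output into a dict."""
    tally = line[0]                 # ' ', '*', '+', '-', 'x', ...
    fields = line[1:].split()
    if len(fields) != 10:
        raise ValueError(f"unexpected peer line: {line!r}")
    remote, refid, st, kind, when, poll, reach, delay, offset, jitter = fields
    return {
        "selected": tally == "*",   # '*' marks the current sync source
        "tally": tally,
        "remote": remote,
        "refid": refid,
        "stratum": int(st),
        "when": when,               # kept as text: may be '730', '15h' or '-'
        "poll_s": int(poll),
        "reach": reach,             # octal text, e.g. '377'
        "delay_ms": float(delay),
        "offset_ms": float(offset),
        "jitter_ms": float(jitter),
    }


def parse_peers(output):
    """Parse the whole display, skipping the header and separator lines."""
    peers = []
    for line in output.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("=") or stripped.startswith("remote"):
            continue
        peers.append(parse_peer_line(line))
    return peers


if __name__ == "__main__":
    sample = "\n".join([
        "     remote           refid      st t when poll reach   delay   offset  jitter",
        "==============================================================================",
        " 127.127.1.0     .LOCL.          10 l  15h   64    0    0.000    0.000   0.000",
        "*50.116.55.161   192.5.41.40      2 u  730 1024  377  1280.03  -22.906  32.752",
    ])
    for peer in parse_peers(sample):
        print(peer["remote"], peer["offset_ms"], "selected" if peer["selected"] else "")
```

Feed it only the command's output, not the command line itself.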

The routine that starts the NTP daemon should look a lot like this:

Code:

#!/bin/sh
# Start/stop/restart ntpd.

# Start ntpd:
ntpd_start() {
  CMDLINE="/usr/sbin/ntpd -g"
  echo -n "Starting NTP daemon:  $CMDLINE"
  $CMDLINE -p /var/run/ntpd.pid
  echo
}

In particular, the -g in CMDLINE="/usr/sbin/ntpd -g" allows the first adjustment to be big; when the daemon first synchronizes it will slew the system clock (within the limits you've discovered).

So, how does the system time get set? At boot, it should be set from the hardware clock; looks something like this:

Code:

# Set the system time from the hardware clock using hwclock --hctosys.
if [ -x /sbin/hwclock ]; then
  # Check for a broken motherboard RTC clock (where ioports for rtc are
  # unknown) to prevent hwclock causing a hang:
  if ! grep -q -w rtc /proc/ioports ; then
    CLOCK_OPT="--directisa"
  fi
  if grep -wq "^UTC" /etc/hardwareclock ; then
    echo -n "Setting system time from the hardware clock (UTC): "
    /sbin/hwclock $CLOCK_OPT --utc --hctosys
  else
    echo -n "Setting system time from the hardware clock (localtime): "
    /sbin/hwclock $CLOCK_OPT --localtime --hctosys
  fi
  date
fi

So, there's a hint -- the hardware clock is run by a battery on the motherboard and is usually accurate (not as good as your wristwatch, but pretty good). Depending on how you specified your hardware clock -- either local time or UTC -- the above reads it and initially sets the system clock.

On shutdown, the opposite happens:

Code:

# Save the system time to the hardware clock using hwclock --systohc.
if [ -x /sbin/hwclock ]; then
  # Check for a broken motherboard RTC clock (where ioports for rtc are
  # unknown) to prevent hwclock causing a hang:
  if ! grep -q -w rtc /proc/ioports ; then
    CLOCK_OPT="--directisa"
  fi
  if grep -q "^UTC" /etc/hardwareclock 2> /dev/null ; then
    echo "Saving system time to the hardware clock (UTC)."
    /sbin/hwclock $CLOCK_OPT --utc --systohc
  else
    echo "Saving system time to the hardware clock (localtime)."
    /sbin/hwclock  $CLOCK_OPT --localtime --systohc
  fi
fi

You're running NTP, the system clock is synchronized to an external time source, when you shut down, that sets the hardware clock to the correct time. Pretty neat, huh?

So, bottom line here -- your ntpq display looks like you've hard-defined time sources that may not be worthwhile and you might want to try the pool servers (for a day or two) and see if you get better results. That "for a day or two" is meaningful -- NTP takes time to settle down, it does adjustments over time, so let it run for a few days and see.

I would remove the multiple "prefer" and "iburst" options from your configuration (you really don't need them and multiple "prefer," well, not good -- see the comment in the ntp.conf file below).

Just in case it helps, here's a long-term, known-good ntp.conf you may find interesting; the stuff that's commented-out just is not used:

Code:

cat /etc/ntp.conf
# Sample /etc/ntp.conf:  Configuration file for ntpd.
#
# Undisciplined Local Clock. This is a fake driver intended for backup
# and when no outside source of synchronized time is available. The
# default stratum is usually 3, but in this case we elect to use stratum
# 0. Since the server line does not have the prefer keyword, this driver
# is never used for synchronization, unless no other other
# synchronization source is available. In case the local host is
# controlled by some external source, such as an external oscillator or
# another protocol, the prefer keyword would cause the local host to
# disregard all other synchronization sources, unless the kernel
# modifications are in use and declare an unsynchronized condition.
#
server	127.127.1.0	# local clock
fudge	127.127.1.0 stratum 10	
#server  pool.ntp.org
server  0.us.pool.ntp.org
server  1.us.pool.ntp.org
server  2.us.pool.ntp.org

#
# Drift file.
# Put this in a directory which the daemon can write to.
# No symbolic links allowed, either, since the daemon updates the file
# by creating a temporary in the same directory and then rename()'ing
# it to the file.
#
driftfile /etc/ntp/drift
#
# Log file
#
#logconfig=allclock +allpeer +allsys +allsync
#logfile /var/log/ntp.log
#
# Statistics stuff
#
# statsdir /var/log/ntpstats/	# directory for statistics files
# filegen	peerstats file peerstats type day enable
# filegen	loopstats file loopstats type day enable
# filegen	clockstats file clockstats type day enable

multicastclient	224.0.1.1
broadcastdelay	0.008

#
# Keys file.  If you want to diddle your server at run time, make a
# keys file (mode 600 for sure) and define the key number to be
# used for making requests.
# PLEASE DO NOT USE THE DEFAULT VALUES HERE. Pick your own, or remote
# systems might be able to reset your clock at will.
#
#keys		/etc/ntp/keys
#trustedkey	65535
#requestkey	65535
#controlkey	65535

# Don't serve time or stats to anyone else by default (more secure)
restrict default noquery nomodify
# Trust ourselves.  :-)
restrict 127.0.0.1

Hope this helps some.

tony359 · 05-08-2013, 09:53 AM

Tronayne,

Thanks for the very detailed reply, it is really helpful. I do appreciate the time you've taken to write the post.

I have only a couple of problems here. It looks like I do not have DNS on this server - I cannot ping the pool servers, or any www sites, so I am assuming the DNS is not on. It's a special purpose server and the configuration is done by the manufacturer. I have root access and I can change the NTP configuration file (which is going to be replaced by the 'factory' one on reboot), but so far I haven't got the DNS working. I tried the pool servers before and it would not work.

Is there a way to use the pool servers without DNS? Can I just find out the IP number of 0.uk.pool.ntp.org? (and BTW I am in the UK!)

Also, what puzzles me is that everything seems ok but there is nothing to tell you that the NTP is not actually working? Is there a sort of debug mode or log where I can see what is actually happening? Not that I do not trust you, but it seems strange to me that everything looks fine and there is not a way to find out what is wrong.

Finally, the manufacturer actually suggested that I change the polling time, reducing the maximum interval from 1024 s to 256 s (maxpoll 8). After what I found, I feel that this is not a solution. I think that this wrongly assumes that when NTP polls a server, it syncs the clock to it, while the algorithm is actually constantly evaluating the system clock. My opinion is that reducing the poll time is not going to improve things here. Your opinion?
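For reference, ntpd expresses poll intervals as powers of two seconds, so the minpoll and maxpoll values in ntp.conf are exponents rather than seconds. A tiny sketch of the arithmetic (the function name is made up for the example):

```python
def poll_interval_s(exponent):
    """Convert a minpoll/maxpoll exponent to an interval in seconds."""
    return 2 ** exponent


if __name__ == "__main__":
    for exponent in (6, 8, 10):
        print(f"poll exponent {exponent} -> {poll_interval_s(exponent)} s")
```

The usual defaults of 6 and 10 give the 64 s and 1024 s seen in the poll column of ntpq, and an exponent of 8 gives the 256 s mentioned above.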

I will try your configuration file (maybe using static NTP servers for the time being) and I'll post the results shortly.

Thanks!
Tony

tronayne · 05-08-2013, 10:49 AM

You can, of course, ping the pool servers (on a machine with DNS). Downside: they do change from time to time (which is, of course, why you use the pool servers in the first place).

Have you tried adding a DNS server to /etc/resolv.conf? Of the form

Code:

search com
# Google Free DNS Servers
nameserver 8.8.8.8
nameserver 8.8.4.4

(obviously, use public DNS servers -- just two -- available in the UK) and see if that helps. Watch that DHCP (if you're using that) doesn't wipe it out, though (there's a configuration to stop DHCP from overwriting /etc/resolv.conf).

If NTP is not working (like it died) you won't see something like

Code:

ps -ef | grep ntpd
root      1910     1  0 May07 ?        00:00:02 /usr/sbin/ntpd -g -p /var/run/ntpd.pid

If ntpq -pn shows you this sort of display

Code:

ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 127.127.1.0     .LOCL.          10 l  20h   64    0    0.000    0.000   0.000
+50.116.55.161   192.5.41.40      2 u  839 1024  377  1405.66  -31.217  83.114
*65.23.154.62    149.20.64.28     2 u  369 1024  377  1311.92  -20.765  67.435
+66.162.15.65    64.236.96.53     2 u  282 1024  377  1348.51  -24.707  46.598

it is working.

If you must use addresses, ping them like this

Code:

ping -c 5 65.23.154.62
PING 65.23.154.62 (65.23.154.62) 56(84) bytes of data.
64 bytes from 65.23.154.62: icmp_req=1 ttl=45 time=863 ms
64 bytes from 65.23.154.62: icmp_req=2 ttl=45 time=1101 ms
64 bytes from 65.23.154.62: icmp_req=3 ttl=45 time=1200 ms
64 bytes from 65.23.154.62: icmp_req=4 ttl=45 time=776 ms
64 bytes from 65.23.154.62: icmp_req=5 ttl=45 time=1094 ms

--- 65.23.154.62 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3999ms
rtt min/avg/max/mdev = 776.025/1007.216/1200.003/159.899 ms, pipe 2

(that's the server I'm synchronized with); you're looking for the shortest time value and, of course, no dropped packets. Too, keep in mind that my times are longer than yours probably will be because of the satellite delay (22,500 miles up, 22,500 miles down, find the site, then back up and back down again -- takes a while). Pick the best three, put them in your /etc/ntp.conf file as servers and do not use the prefer or iburst directives. Keep in mind that simpler is better.
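If there are many candidate addresses, the comparison can be scripted. The sketch below only parses saved ping output; it assumes the Linux-style "rtt min/avg/max/mdev" summary line shown above, the function names are invented, and the addresses in the demo are documentation placeholders rather than real time servers.

```python
import re

# Matches the summary line printed by Linux ping (assumed format).
RTT_RE = re.compile(r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def avg_rtt_ms(ping_output):
    """Return the average round-trip time in ms, or None if no summary was found."""
    match = RTT_RE.search(ping_output)
    if match is None:
        return None
    return float(match.group(2))


def best_servers(results, count=3):
    """Pick the `count` addresses with the lowest average round-trip time.

    `results` maps an address to the text that ping printed for it.
    Addresses without a usable summary line are ignored.
    """
    measured = [(avg_rtt_ms(output), address) for address, output in results.items()]
    reachable = [(rtt, address) for rtt, address in measured if rtt is not None]
    return [address for rtt, address in sorted(reachable)[:count]]


if __name__ == "__main__":
    demo = {
        "192.0.2.1": "rtt min/avg/max/mdev = 10.0/12.0/14.0/1.0 ms",
        "192.0.2.2": "100% packet loss",
        "192.0.2.3": "rtt min/avg/max/mdev = 5.0/6.0/7.0/1.0 ms",
    }
    print(best_servers(demo))
```

This ranks by average time only; you would still want to check the packet loss figure by eye.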

I'd leave the poll time at the default value unless somebody could prove to me that changing it is actually beneficial. I can't imagine any good reason to do that unless the manufacturer has some magical mystical thing going that would be reasonable. Once you're synchronized, you're synchronized and NTP is managing your system clock; end of story.

Are you using this gadget as your internal network time server?

Hope this helps some.

tronayne · 05-08-2013, 10:58 AM

I forgot...

If you want to log NTP in a more-or-less useful way, try something like this (in the NTPD start up file):

Code:

# Start ntpd:
ntpd_start() {
  # Clear the log file
  >/tmp/ntp.log
  CMDLINE="/usr/sbin/ntpd -g"
  echo -n "Starting NTP daemon:  $CMDLINE"
  $CMDLINE -p /var/run/ntpd.pid -l /tmp/ntp.log
  #$CMDLINE -p /var/run/ntpd.pid
  echo
}

Or, temporarily, make sure it's not running and then simply start it manually with

Code:

>/tmp/ntp.log
/usr/sbin/ntpd -g -p /var/run/ntpd.pid -l /tmp/ntp.log

You can then

Code:

tail -f /tmp/ntp.log

to see what's going on. Remember that it takes a few minutes for the initial synchronization.

Hope this helps some.

tony359 · 05-08-2013, 11:04 AM

Hi,

No, it's a multimedia server. It streams audio and picture to another device. I don't need it to be massively precise, but it may be a problem if after 6 months the clocks are +/- 5 minutes!

I'd rather use IPs for the time being. I selected the ones you saw by pinging them, as you suggested. I will scrap the stratum 1 servers - I did notify them though! - and add everything else you mentioned.

Yes, I do understand that the NTP daemon is working, but I'm still puzzled that there is not a diagnostic tool that tells us what is actually wrong. The offset is going up, we *assume* it's due to the network. It would be nice to have a set of tools that could show what is exactly going wrong and why.
(Just seen the addendum, thanks, I'll give it a try!)

But I'm happy with the empirical way!

I'll keep doing tests and I'll update soon.

Thanks again
Tony

tronayne · 05-08-2013, 01:16 PM

Quote:

Originally Posted by tony359

Yes, I do understand that the NTP daemon is working, but I'm still puzzled that there is not a diagnostic tool that tells us what is actually wrong. The offset is going up, we *assume* it's due to the network. It would be nice to have a set of tools that could show what is exactly going wrong and why.

It's not an assumption; the offset value shows the difference between the reference time and the system clock (in milliseconds).

For a tiny offset ntpd will adjust the local clock as usual; for small and larger offsets, ntpd will reject the reference time for a while. In the latter case the operating system's clock will continue with the last corrections effective while the new reference time is being rejected. After some time, small offsets (significantly less than a second) will be slewed (adjusted slowly), while larger offsets will cause the clock to be stepped (set anew). Huge offsets are rejected, and ntpd will terminate itself, believing something very strange must have happened. Do you see ntpd terminated in your log?

ntpq -pn displays the offsets for each reachable server in milliseconds (ntpdc -p uses seconds instead).

ntpdc -c loopinfo displays the combined offset in seconds, as seen at the last poll. If supported, ntpdc -c kerninfo will display the current remaining correction, just as ntptime does.

For example,

Code:

ntpdc -c loopinfo 
offset:               0.050242 s
frequency:            -9.689 ppm
poll adjust:          30
watchdog timer:       2497 s

and

Code:

ntpdc -c kerninfo
pll offset:           0.0261205 s
pll frequency:        -9.689 ppm
maximum error:        2.83619 s
estimated error:      0.037673 s
status:               2001  pll nano
pll time constant:    10
precision:            1e-09 s
frequency tolerance:  500 ppm

and

Code:

ntptime
ntp_gettime() returns code 0 (OK)
  time d5351340.42643ca8  Wed, May  8 2013 14:09:04.259, (.259342745),
  maximum error 2875694 us, estimated error 37673 us, TAI offset 0
ntp_adjtime() returns code 0 (OK)
  modes 0x0 (),
  offset 25621.470 us, frequency -9.689 ppm, interval 1 s,
  maximum error 2875694 us, estimated error 37673 us,
  status 0x2001 (PLL,NANO),
  time constant 10, precision 0.001 us, tolerance 500 ppm,

Note that yours may not display the above (ntptime) if the kernel does not support it.
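Because these tools report in different units (milliseconds in ntpq, seconds in ntpdc, microseconds in ntptime), it can help to normalise before comparing. A minimal sketch that reads saved `ntpdc -c loopinfo` output, assuming the "name: value unit" layout shown above; the function name is invented for the example.

```python
def parse_loopinfo(text):
    """Turn `name: value [unit]` lines into a dict of floats."""
    values = {}
    for line in text.splitlines():
        name, separator, rest = line.partition(":")
        if not separator or not rest.strip():
            continue                      # not a "name: value" line
        try:
            values[name.strip()] = float(rest.split()[0])
        except ValueError:
            continue                      # value was not numeric; skip it
    return values


if __name__ == "__main__":
    sample = "\n".join([
        "offset:               0.050242 s",
        "frequency:            -9.689 ppm",
        "poll adjust:          30",
        "watchdog timer:       2497 s",
    ])
    info = parse_loopinfo(sample)
    # ntpdc reports seconds; ntpq -p reports milliseconds.
    print(f"offset = {info['offset'] * 1000:.3f} ms")
```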

Bottom line? Make sure you've got electrically close servers defined (meaning shortest ping times) or use the pool servers (meaning you most likely have to put entries in /etc/resolv.conf), start 'er up and let it run for at least three days -- it takes that long for things to really settle down.

Hope this helps some.

tony359 · 05-08-2013, 01:23 PM

Quote:

Hope this helps some.

It does.

I'll be back in three days!

Thanks!

edit: in the meantime...

Monitoring the log as you suggested, it came up with this:

Quote:

8 May 17:52:30 ntpd[26069]: logging to file /tmp/ntp.log
8 May 17:52:30 ntpd[26069]: precision = 1000.000 usec
8 May 17:52:30 ntpd[26069]: unable to bind to wildcard socket address 0.0.0.0 - another process may be running - EXITING

Any reasons to be worried?

tony359 · 05-14-2013, 02:20 AM

Hello,

It seems to work, as predicted.
I think I read somewhere that there is a way to graph the offset value - or just to log it, I'll then process the file.
Any thoughts on how I can do it? I'll come back with more details later.

edit: would that work? I can't see any files in the /var/log/ntp folder, are they only created at the end of the day?

Code:

server 127.127.1.0
fudge 127.127.1.0 stratum 10
server 158.43.128.66
server 81.168.77.149
server 130.88.200.4
driftfile /etc/ntp/drift
logconfig=allclock +allpeer +allsys +allsync
logfile /var/log/ntp.log
multicastclient 224.0.1.1
broadcastdelay 0.008
restrict 127.0.0.1
statistics loopstats
statsdir /var/log/ntp/
filegen peerstats file peers type day link enable
filegen loopstats file loops type day link enable

Thanks
Tony

tronayne · 05-14-2013, 07:28 AM

So, NTP has walked your clock into synchronization? OK, that tells you that everything is working as it should (and you can probably stop fiddling with it and just ignore it).

Note that there are at least two ways to log what NTP is doing. One is the simple log that gets started by

Code:

# Start ntpd:
ntpd_start() {
  # Clear the log file
  >/tmp/ntp.log
  CMDLINE="/usr/sbin/ntpd -g"
  echo -n "Starting NTP daemon:  $CMDLINE"
  $CMDLINE -p /var/run/ntpd.pid -l /tmp/ntp.log
  #$CMDLINE -p /var/run/ntpd.pid
  echo
}

this is the NTP daemon start at boot time. That log is quite useful. Note that it logs into the /tmp directory which can simply be changed to /var/log/ntp/ntp.log.

Another way is here: If you've defined your logging to go into /var/log/ntp, you need a definition of that in /etc/ntp.conf; e.g.,

Code:

#
# Log file
#
logconfig=allclock +allpeer +allsys +allsync
logfile /var/log/ntp/ntp.log

It also seems that you want the statistics files:

Code:

# Statistics stuff
#
statsdir /var/log/ntpstats/   # directory for statistics files
filegen       peerstats file peerstats type day enable
filegen       loopstats file loopstats type day enable
filegen       clockstats file clockstats type day enable

In either or both cases, you need to manually create the directory(ies):

Code:

su -
<root password>
mkdir -p /var/log/ntp
mkdir -p /var/log/ntpstats

Be aware that the statistics files get... well, big and they need to be dealt with on a weekly basis (and, frankly, they don't show you a heck of a lot unless you've got a big server farm and you're using the box you've got to serve time to everybody else on your intranet). If you choose that route, you probably want to get a radio or GPS clock with an Ethernet connection on it and use that to serve time to your entire network (and those things ain't cheap). One server can serve time to all without a lot of difficulty (been there, did that, it works) -- NTP does not place a great load on your intranet. You do not need to mess with keys on a private network but you need to decide that for yourself.

There is a great deal of information about monitoring at file:///usr/doc/ntp-4.2.6p5/html/monopt.html (that's the NTP manual that should be installed in /usr/doc/ntp-whatever/monopt.htm on your system); if it's not in /usr/doc, look around for it or go to http://www.ntp.org and see the documentation pages.

Now, in most installations there are examples found in /usr/doc/ntp-whatever/scripts. In particular, /usr/doc/ntp-whatever/scripts/monitoring, where you will find a README file and a group of utilities for dealing with the statistics data (and some other stuff).

Read the README. Pay attention to the warnings. Really, pay attention to the warnings.

There are examples that feed data to gnuplot (you should have that on your system, it's kind of a standard -- if not, go get and install it from your distribution software archive). That's what you use to make pretty graphs.

Be aware that the sample utilities will most likely require a little editing for the path to the log files you've created -- the paths and file names I suggested above and elsewhere may not be what's in those example files and you'll need to twiddle some things to get them going.
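If you would rather roll your own, the loopstats file is simple to turn into plottable data. The sketch below assumes the documented loopstats layout (Modified Julian Date, seconds past midnight UTC, offset in seconds, frequency in ppm, then further fields); check the monitoring documentation for your ntpd version before relying on it. The function names and the sample line are illustrative only.

```python
from datetime import datetime, timedelta, timezone

MJD_UNIX_EPOCH = 40587   # Modified Julian Date of 1970-01-01
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_loopstats_line(line):
    """Return (UTC timestamp, offset in ms, frequency in ppm) for one record."""
    fields = line.split()
    mjd = int(fields[0])
    seconds_past_midnight = float(fields[1])
    offset_s = float(fields[2])
    frequency_ppm = float(fields[3])
    when = UNIX_EPOCH + timedelta(days=mjd - MJD_UNIX_EPOCH,
                                  seconds=seconds_past_midnight)
    return when, offset_s * 1000.0, frequency_ppm


def loopstats_to_rows(lines):
    """Yield 'ISO-timestamp offset_ms frequency_ppm' rows, e.g. for gnuplot."""
    for line in lines:
        if not line.strip():
            continue
        when, offset_ms, frequency_ppm = parse_loopstats_line(line)
        yield f"{when.strftime('%Y-%m-%dT%H:%M:%S')} {offset_ms:.3f} {frequency_ppm:.3f}"


if __name__ == "__main__":
    # Illustrative record in the assumed format, not taken from a real system.
    sample = ["50935 75440.031 0.000006019 13.778190 0.000351733 0.013380 6"]
    for row in loopstats_to_rows(sample):
        print(row)
```

Redirect the rows to a file and plot column 2 against column 1 to see whether the offset settles around zero.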

At one time, some years ago, I did use the statistics to look at some 250 machines using a central NTP time server (25 Solaris boxes, 4 Linux servers [I did say this was a long time ago] and the rest desktop winders things). To be honest, it was more information than I needed and I turned it off after a month or so -- NTP does work and keeps on working if you get it configured and then just leave it alone and let it do its thing.

One final thing you'll want to do (if you're logging) is rotate the logs (they can grow quite large).

Somewhere, hopefully in /etc/logrotate.d, you will want to put this:

Code:

cat ntpd
/var/log/ntp.log {
  rotate 10
  notifempty
  missingok
  compress
  delaycompress
  sharedscripts
  postrotate
    /etc/rc.d/rc.ntpd restart
  endscript
}

That will rotate the log weekly, compressing the log that was rotated and keeping 10 weeks' worth (the oldest falls off the end). Notice that this example is from my system and, depending upon which logging you choose (from the top of this post), you'll need to edit the location of the log files and make sure that an empty file gets created. When I'm logging, I do that in /etc/rc.d/rc.ntpd with

Code:

# Empty the log file
>/var/log/ntp.log

which is executed when the daemon gets started or restarted.

It'll take some fiddling and twiddling and reviewing what you get until you're happy with the information (and not, you know, overwhelmed by it). Take it slow, try one thing at a time and you'll get there.

Hope this helps some.

tony359 · 05-14-2013, 09:20 AM

Thanks again, you're a mine of clear and precious information.

I cannot "leave" it unfortunately! I'll explain why: as mentioned, these are multimedia servers. The NTP.conf is created during boot by the software that runs on the server. When configuring this software, you are allowed to enter only ONE NTP server, which is then embedded in their configuration, which is similar to the one I quoted in my first post - but only with one server.

Now, I have found that the NTP does not work on these servers, and apparently we now know why. I am working with the manufacturer, which came up with the suggestion I mentioned. I need to show them that my configuration works. To do so, I need to graph the offset, to confirm to them that it eventually settles around zero. I will graph another machine with the configuration the manufacturer suggested and then I'll report to them.

Bottom line, I need the manufacturer to fix the configuration in their configuration files! And I need to show them that there is a problem with the configurations they provide.

I have amended the configuration with the statistics you suggested. Should I expect to see something happening in /var/log/ntpstats/ in the short term, or will the file be created after a day?

Also, to restart the NTP, is

/etc/init.d/ntp restart

the correct command? Can I stop the NTP with

/etc/init.d/ntp stop

and then restart it using

/usr/sbin/ntpd -g

Thanks again for your time.

Cheers,
Tony

tronayne · 05-14-2013, 09:56 AM

You have init.d, so that's what you want to use to start, stop and restart the daemon. It's just a file, you can look at it (it's similar to the /etc/rc.d/rc.ntpd on my machines). There should be a start, stop, restart in it and you should be able to

Code:

/etc/init.d/ntp restart

If you make any changes to /etc/ntp.conf you will have to restart the daemon. Be sure to have only one instance of the daemon running: use ps -ef | grep ntpd, and you should see only one. If there is more than one, you can manually kill all of them by PID with

Code:

kill -9 PID

where PID is the number shown by ps -ef | grep ntpd; e.g.,

Code:

ps -ef | grep ntpd
root      1910     1  0 May07 ?        00:00:17 /usr/sbin/ntpd -g -p /var/run/ntpd.pid
root      1920     1  0 May07 ?        00:00:17 /usr/sbin/ntpd -g -p /var/run/ntpd.pid
trona     3505  3490  0 10:45 pts/0    00:00:00 grep ntpd

You would kill both of those with

Code:

kill -9 1910 1920

(don't bother killing the "grep" one, it won't exist).

Then start the daemon with

Code:

/etc/init.d/ntp start

The status stuff doesn't get generated too often so you won't see content for a while (I don't remember how often, but it's not minute-by-minute, more like hours-by-hours, maybe lots of hours).

I just love systems that vendors lock in so that you can't do anything with them (it's the Microsoft Click-'n'-Drool school taken from the sublime to the ridiculous, I think). They figure you're too stupid to read and follow directions so they do it to you like it or not. Kind of guys I won't buy anything from, them.

Hope this helps some.

tony359 · 05-14-2013, 10:07 AM

Thanks.

I won't mention what the system is and what it's for but believe me: they're right!!

I'll follow your suggestions and come back in a few days' time as usual!

cheers,
Tony

tony359 · 05-16-2013, 02:38 AM

Ok, Houston, we've got a question!

Question 1: after a day or more still no sign of files in /var/log/ntpstats.
Question 2: today I logged in to find that all the NTP servers are marked as "reject". The reach field said "7", then moved to "17" I think, then "3", which I understand is not a good thing. I have monitored the first server and it looks like it's not responding? Please see below.

All servers are pingable from the terminal.

Code:

ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(0)        .LOCL.          10 l    6   64  377    0.000    0.000   0.977
 158.43.128.66   193.67.79.202    2 u  985 1024    7   12.126  -133.29  17.553
 81.168.77.149   82.219.4.30      3 u   19 1024   17   39.850  -149.68  15.667
 130.88.200.4    194.66.31.14     2 u   37 1024   17   16.086  -142.64  16.331


ntpq> as

ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 52841  90f4   yes   yes  none    reject   reachable 15
  2 52842  96f4   yes   yes  none  sys.peer   reachable 15
  3 52843  90f4   yes   yes  none    reject   reachable 15
  4 52844  90f4   yes   yes  none    reject   reachable 15
ntpq>

Code:

ntpq> rv 52842
assID=52842 status=96f4 reach, conf, sel_sys.peer, 15 events, event_reach,
srcadr=158.43.128.66, srcport=123, dstadr=10.1.1.30, dstport=123,
leap=00, stratum=2, precision=-18, rootdelay=8.102,
rootdispersion=11.826, refid=193.67.79.202, reach=017, unreach=0,
hmode=3, pmode=4, hpoll=10, ppoll=10, flash=00 ok, keyid=0, ttl=0,
offset=-147.987, delay=12.032, dispersion=948.986, jitter=14.697,
reftime=d53f0592.32cc2000  Thu, May 16 2013  8:13:22.198,
org=d53f060d.c3a8d000  Thu, May 16 2013  8:15:25.764,
rec=d53f060d.eb158e06  Thu, May 16 2013  8:15:25.918,
xmt=d53f060d.e7fdb322  Thu, May 16 2013  8:15:25.906,
filtdelay=    12.03   12.13   11.77   12.02    0.00    0.00    0.00    0.00,
filtoffset= -147.99 -133.29 -115.74  -98.83    0.00    0.00    0.00    0.00,
filtdisp=      0.98   16.36   31.72   47.08 16000.0 16000.0 16000.0 16000.0

Another RV after 1024s

Code:

ntpq> rv 52842
assID=52842 status=90f4 reach, conf, 15 events, event_reach,
srcadr=158.43.128.66, srcport=123, dstadr=10.1.1.30, dstport=123,
leap=00, stratum=2, precision=-18, rootdelay=8.057,
rootdispersion=11.978, refid=193.67.79.202, reach=001, unreach=1,
hmode=3, pmode=4, hpoll=6, ppoll=6, flash=400 peer_dist, keyid=0, ttl=0,
offset=-2.708, delay=10.554, dispersion=7937.990, jitter=0.977,
reftime=d53f0992.329f7000  Thu, May 16 2013  8:30:26.197,
org=d53f0a24.befa2000  Thu, May 16 2013  8:32:52.746,
rec=d53f0a24.c105701c  Thu, May 16 2013  8:32:52.753,
xmt=d53f0a24.be4e0419  Thu, May 16 2013  8:32:52.743,
filtdelay=    10.55    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=   -2.71    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtdisp=      0.98 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0
ntpq>

I can try and restart the NTP, but I would like to know the root cause first!

This is my current configuration

Code:

server 127.127.1.0
fudge 127.127.1.0 stratum 10
server 158.43.128.66
server 81.168.77.149
server 130.88.200.4
driftfile /etc/ntp/drift
logconfig=allclock +allpeer +allsys +allsync
logfile /var/log/ntp.log
multicastclient 224.0.1.1
broadcastdelay 0.008
restrict 127.0.0.1
statsdir /var/log/ntpstats/
filegen peerstats file peerstats type day enable
filegen loopstats file loopstats type day enable
filegen clockstats file clockstats type day enable

My ntp.log does not show anything since the 13th, I must have done something wrong. The only thing I have added was that "logconfig", is that ok?

Thanks!

tronayne · 05-16-2013, 07:41 AM

OK, I've just turned logging and statistics back on (and restarted NTPD).

Log file:

Code:

# Start ntpd:
ntpd_start() {
  # Clear the log file
  >/var/log/ntp.log
  CMDLINE="/usr/sbin/ntpd -g"
  echo -n "Starting NTP daemon:  $CMDLINE"
  $CMDLINE -p /var/run/ntpd.pid -l /var/log/ntp.log
  #$CMDLINE -p /var/run/ntpd.pid
  echo
}

The log entries so far:

Code:

cat /var/log/ntp.log
16 May 08:13:04 ntpd[32453]: proto: precision = 2.334 usec
16 May 08:13:04 ntpd[32453]: ntp_io: estimated max descriptors: 1024, initial socket boundary: 16
16 May 08:13:04 ntpd[32453]: Listen and drop on 0 v4wildcard 0.0.0.0 UDP 123
16 May 08:13:04 ntpd[32453]: Listen and drop on 1 v6wildcard :: UDP 123
16 May 08:13:04 ntpd[32453]: Listen normally on 2 lo 127.0.0.1 UDP 123
16 May 08:13:04 ntpd[32453]: Listen normally on 3 eth0 192.168.1.10 UDP 123
16 May 08:13:04 ntpd[32453]: Listen normally on 4 eth0 fe80::210:18ff:fe8a:82c1 UDP 123
16 May 08:13:04 ntpd[32453]: Listen normally on 5 lo ::1 UDP 123
16 May 08:13:04 ntpd[32453]: peers refreshed
16 May 08:13:04 ntpd[32453]: Listening on routing socket on fd #22 for interface updates
16 May 08:13:04 ntpd[32453]: Listen normally on 6 multicast 224.0.1.1 UDP 123
16 May 08:13:04 ntpd[32453]: Joined 224.0.1.1 socket to multicast group 224.0.1.1

The statistics (in /etc/ntp.conf):

Code:

# Statistics stuff
#
  statsdir /var/log/ntpstats/	# directory for statistics files
  filegen	peerstats file peerstats type day enable
  filegen	loopstats file loopstats type day enable
  filegen	clockstats file clockstats type day enable

The content of /var/log/ntpstats:

Code:

ls -l /var/log/ntpstats
total 16
-rw-r--r-- 2 root root 295 May 16 08:17 loopstats
-rw-r--r-- 2 root root 295 May 16 08:17 loopstats.20130516
-rw-r--r-- 2 root root 989 May 16 08:18 peerstats
-rw-r--r-- 2 root root 989 May 16 08:18 peerstats.20130516

Each of those files has content; e.g., loopstats:

Code:

cat /var/log/ntpstats/loopstats
56428 43985.151 0.000000000 -11.874 0.000001907 0.000003 6
56428 44049.151 0.000000000 -11.874 0.000001907 0.000002 6
56428 44113.151 0.000000000 -11.874 0.000001907 0.000002 6
56428 44177.151 0.000000000 -11.874 0.000001907 0.000002 6
56428 44241.151 0.000000000 -11.874 0.000001907 0.000002 6
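
For reading those loopstats lines: as far as I know the columns are the Modified Julian Day, seconds past midnight UTC, clock offset in seconds, frequency offset in ppm, jitter in seconds, wander in ppm, and the clock discipline time constant. A rough sketch (loopstats_summary is a made-up name) that turns each line into a Unix timestamp, the offset in milliseconds and the frequency in ppm:

```shell
# Hypothetical helper: summarise ntpd loopstats lines read from stdin as
#   <unix epoch seconds> <offset in ms> <frequency offset in ppm>
# Assumed column layout: MJD, seconds past midnight (UTC), offset (s),
# frequency (ppm), jitter (s), wander (ppm), time constant.
loopstats_summary() {
  # MJD 40587 is 1970-01-01, so (MJD - 40587) * 86400 is midnight UTC of
  # that day in Unix time; fractional seconds are dropped.
  awk '{ printf "%d %.3f %.3f\n", ($1 - 40587) * 86400 + int($2), $3 * 1000, $4 }'
}

# Usage (not executed here):
#   loopstats_summary < /var/log/ntpstats/loopstats
```

With GNU date, a timestamp from the first column can be turned back into a readable time with date -u -d @TIMESTAMP.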

At this writing, NTP has not synchronized (takes a few minutes -- it's still looking at localhost) and ntpq shows:

Code:

ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*127.127.1.0     .LOCL.          10 l   55   64  177    0.000    0.000   0.002
 69.164.217.193  128.59.59.177    3 u   43   64  177  1270.55   25.188  28.224
 50.116.55.161   192.5.41.40      2 u   45   64  177  1328.45  -35.175  42.605
 128.113.28.67   18.26.4.105      2 u   43   64  177  1313.59  -16.841  56.794

OK, it synchronized:

Code:

ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 127.127.1.0     .LOCL.          10 l   65   64  376    0.000    0.000   0.002
+69.164.217.193  128.59.59.177    3 u   43   64  377  1066.87  -99.831 128.517
+50.116.55.161   192.5.41.40      2 u   46   64  377  1381.67  -22.610  90.407
*128.113.28.67   18.26.4.105      2 u   46   64  377  1325.91   26.532  36.315

It'll take some time for it to settle down and, maybe, reject one or both of the "+" addresses, but the offset to the "*" is 26.532 milliseconds so it's a happy camper at the moment. There is no indication in the log file that it synchronized (and won't be for a while) and there haven't been any "throw out one of these and get another time source" messages (there will be, but not for some time and not too often -- it'll only change if a better source of time is found when it analyzes periodically).
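
If you want to watch just the selected source rather than eyeball the whole table, something like this pulls out the "*" line and its offset. It is a sketch only: sys_peer_offset is a made-up name and it assumes the ten-column ntpq -pn layout shown above, where offset is the ninth column.

```shell
# Hypothetical helper: read `ntpq -pn` output from stdin and print
#   <address> <offset in ms>
# for the peer marked with the "*" tally code (the current sys.peer).
# Prints nothing if no peer has been selected yet.
sys_peer_offset() {
  # substr($1, 2) strips the leading "*" from the remote column.
  awk '/^\*/ { print substr($1, 2), $9 }'
}

# Usage (not executed here):
#   ntpq -pn | sys_peer_offset
```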

Note that I did not turn on logging in /etc/ntp.conf; I used -l /var/log/ntp.log in the start-up command (the one in /etc/init.d for you, as above). I'll let everything settle down for a while, then change that to the internal logging and see what happens (actually, I know that it'll work, I just don't find that logging interesting and prefer the "command line" option). Actually, I shut logging and statistics off a year or two ago on all servers and don't really remember all that much about either anymore. Sigh.

OK, things have settled down:

Code:

ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 127.127.1.0     .LOCL.          10 l  920   64    0    0.000    0.000   0.000
+69.164.217.193  128.59.59.177    3 u   39  128  377  1312.52   38.031  35.231
*50.116.55.161   192.5.41.40      2 u  119  128  377  1316.28   45.418  20.799
+128.113.28.67   18.26.4.105      2 u   43  128  377  1403.99   47.553  51.225

and all is well that ends.
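
A note on the reach column in those listings: it is an 8-bit shift register printed in octal. Each poll shifts it left by one and sets the lowest bit if a reply came back. A small sketch to expand it into bits (reach_bits is a made-up name; the newest poll is the rightmost bit):

```shell
# Hypothetical helper: expand an ntpq "reach" value (octal) into its 8 bits.
# 1 = that poll was answered, 0 = it was not; the rightmost bit is the most
# recent poll.
reach_bits() {
  n=$((0$1))        # the leading 0 makes the shell read the value as octal
  bits=""
  i=7
  while [ "$i" -ge 0 ]; do
    bits="$bits$(( (n >> i) & 1 ))"
    i=$((i - 1))
  done
  echo "$bits"
}

# Usage (not executed here):
#   reach_bits 377    # all of the last eight polls answered
#   reach_bits 17     # the last four polls answered
```

So 377 is a full register, while small values like 3, 7 or 17 with no gaps mean every poll since the register was last cleared was answered, which is what you see shortly after a restart. A 376 means the most recent poll went unanswered.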

Hope this helps some.