NTP offset
Hello everybody!
My first message here! I am dealing with an NTP issue. I have searched the web for a couple of days and picked up a basic knowledge of how the NTP protocol works, but I am still a bit puzzled and have a few questions.

Everything started when some of our servers could not keep time, ending up with awful OFFSET values and drifting. I understand that NTP is not a simple synchronisation task that steps the time every time it runs: it's an algorithm that polls the time from a number of servers and assesses the accuracy of the system clock, coming up with a way to slew it so the user never sees the time jump. Only when the time is wildly out will NTP step the clock "one off" to restore it. I also understand that NTP will step the time when the offset is over 128ms and that it will refuse to operate when the clock is more than 1000s off. Also, that NTP can only slew by a limited amount, if memory serves about 43s per day (500 ppm).

I have set up NTP on my server using 4 external NTP servers, stratum 1, 2 and 3. The offset varies from 0 to something as big as 600, and I do not know why. Yesterday I first set my clock 'one off' manually (using NTPDATE, with NTPD off), then I amended my configuration file and started the NTP daemon. After a while the NTP.DRIFT file was populated. I have been monitoring NTP using NTPQ and could not find anything obvious (to me). My problem is that the OFFSET value randomly jumps from 0-ish to 300/500, and I am not sure whether that behaviour is normal. I will keep monitoring with my latest configuration (previously I was only using one NTP server). My other servers eventually drift to 30000/50000 until NTP comes up with a "frequency error".

Here is my current configuration (while I was writing, the offset drifted to 1000!) Code:
server 130.88.200.4 prefer iburst true
I understand that this could be caused by a hardware clock that drifts too much, but I am still puzzled. Also, over 128ms the NTP daemon should step the time, so why does that not happen? What I did notice is that the value in the DRIFT file keeps changing. It was 49, then -5.99, now it's 24.032. My logs do not show anything strange. Code:
7 May 18:27:32 ntpd[4093]: synchronized to 130.88.200.4, stratum 2
Can anybody help me? Please let me know if you need further details. Thanks! Tony |
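The thresholds mentioned in the first post can be sanity-checked with a little arithmetic. This is only a sketch, assuming the stock ntpd limits of a 500 ppm maximum slew rate and a 128 ms step threshold; the constant and function names are made up for illustration and are not ntpd identifiers.

```python
# Assumed stock ntpd limits; the names are illustrative, not taken from ntpd.
MAX_SLEW_PPM = 500        # maximum slew rate, in parts per million
STEP_THRESHOLD_S = 0.128  # offsets above this are candidates for a step
SECONDS_PER_DAY = 86_400


def max_slew_per_day(ppm: float = MAX_SLEW_PPM) -> float:
    """Seconds of correction that slewing at `ppm` can apply in one day."""
    return ppm * 1e-6 * SECONDS_PER_DAY


def seconds_to_slew(offset_s: float, ppm: float = MAX_SLEW_PPM) -> float:
    """Minimum wall-clock time needed to slew away `offset_s` at `ppm`."""
    return abs(offset_s) / (ppm * 1e-6)


if __name__ == "__main__":
    print(f"max slew per day: {max_slew_per_day():.1f} s")
    print(f"time to slew 128 ms: {seconds_to_slew(STEP_THRESHOLD_S):.0f} s")
```

Under those assumptions, a clock whose intrinsic drift is worse than the slew limit can never be held in sync by slewing alone, which is one way a "frequency error" can arise.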
You're correct that those offset values look like a problem, and that kind of problem is almost always network related.
You don't say where in the world you are (I'll assume North America); here's a suggestion that you try the pool servers plus the "local clock" as here: Code:
server 127.127.1.0 # local clock Code:
ntpq -pn The inclusion of Code:
server 127.127.1.0 # local clock
What NTP does is evaluate the time servers to select the "best" one of them to synchronize to. That will change from time to time -- it will periodically replace servers that are slow, noisy, or just plain gone out of service, which is why it's a good idea to use the pool servers rather than specifying specific addresses. The drift file, generally /etc/ntp/drift, will change over time as NTP evaluates your system clock versus a time standard. That value should not swing wildly, but it will change for a while and then settle down -- it can take a few days for that to happen. When you first implement NTP your system clock may be in never-never land somewhere and it's a good idea to initially set the system clock with ntpdate. Once NTP synchronizes, though, you should not need to do that. The system clock is a "software clock" run by the kernel via interrupts. On boot, it is initially set from the hardware clock; then NTP keeps it on time once synchronized to an external time source. My systems are currently synchronized to Code:
*50.116.55.161 192.5.41.40 2 u 730 1024 377 1280.03 -22.906 32.752
The routine that starts the NTP daemon should look a lot like this: Code:
#!/bin/sh
So, how does the system time get set? At boot, it should be set from the hardware clock; it looks something like this: Code:
# Set the system time from the hardware clock using hwclock --hctosys.
On shutdown, the opposite happens: Code:
# Save the system time to the hardware clock using hwclock --systohc.
So, bottom line here -- your ntpq display looks like you've hard-defined time sources that may not be worthwhile, and you might want to try the pool servers (for a day or two) and see if you get better results. That "for a day or two" is meaningful -- NTP takes time to settle down and makes its adjustments gradually, so let it run for a few days and see. I would remove the multiple "prefer" and "iburst" options from your configuration (you really don't need them and multiple "prefer," well, not good -- see the comment in the ntp.conf file below). Just in case it helps, here's a long-term, known-good ntp.conf you may find interesting; the stuff that's commented out is just not used: Code:
cat /etc/ntp.conf |
Tronayne,
Thanks for the very detailed reply, it is really helpful. I do appreciate the time you've taken to write the post. I have only a couple of problems here.

It looks like I do not have DNS on this server - I cannot ping the pool servers, or any www sites, so I am assuming DNS is not set up. It's a special purpose server and the configuration is done by the manufacturer. I have root access and I can change the NTP configuration file (which is going to be replaced by the 'factory' one on reboot), but so far I haven't got DNS working. I tried the pool servers before and they would not work. Is there a way to use the pool servers without DNS? Can I just find out the IP number of 0.uk.pool.ntp.org? (And BTW I am in the UK!)

Also, what puzzles me is that everything seems ok, yet there is nothing to tell you that NTP is not actually working. Is there a sort of debug mode or log where I can see what is actually happening? Not that I do not trust you, but it seems strange to me that everything looks fine and there is no way to find out what is wrong.

Finally, the manufacturer actually suggested that I change the polling interval, reducing the maximum from 1024s to 256s (maxpoll=8). After what I found, I feel that this is not a solution. I think it wrongly assumes that when NTP polls a server, it syncs the clock to it, while the algorithm is actually evaluating the system clock constantly. My opinion is that reducing the polling interval is not going to improve things here. Your opinion?

I will try your configuration file (maybe using static NTP servers for the time being) and I'll post the results shortly. Thanks! Tony |
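On the polling interval: the minpoll and maxpoll values in ntp.conf are exponents of two, in seconds, which is where the 1024s and 256s figures above come from. A tiny sketch of the conversion (the function name is made up for illustration):

```python
def poll_interval_seconds(exponent: int) -> int:
    """Convert an ntp.conf minpoll/maxpoll exponent to seconds (2**exponent)."""
    return 2 ** exponent


if __name__ == "__main__":
    # 6 and 10 are the usual ntpd defaults for minpoll and maxpoll;
    # 8 is the value the manufacturer suggested for maxpoll.
    for exp in (6, 8, 10):
        print(f"poll exponent {exp} -> {poll_interval_seconds(exp)} s")
```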
You can, of course, ping the pool servers (on a machine with DNS). Downside: they do change from time to time (which is, of course, why you use the pool servers in the first place).
Have you tried adding a DNS server to /etc/resolv.conf? Of the form Code:
search com
If NTP is not working (like it died) you won't see something like Code:
ps -ef | grep ntpd Code:
ntpq -pn
If you must use addresses, ping them like this Code:
ping -c 5 65.23.154.62
I'd leave the poll time at the default value unless somebody could prove to me that changing it is actually beneficial. I can't imagine any good reason to do that unless the manufacturer has some magical mystical thing going that would make it reasonable. Once you're synchronized, you're synchronized and NTP is managing your system clock; end of story. Are you using this gadget as your internal network time server? Hope this helps some. |
I forgot...
If you want to log NTP in a more-or-less useful way, try something like this (in the NTPD start up file): Code:
# Start ntpd: Code:
>/tmp/ntp.log Code:
tail -f /tmp/ntp.log
Hope this helps some. |
Hi,
No, it's a multimedia server. It streams audio and pictures to another device. I don't need it to be massively precise, but it may be a problem if after 6 months the clocks are +/- 5 minutes! I'd rather use IPs for the time being. I selected the ones you saw by pinging them, as you suggested. I will scrap the stratum 1 servers - I did notify them though! - and add everything else you mentioned. Yes, I do understand that the NTP daemon is working, but I'm still puzzled that there is no diagnostic tool that tells us what is actually wrong. The offset is going up, and we *assume* it's due to the network. It would be nice to have a set of tools that could show exactly what is going wrong and why. (Just seen the addendum, thanks, I'll give it a try!) But I'm happy with the empirical way! I'll keep doing tests and I'll update soon. Thanks again Tony |
Quote:
For a tiny offset ntpd will adjust the local clock as usual; for small and larger offsets, ntpd will reject the reference time for a while. In the latter case the operating system's clock will continue with the last corrections effective while the new reference time is being rejected. After some time, small offsets (significantly less than a second) will be slewed (adjusted slowly), while larger offsets will cause the clock to be stepped (set anew). Huge offsets are rejected, and ntpd will terminate itself, believing something very strange must have happened.
Do you see ntpd terminated in your log? ntpq -pn displays the offsets for each reachable server in milliseconds (ntpdc -p uses seconds instead). ntpdc -c loopinfo displays the combined offset in seconds, as seen at the last poll. If supported, ntpdc -c kerninfo will display the current remaining correction, just as ntptime does. For example, Code:
ntpdc -c loopinfo Code:
ntpdc -c kerninfo Code:
ntptime
Bottom line? Make sure you've got electrically close servers defined (meaning shortest ping times) or use the pool servers (meaning you most likely have to put entries in /etc/resolv.conf), start 'er up and let it run for at least three days -- it takes that long for things to really settle down. Hope this helps some. |
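Since ntpq -pn reports per-server offsets in milliseconds, the offset of the currently selected peer (the line marked with *) can be pulled out with a short script. This is a sketch only: it assumes the usual ten-column billboard layout (remote, refid, st, t, when, poll, reach, delay, offset, jitter) and numeric addresses as produced by -n. The peer line in the sample is the one quoted earlier in this thread; the helper names are made up for illustration.

```python
from typing import NamedTuple, Optional

# Tally characters ntpq may print in front of the remote address.
TALLY_CODES = "*+-#x.o"


class Peer(NamedTuple):
    tally: str
    remote: str
    refid: str
    stratum: int
    poll: int
    reach: str        # octal string, e.g. "377"
    delay_ms: float
    offset_ms: float
    jitter_ms: float


def parse_peer_line(line: str) -> Optional[Peer]:
    """Parse one `ntpq -pn` peer line; return None for headers or junk."""
    line = line.strip()
    if not line:
        return None
    tally = line[0] if line[0] in TALLY_CODES else " "
    fields = (line[1:] if tally != " " else line).split()
    if len(fields) != 10:
        return None
    try:
        return Peer(tally, fields[0], fields[1], int(fields[2]),
                    int(fields[5]), fields[6],
                    float(fields[7]), float(fields[8]), float(fields[9]))
    except ValueError:  # header line: "st", "poll", ... are not numbers
        return None


def selected_offset_ms(ntpq_output: str) -> Optional[float]:
    """Offset (ms) of the peer marked '*', or None if nothing is selected."""
    for line in ntpq_output.splitlines():
        peer = parse_peer_line(line)
        if peer is not None and peer.tally == "*":
            return peer.offset_ms
    return None


SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*50.116.55.161   192.5.41.40      2 u  730 1024  377 1280.03  -22.906  32.752
"""

if __name__ == "__main__":
    print(selected_offset_ms(SAMPLE))
```

Running this from cron against real ntpq -pn output and appending the result to a file would be one crude way to log the offset over time.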
Quote:
I'll be back in three days! :) Thanks! edit: in the meantime... :) Monitoring the log as you suggested, it came up with this: Quote:
|
Hello,
It seems to work, as predicted. I think I read somewhere that there is a way to graph the offset value - or just to log it, and I'll then process the file. Any thoughts on how I can do it? I'll come back with more details later. edit: would this work? I can't see any files in the /var/log/ntp folder, are they only created at the end of the day? Code:
server 127.127.1.0
Tony |
So, NTP has walked your clock into synchronization? OK, that tells me that everything is working as it should (and you can probably stop fiddling with it and just ignore it).
Note that there are at least two ways to log what NTP is doing. One is the simple log that gets started by Code:
# Start ntpd:
Another way is here: If you've defined your logging to go into /var/log/ntp, you need a definition of that in /etc/ntp.conf; e.g., Code:
# Code:
# Statistics stuff Code:
su -
There is a great deal of information about monitoring at file:///usr/doc/ntp-4.2.6p5/html/monopt.html (that's the NTP manual that should be installed in /usr/doc/ntp-whatever/html/monopt.html on your system; if it's not in /usr/doc, look around for it or go to http://www.ntp.org and see the documentation pages).

Now, in most installations there are examples found in /usr/doc/ntp-whatever/scripts. In particular, look at /usr/doc/ntp-whatever/scripts/monitoring, where you will find a README file and a group of utilities for dealing with the statistics data (and some other stuff). Read the README. Pay attention to the warnings. Really, pay attention to the warnings. There are examples that feed data to gnuplot (you should have that on your system, it's kind of a standard -- if not, go get it and install it from your distribution software archive). That's what you use to make pretty graphs. Be aware that the sample utilities will most likely require a little editing for the path to the log files you've created -- the paths and file names I suggested above and elsewhere may not be what's in those example files and you'll need to twiddle some things to get them going.

At one time, some years ago, I did use the statistics to look at some 250 machines using a central NTP time server (25 Solaris boxes, 4 Linux servers [I did say this was a long time ago] and the rest desktop winders things). To be honest, it was more information than I needed and I turned it off after a month or so -- NTP does work and keeps on working if you get it configured and then just leave it alone and let it do its thing.

One final thing you'll want to do (if you're logging) is rotate the logs (they can grow quite large). Somewhere, hopefully in /etc/logrotate.d, you will want to put this: Code:
cat ntpd Code:
# Empty the log file
It'll take some fiddling and twiddling and reviewing what you get until you're happy with the information (and not, you know, overwhelmed by it). Take it slow, try one thing at a time and you'll get there. Hope this helps some. |
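If the bundled monitoring scripts turn out to be more than is needed, the loopstats file can also be turned into gnuplot input with a few lines of Python. This is a sketch under an assumption: that each loopstats line has the usual layout of modified Julian day, seconds past midnight, offset in seconds, frequency in ppm, followed by further columns that are ignored here. The sample line is made up for illustration and is not data from this thread; the helper names are likewise hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple

# Modified Julian Day 0 is midnight on 17 November 1858.
MJD_EPOCH = datetime(1858, 11, 17)


class LoopSample(NamedTuple):
    when: datetime
    offset_ms: float
    freq_ppm: float


def parse_loopstats_line(line: str) -> LoopSample:
    """Parse one loopstats line (assumed: MJD, secs, offset_s, freq_ppm, ...)."""
    fields = line.split()
    mjd, secs = int(fields[0]), float(fields[1])
    offset_s, freq_ppm = float(fields[2]), float(fields[3])
    when = MJD_EPOCH + timedelta(days=mjd, seconds=secs)
    return LoopSample(when, offset_s * 1000.0, freq_ppm)


def parse_loopstats(text: str) -> List[LoopSample]:
    """Parse every non-blank line of a loopstats file."""
    return [parse_loopstats_line(l) for l in text.splitlines() if l.strip()]


def to_gnuplot_rows(samples: List[LoopSample]) -> str:
    """One row per sample: ISO timestamp, offset in ms, frequency in ppm."""
    return "\n".join(
        f"{s.when:%Y-%m-%dT%H:%M:%S} {s.offset_ms:.3f} {s.freq_ppm:.3f}"
        for s in samples
    )


if __name__ == "__main__":
    # Made-up sample line, for illustration only.
    sample = "51544 0.000 0.001000000 24.032 0.000500000 0.010000 6\n"
    print(to_gnuplot_rows(parse_loopstats(sample)))
```

In gnuplot, rows like these can be read as a time series with set xdata time and set timefmt "%Y-%m-%dT%H:%M:%S", plotting column 2 against column 1 to show the offset settling.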
Thanks again, you're a mine of clear and precious information.
I cannot "leave" it unfortunately! I'll explain why: as mentioned, these are multimedia servers. The NTP.conf is created during boot by the software that runs on the server. When configuring this software, you are allowed to enter only ONE NTP server, which is then embedded in their configuration, which is similar to the one I quoted in my first post - but with only one server. Now, I have found that NTP does not work on these servers, and apparently we now know why. I am working with the manufacturer, which came up with the suggestion I mentioned. I need to show them that my configuration works. To do so, I need to graph the offset, to confirm to them that it eventually settles around zero. I will graph another machine with the configuration the manufacturer suggested and then I'll report to them. Bottom line, I need the manufacturer to fix the configuration in their configuration files! And I need to show them that there is a problem with the configurations they provide.

I have amended the configuration with the statistics you suggested. Should I expect to see something happening in /var/log/ntpstats/ in the short term, or will the file be created after a day? Also, to restart NTP, is /etc/init.d/ntp restart the correct command? Can I stop NTP with /etc/init.d/ntp stop and then restart it using /usr/sbin/ntpd -g? Thanks again for your time. Cheers, Tony |
You have init.d, so that's what you want to use to start, stop and restart the daemon. It's just a file, you can look at it (it's similar to the /etc/rc.d/rc.ntpd on my machines). There should be a start, stop, restart in it and you should be able to
Code:
/etc/init.d/ntp restart Code:
kill -9 PID Code:
ps -ef | grep ntpd Code:
kill -9 1910 1920 Then start the daemon with Code:
/etc/init.d/ntp start
I just love systems that vendors lock in so that you can't do anything with them (it's the Microsoft Click-'n'-Drool school taken from the sublime to the ridiculous, I think). They figure you're too stupid to read and follow directions so they do it to you like it or not. Kind of guys I won't buy anything from, them. Hope this helps some. |
Thanks.
I won't mention what the system is and what it's for but believe me: they're right!! I'll follow your suggestions and come back in a few days' time as usual! cheers, Tony |
Ok, Houston, we've got a question! :)
Question 1: after a day or more there is still no sign of files in /var/log/ntpstats. Question 2: today I logged in to find that all the NTP servers are marked as "reject"; the reach field said "7", then moved to "17" I think, then "3", which I understand is not a good thing. I have monitored the first server and it looks like it's not responding? Please see below. All servers are pingable from the terminal. Code:
ntpq -p Code:
ntpq> rv 52842 Code:
ntpq> ntpq> rv 52842 This is my current configuration Code:
server 127.127.1.0
Thanks! |
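On the reach values: the reach column in ntpq is an 8-bit shift register printed in octal, with the most recent poll in the lowest bit, so 377 means the last eight polls were all answered. A small sketch for decoding it (the function names are made up for illustration):

```python
from typing import List


def decode_reach(reach_octal: str) -> List[bool]:
    """Return the last eight poll results, most recent first."""
    register = int(reach_octal, 8) & 0xFF
    return [bool((register >> bit) & 1) for bit in range(8)]


def answered_count(reach_octal: str) -> int:
    """How many of the last eight polls got a reply."""
    return sum(decode_reach(reach_octal))


if __name__ == "__main__":
    for value in ("7", "17", "3", "377"):
        bits = "".join("1" if ok else "0" for ok in decode_reach(value))
        print(f"reach {value:>3}: {answered_count(value)}/8 answered, "
              f"newest first: {bits}")
```

Read this way, 7 followed by 17 is what the register looks like while it fills up after a restart, whereas a value that keeps dropping back to something small suggests replies are being lost or the daemon keeps restarting.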
OK, I've just turned logging and statistics back on (and restarted NTPD).
Log file: Code:
# Start ntpd: Code:
cat /var/log/ntp.log Code:
# Statistics stuff Code:
ls -l /var/log/ntpstats Code:
cat /var/log/ntpstats/loopstats Code:
ntpq -pn Code:
ntpq -pn
Note that I did not turn on logging in /etc/ntp.conf, I used the -l /var/log/ntp.log in the start up command (the one in /etc/init.d for you, as above). I'll let everything settle down for a while, then change that for the internal logging and see what happens (actually, I know that it'll work, I just don't find that logging interesting and prefer the "command line" option). Actually, I shut logging and statistics off a year or two ago on all servers and don't really remember all that much about either anymore. Sigh. OK, things have settled down: Code:
ntpq -pn
Hope this helps some. |