[LEAPSECS] Leap seconds ain't broken, but most implementations are broken

Thu Jan 5 09:00:37 EST 2017

Tony Finch wrote:
> Martin Burnicki <martin.burnicki at burnicki.net> wrote:
>>
>> Please note that NTP servers not necessarily need to be providers for
>> leap second files. There are some well known sites which provide this
>> file, and the NTP software package from ntp.org comes with a script
>> which can be used to update the file automatically.
> 
> I was thinking more that an NTP client or server would use its leapseconds
> file for validating LI bits from servers and for determining when they
> should leap.

In fact, the current implementation of ntpd uses the leap second file
exclusively if one has been configured, is available, and has not expired.
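
To illustrate the "available and not expired" check, here is a rough
Python sketch. It is not ntpd's code. It assumes the usual layout of a
leap-seconds.list file: a "#@" line carrying the expiry time, and data
lines carrying a timestamp and a TAI offset, with both times given in
seconds since 1900. The sample text and the function names are made up
for this example.

```python
from datetime import datetime, timezone

NTP_TO_UNIX = 2208988800  # seconds between 1900-01-01 and 1970-01-01

# Sample text in the style of leap-seconds.list (for illustration only).
SAMPLE = """\
#@	3707596800
3644697600	36	# 1 Jul 2015
3692217600	37	# 1 Jan 2017
"""


def ntp_to_datetime(ntp_seconds):
    """Convert seconds since 1900 to an aware UTC datetime."""
    return datetime.fromtimestamp(ntp_seconds - NTP_TO_UNIX, tz=timezone.utc)


def parse_leap_file(text):
    """Return (expiry, entries); entries are (effective_time, tai_offset)."""
    expiry = None
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#@"):
            # Expiry line: "#@" followed by a timestamp.
            expiry = ntp_to_datetime(int(line[2:].split()[0]))
        elif line and not line.startswith("#"):
            # Data line: timestamp, TAI offset, optional comment.
            fields = line.split()
            entries.append((ntp_to_datetime(int(fields[0])), int(fields[1])))
    return expiry, entries


def file_is_usable(expiry, now):
    """A file without an expiry line, or past its expiry, is not trusted."""
    return expiry is not None and now < expiry
```

With the sample text, file_is_usable() is true for a date in January
2017 and false for a date in July 2017.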

However, as I've pointed out earlier, this helps *only* to get the leap
second itself right.

If upstream servers aren't aware of the leap second and don't insert it,
then our server will see that its time is off by 1 sec after it has
(correctly) inserted the leap second. So a few minutes after the leap
second our server will step its time to match the time of its upstream
servers, and thus effectively undo the leap second it has inserted.

That's why a leap second file is only of limited help in cases where
other time sources don't get the leap second right.
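
The effect can be shown with a toy model in Python. This is only a
sketch under simplifying assumptions: each clock is reduced to its error
relative to UTC, a clock that missed the inserted leap second is taken
to run 1 s ahead, and the 0.128 s step threshold is an assumed value
chosen for this example. The function names are hypothetical.

```python
STEP_THRESHOLD = 0.128  # seconds; assumed value for this sketch


def errors_after_leap(we_inserted, upstream_inserted):
    """Clock errors relative to UTC just after an inserted leap second.

    A clock that did not insert the leap second runs 1 s ahead of UTC.
    """
    ours = 0.0 if we_inserted else 1.0
    upstream = 0.0 if upstream_inserted else 1.0
    return ours, upstream


def discipline(ours, upstream):
    """Follow the upstream clock: step if the offset is too large."""
    offset = upstream - ours
    if abs(offset) > STEP_THRESHOLD:
        return ours + offset, "step"
    return ours, "ok"


# Our server used its leap second file, the upstream missed the leap second.
ours, upstream = errors_after_leap(we_inserted=True, upstream_inserted=False)
ours_later, action = discipline(ours, upstream)
print(ours, upstream)      # our clock is right, upstream is 1 s ahead
print(ours_later, action)  # after following upstream we are 1 s ahead, too
```

In this model our server ends up with the same 1 s error as its upstream
even though it handled the leap second itself correctly.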

> My thinking is that routine software patching and security updates happen
> often enough that maybe NTP can get leap second more reliably out-of-band
> instead of using in-band leap indicators from upstream servers.
> Lower-stratum devices could use their own leap second information to
> correct for operational or implementation errors upstream.
> 
>> The potential approach with tzdist or special DNS allowed for a
>> distributed system, where the special DNS can only provide leap second
>> warning and the current TAI offset, while tzdist also provides the leap
>> second history, and a way to update time zone rules, so it could be
>> generally used to keep also conversion to local time correct.
> 
> Oh, I forgot about the DNS publication scheme. That would also help a lot
> if it were implemented. And maybe better than relying on sufficiently
> frequent software updates.

Yes. Once time client software has been deployed that can use the
special DNS scheme (or tzdist), it continues to work as long as the
servers stay reachable, without having to apply software updates (think
of embedded devices).

>> Comparing to your example with DNS: If a root server has a software bug
>> which lets it deliver a wrong IP address, how should your local DNS
>> resolver detect this?
> 
> My analogy was more along the lines of, when a root server IP address
> changes, the DNS server notices and logs that it has out of date hints.
> It's not a great analogy though, because if the DNS server has the wrong
> data about one root server, it can recover using the other 12, but if an
> NTP server has wrong data about the next leap second, it's screwed.

Again, I don't find this comparison quite appropriate. You write:

> if the DNS server has the wrong data about one root server, it can
> recover using the other 12

Similarly, you can configure an NTP server to use several upstream NTP
servers, so if one of them fails, our server can discard it and use the
remaining ones.

IMO a stratum 1 NTP server is more comparable to a root DNS server: if
it fails because its refclock fails, then this server is wrong, but
secondary servers can detect this and ignore it if there's sufficient
redundancy, as with the 13 root DNS servers.
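
A minimal Python sketch of why redundancy helps. This is deliberately
not NTP's real selection algorithm. It simply compares each upstream's
measured offset with the median and treats outliers as false tickers.
The server names, the offsets and the 0.1 s tolerance are made up for
this example.

```python
from statistics import median


def split_sources(offsets, tolerance=0.1):
    """Split upstream servers into (agreeing, outliers) around the median.

    offsets maps a server name to its measured offset in seconds.
    """
    mid = median(offsets.values())
    good = {name for name, off in offsets.items() if abs(off - mid) <= tolerance}
    bad = set(offsets) - good
    return good, bad


# One upstream missed the leap second: it is outvoted and discarded.
print(split_sources({"a": 0.002, "b": -0.001, "c": 1.0005, "d": 0.0}))

# The majority missed the leap second: now the correct server is discarded.
print(split_sources({"a": 1.0, "b": 1.001, "c": 0.0}))
```

The second call shows the limit of this approach: if most of the
configured upstream servers are wrong, the one correct server is the one
that gets rejected.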

>>> I wonder if it would be better to set the leap indicator bits to NOSYNC if
>>> the configured leap seconds file has expired.
>>
>> Sounds good at the first glance, but I think this would cause much bad
>> surprise if you have a company network and suddenly all NTP clients stop
>> to be synchronized.
> 
> Well, at that point (end of June or December) they don't know if there
> should be a leap second or not, so they can't reliably tell the right
> time.

If those servers stop providing time, the clients don't have the right
time either. On top of that, the clients would start to drift apart.
That's not what you want.

> Maybe they should fall back to relying on leap indicators from upstream,
> but they need some way to make it obvious they might fail.

They do this if the leap second file has expired, at least since ntpd 4.2.8.
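
The decision described in this thread can be sketched in Python as
follows. This is an illustration of the logic as discussed here, not
code taken from ntpd: a usable leap second file wins, otherwise a leap
second warning is accepted only if a majority of the upstream servers
announce it. The function name and return values are hypothetical.

```python
def leap_pending(file_usable, file_says_leap, upstream_warnings):
    """Decide whether a leap second is scheduled, and name the source.

    upstream_warnings holds one boolean per upstream server: True if that
    server currently announces a leap second via its leap indicator.
    """
    if file_usable:
        # A configured, available, unexpired file is used exclusively.
        return file_says_leap, "leapfile"
    # Fall back to the leap indicators, requiring a strict majority.
    votes = sum(1 for warning in upstream_warnings if warning)
    return votes * 2 > len(upstream_warnings), "upstream"


# Valid file announces a leap second, no upstream does: the file wins.
print(leap_pending(True, True, [False, False, False]))

# Expired file: two of three upstream servers announce the leap second.
print(leap_pending(False, False, [True, True, False]))
```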

>> The basic problem is more with a stratum-1 server which in many cases
>> gets its time only from a GPS receiver. If the GPS receiver provides
>> faulty leap second information then the NTP server can hardly detect
>> this. Even if it has a current leap second file this wouldn't work.
> 
> I don't understand. If it has a current leap second file, it can use that
> to detect that its GPS receiver screwed up the leap second. It should then
> go NOSYNC.

Yes, and after the leap second it would synchronize again to the faulty
GPS receiver. This would happen anyway if you restarted the NTP daemon.
So your server again has a time which is 1 s off. See my explanation
above.

It's not only important to get the leap second right, but also to keep
the right time continuously. You can't do this by hacking the NTP daemon
software. You need to fix the GPS receiver to achieve continuous
reliable operation.

>> For a pure client there should be no problem if the client has several
>> upstream servers configured. Before the leap second, the NTP daemon
>> accepts a leap second warning only if a majority of the configured
>> upstream servers provide this warning. However, the time from the faulty
>> server is still correct. Otherwise it wouldn't have been classified as
>> good candidate.
>>
>> When the leap second occurs then all upstream servers as well as our
>> server insert the leap second, but faulty servers don't. So the faulty
>> servers which haven't done the leap second are off by 1 s afterwards and
>> are *then* classified as false tickers.
>>
>> Of course this also doesn't work correctly if the *majority* of the
>> configured upstream servers get the leap second wrong, but in the past
>> we have seen that fortunately most public servers get it right.
> 
> That seems fairly reasonable, but maybe you could make it more reliable
> using a leap seconds file to deal with cases when too many of the
> upstreams are wrong.

Nope. See my explanation above. It will get the leap second itself
right, but the time will go wrong a few minutes after the leap second.

Martin