[LEAPSECS] Future Leap Seconds

Tue Jan 27 18:36:40 EST 2015

> On Jan 27, 2015, at 4:15 PM, Peter Vince <petervince1952 at gmail.com> wrote:
> 
> 
> On 27 January 2015 at 23:03, Warner Losh <imp at bsdimp.com> wrote:
> ...
> > A “cold spare” that’s sitting on the shelf powered off for more than 6 months. When
> > the original fails, the cold spare is returned to service and must wait an additional ~30
> > minutes to get the latest almanac before it can recover UTC time from GPS time.
> > This 30 minutes of down time puts the system at < 4 9’s of reliability, and is an
> > unacceptable delay. ...
> 
> That assumes you can instantly get the cold unit off the shelf and plugged in.  I suspect that actually doing that will, in most cases, take well over half an hour, and hence there goes your 4 9's, regardless of how quickly the unit can give an accurate output.

The time to detect a full failure of the system is measured in tens of seconds.
The time to dispatch remote hands to the rack with the system in it with a trip by the
cold spares room is likewise measured in high single digit minutes. The time to power
down the chassis, and swap out the failed system and boot the new system is measured
again in single digit minutes. Time from failure to reboot can be less than 10 minutes
in most cases (though not all). Having to wait the extra 30 minutes for the almanac
thus quadrupled the outage time. The system had a number of redundant elements
to it, and the ability to signal when it wasn’t running in fully redundant mode, but the
time spent on ‘not fully redundant’ was counted against the reliability goals. And since
the redundant elements weren’t supposed to know about each other, asking the redundant
element for leap information wasn’t an option, though we did find a way to store the data
in multiple places so that often (but not always) we could recover it and not pay the
30 minute penalty w/o forcing operator intervention. Since this was a military installation
as well, that added its own set of arbitrary complications which were better worked
around in software. And since the military was involved, much analysis was done on
worst case scenarios, rather than typical scenarios which would have made the
problem simpler.
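
The arithmetic behind "quadrupled" can be sketched as below. The individual step
durations are assumed values picked from the ranges given above (they are not
measured figures), and the 365-day availability window is likewise an assumption,
since the measurement period isn't stated here.

```python
# Rough outage arithmetic. All step durations are ASSUMED values chosen
# from the ranges described in the text, not measurements.
DETECT_MIN = 0.5         # "tens of seconds" to detect the failure (assumed 30 s)
DISPATCH_MIN = 7.0       # "high single digit minutes" for remote hands (assumed)
SWAP_MIN = 2.5           # power down, swap, boot (assumed)
ALMANAC_WAIT_MIN = 30.0  # "~30 minutes" waiting for the almanac


def outage_minutes(wait_for_almanac: bool) -> float:
    """Total minutes from failure until UTC is available again."""
    base = DETECT_MIN + DISPATCH_MIN + SWAP_MIN
    return base + (ALMANAC_WAIT_MIN if wait_for_almanac else 0.0)


def allowed_downtime_minutes(nines: int, period_days: float = 365.0) -> float:
    """Downtime budget for N nines over an ASSUMED measurement window."""
    unavailability = 10.0 ** (-nines)
    return period_days * 24 * 60 * unavailability


if __name__ == "__main__":
    fast = outage_minutes(wait_for_almanac=False)
    slow = outage_minutes(wait_for_almanac=True)
    print(f"without almanac wait: {fast:.1f} min")
    print(f"with almanac wait:    {slow:.1f} min ({slow / fast:.1f}x)")
    print(f"4 nines over 365 days: {allowed_downtime_minutes(4):.2f} min")
```

With these assumed figures the swap alone takes 10 minutes and the almanac wait
brings it to 40, which is the factor of four. Time spent in 'not fully redundant'
mode counted against the same budget, so the real margin was tighter than a single
outage suggests.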

And yes, I’ve actually deployed systems like this, and argued for the simplifications
and information sharing that would obviate the need for an almanac download. While
clever and all that, they didn’t survive a worst case scenario analysis. It was utterly
stupid, but also utterly avoidable if you had 5 years of leap seconds available.
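
To make the "leap seconds available" point concrete, here is a minimal sketch of
recovering UTC from GPS time with a locally stored table instead of waiting for the
almanac. The table layout and the function name gps_to_utc are hypothetical and are
not what the deployed system did. The offsets listed are the published GPS-UTC
values; a real table would carry the whole history and be loaded ahead of time.

```python
from datetime import datetime, timedelta, timezone

# GPS time started at 1980-01-06 00:00:00 UTC and does not insert leap seconds.
GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)

# Hypothetical pre-loaded table: (UTC instant the offset takes effect,
# GPS-UTC in seconds). Only a few entries are shown for illustration.
LEAP_TABLE = [
    (datetime(2012, 7, 1, tzinfo=timezone.utc), 16),
    (datetime(2015, 7, 1, tzinfo=timezone.utc), 17),
    (datetime(2017, 1, 1, tzinfo=timezone.utc), 18),
]


def gps_to_utc(gps_seconds: float) -> datetime:
    """Convert seconds since the GPS epoch to UTC using the stored table.

    Simplification: datetime cannot represent 23:59:60, so the leap
    second itself is not handled correctly by this sketch.
    """
    gps_as_datetime = GPS_EPOCH + timedelta(seconds=gps_seconds)
    offset = None
    for effective_from, gps_minus_utc in LEAP_TABLE:
        # An entry applies once the resulting UTC value reaches its start.
        if gps_as_datetime - timedelta(seconds=gps_minus_utc) >= effective_from:
            offset = gps_minus_utc
    if offset is None:
        raise ValueError("GPS time predates the stored leap second table")
    return gps_as_datetime - timedelta(seconds=offset)
```

The catch is the one described above: the table only helps if it is known to cover
the period the spare sits on the shelf.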

To make matters worse, this was the primary timing system for a well known navigation
system, so any outage was amplified well in excess of its actual effect, not to mention
requirements for data logging that were fundamentally at odds with fast restart.

This is a real-life example of the law of unintended consequences and well-meaning
good ideas...

Warner