[LEAPSECS] Leapseconds, more evidence
seaman at noao.edu
Mon Jul 2 16:40:21 EDT 2012
Interesting. A few questions spring to mind. Let me preface them by stating that this is far from my area of expertise and I'd be delighted to be educated here.
1) What are the units on the y-axis?
2) This shows the surge continuing for at least half a day after the event with no significant downward jumps from servers being rebooted, fixes being applied (as described in previous emails), or other apparent attempts at addressing the situation. Was nobody monitoring the data center? Weren't alerts sent out to offsite staff? Did they attempt to implement any fixes?
2a) Do we know if any of the data center's customers had prepared their workflows/systems in advance of the leap second? That is, do we know whether the left-hand side of the plot might be lower than normal?
3) There is a shallow decreasing trend after the event. Is this what one would expect from affected servers left on their own? Do the problems resolve themselves eventually, or is operator intervention required?
3a) What happened after this plot was made? Perhaps staff finally arrived and gave it a kick? Did the power ramp back down to normal? Judging from the trend in the plot, power usage would otherwise return to normal around midday tomorrow.
4) The upward jump is very rapid. Hard to tell from the scale, but it appears to be a facility-wide 15% jump in, say, one minute or less. Is this supportable by the power infrastructure onsite or on the local grid (assuming it is on the grid)? Is the plot smoothed in any way?
5) Power usage is a complex figure of merit and only distantly related to server processing load. Might one expect the power usage to exhibit ringing behaviors from the rapid jump? What is the typical mix of power consumption in a data center among CPUs, disks, cooling, etc.? Wouldn't these each have different time constants that a naive viewer (i.e., me) might think would be visible?
6) The small-scale structure remains similar before and after the event. Wouldn't one expect the system loading to interact in some complex way with the actual workflows the data center is in business to serve? One might think the small-scale structure would become either more variable or perhaps even flatten out as CPUs pegged. As the CPUs pegged, wouldn't disk I/O decrease (or at least change in some fashion)?
7) This is a 24-hour plot. I guess the hourly peaks would be some sort of housekeeping workflows, and there surely would be load balancing to squeeze the most out of the data center, but is it normal to otherwise see no diurnal variation? And if there is load balancing, is that the 2-3 day ramp we're seeing after the event? One would think the time constant would be much shorter.
8) Is a typical day flat (outside the small-scale structure) at just about 910000 units? Is there usually a difference between a Saturday night and a Sunday morning?
Provenance would be appreciated. Whatever our positions on the issues, they'll be strengthened by something more reliable than "I'm told that". For instance, what's the mix of host OSes - is there otherwise reason to believe the datacenter was a candidate for the various issues described to date?
I don't suppose this particular datacenter is used by Amadeus? :-)
On Jul 2, 2012, at 11:53 AM, Poul-Henning Kamp wrote:
> I'm told that this is the power usage from one of Hetzner's data centers over the leap-second: