The incident started with a boring graph. One host went down for planned maintenance, the standby took over, and the service stayed available. No customer saw a full outage. On paper, that sounds like a win.
Then we looked at the trace. The failover had waited 480 ms longer than the design allowed. Not 480 ms of application work. Not 480 ms of database recovery. Just waiting. A quiet pause between the moment the primary became useless and the moment the rest of the system admitted it.
That kind of delay is easy to ignore because it does not always break the page. It shows up as a small shelf in tail latency, a handful of retries, maybe one noisy alert that clears itself. Everybody moves on. Six months later the same pause combines with a slow disk, a noisy neighbor, and a backup job, and now the standby looks broken too.
The cluster manager was doing exactly what it had been told to do. That was the first uncomfortable fact. The timers were conservative because years earlier someone had been afraid of false positives. The health check had three layers because each layer had once caught a different failure. The fence device was reachable, but not fast. Every single choice had a reason. Together, they made the system slow to believe what was already true.
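To make the "every choice had a reason" problem concrete, here is a minimal sketch of how individually sensible settings stack into one long wait. Everything in it is an assumption for illustration: the `HealthLayer` type, the layer names, the idea that the layers escalate in series, and all of the numbers, including the budget. None of these are the values from this incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthLayer:
    name: str
    interval_ms: int      # time between probe starts
    timeout_ms: int       # how long a single probe may hang before it fails
    failures_needed: int  # consecutive failed probes before the layer says "dead"

    def worst_case_ms(self) -> int:
        # Worst case: the node dies just after a probe was sent and answered.
        # The layer waits one full interval for the next probe, needs
        # failures_needed failed probes in a row, and only counts the last
        # one once it times out.
        # Assumes timeout_ms <= interval_ms so probes never overlap.
        return self.failures_needed * self.interval_ms + self.timeout_ms


def worst_case_failover_ms(layers, fence_ms):
    # Assumption: layers escalate in series, then fencing runs to completion.
    return sum(layer.worst_case_ms() for layer in layers) + fence_ms


# Hypothetical values, chosen only to show how the pieces add up.
layers = [
    HealthLayer("process", interval_ms=100, timeout_ms=50, failures_needed=2),
    HealthLayer("service", interval_ms=200, timeout_ms=100, failures_needed=2),
    HealthLayer("peer", interval_ms=250, timeout_ms=100, failures_needed=1),
]
budget_ms = 1000  # hypothetical design budget
total_ms = worst_case_failover_ms(layers, fence_ms=150)
overrun_ms = total_ms - budget_ms
```

No single line in that configuration looks wrong on review. The overrun only appears when the layers and the fence latency are added together, which is the calculation nobody had a reason to do.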
We changed less than people expected. The useful work was not replacing the cluster stack. It was writing down the actual failure contract: how long the service may be uncertain, what counts as dead, who is allowed to make that call, and what must be fenced before traffic moves. Once that was explicit, the timer changes were small.
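One way to write such a contract down is as a small data structure that the failover path has to consult. This is only a sketch under stated assumptions: `FailureContract`, `may_move_traffic`, and every condition, decider, and resource name below are hypothetical, not the format we actually used.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FailureContract:
    max_uncertain_ms: int        # how long the service may be uncertain
    dead_when: frozenset[str]    # what counts as dead
    deciders: frozenset[str]     # who is allowed to make that call
    fence_first: frozenset[str]  # what must be fenced before traffic moves


def may_move_traffic(contract, declared_by, observed, fenced):
    # Traffic moves only if an authorized party made the call, at least one
    # agreed "dead" condition was observed, and every required resource is
    # already fenced.
    return (
        declared_by in contract.deciders
        and bool(set(observed) & contract.dead_when)
        and contract.fence_first <= set(fenced)
    )


# Hypothetical contract; the names are placeholders for illustration.
contract = FailureContract(
    max_uncertain_ms=1000,
    dead_when=frozenset({"missed_heartbeats", "storage_stalled"}),
    deciders=frozenset({"cluster_manager"}),
    fence_first=frozenset({"shared_storage", "service_ip"}),
)

ok = may_move_traffic(
    contract,
    declared_by="cluster_manager",
    observed={"storage_stalled"},
    fenced={"shared_storage", "service_ip"},
)
```

The point of the structure is that each question in the contract has exactly one field. A timer change then becomes an edit to `max_uncertain_ms` and whatever derives from it, not an argument about what the system is supposed to guarantee.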
The test that mattered was not a demo failover. Demo failovers lie. They happen when everyone is watching and the system is clean. We ran the boring cases: process hung but PID still alive, storage stalled, primary isolated from peers but still reachable from a monitoring host, standby already under load, operator running the command twice. The 480 ms delay did not survive those tests.
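The boring cases are easier to keep running if they live in a list with one pass/fail rule. The sketch below assumes a hypothetical `run_scenario` callable that injects the named fault and returns the measured ambiguous time in milliseconds. The scenario identifiers, the stub, and the budget are illustrative, not our real harness or results.

```python
from __future__ import annotations

from typing import Callable

# The boring cases from the text, as hypothetical scenario identifiers.
SCENARIOS = [
    "process_hung_pid_alive",
    "storage_stalled",
    "isolated_from_peers_visible_to_monitor",
    "standby_under_load",
    "operator_runs_command_twice",
]


def over_budget(run_scenario: Callable[[str], float], budget_ms: float) -> dict[str, float]:
    # Run every scenario and keep only those whose ambiguous time
    # exceeded the budget.
    results = {name: run_scenario(name) for name in SCENARIOS}
    return {name: ms for name, ms in results.items() if ms > budget_ms}


def fake_run(name: str) -> float:
    # Stub standing in for real fault injection; the numbers are made up.
    return 1400.0 if name == "storage_stalled" else 600.0


failures = over_budget(fake_run, budget_ms=1000.0)
```

With the stub above, `failures` contains only the storage stall. A real harness would replace `fake_run` with something that actually hangs the process or stalls the disk, preferably while the standby is already busy.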
High availability is not the absence of downtime. It is the removal of ambiguous time. The dangerous window is often short enough to hide in averages and long enough to ruin a bad day.