Cluster failover, 14-month clean run

The service already had a standby, a cluster manager, and a maintenance process. What it did not have was a clear contract for the uncertain seconds between a primary becoming unhealthy and the standby being allowed to serve traffic. During planned work, that uncertainty occasionally became visible to users.

We reviewed the health checks, fencing path, quorum behavior, client reconnect settings, and operator commands used during maintenance. The strongest finding was that several checks proved the process existed but not that the service could still do useful work. Failover was technically present, but the cluster waited too long to believe the primary was lost.

The solution changed timing, not the whole platform. Health checks moved closer to user-visible behavior, fencing evidence became mandatory before promotion, and drills covered hung processes and network partitions instead of only graceful restarts. After the change, maintenance failovers ran cleanly for fourteen months without a user-visible event.