Failover definition - MSMSoft glossary

People often talk about failover as if it were a button. In production it is a contract: what failure is detected, who decides, what must be fenced, what state must move, how clients reconnect, and how long the uncertain window may last. A clean failover is usually the result of many small details being boring at the same time.

A practical example is a primary database whose host freezes during maintenance. Monitoring notices loss of useful work, the cluster confirms the old primary cannot write, the standby promotes, application pools reconnect, and traffic resumes. Failovers go wrong when one of those steps is left implicit. Maybe the health check sees a PID and calls the service healthy. Maybe the DNS TTL is longer than the incident. Maybe clients retry so aggressively that the new primary is overloaded before it warms up.
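
One way to keep those steps explicit is to write the sequence down as code. The sketch below is only an illustration in Python, not any particular cluster manager's API: FailoverPlan, run_failover, and the four hook names are hypothetical, and each hook stands in for a real operation (a query-based health probe, a fencing call, a promotion command, a connection-pool update). The one point it tries to show is ordering: promotion never runs unless fencing has been confirmed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverPlan:
    """Hypothetical hooks; each stands in for a real cluster operation."""
    primary_does_useful_work: Callable[[], bool]  # e.g. a test write, not a PID check
    fence_old_primary: Callable[[], bool]         # True only once writes are impossible
    promote_standby: Callable[[], bool]
    repoint_clients: Callable[[], bool]
    log: List[str] = field(default_factory=list)


def run_failover(plan: FailoverPlan) -> bool:
    """Run the steps in order and stop at the first one that fails."""
    if plan.primary_does_useful_work():
        plan.log.append("primary healthy, no failover")
        return False

    steps = [
        ("fence", plan.fence_old_primary),
        ("promote", plan.promote_standby),
        ("repoint", plan.repoint_clients),
    ]
    for name, step in steps:
        if not step():
            # Stopping here means a failed fence can never lead to a promotion.
            plan.log.append(f"{name} failed, aborting")
            return False
        plan.log.append(f"{name} ok")
    return True
```

If fence_old_primary returns False, run_failover stops and leaves the decision to an operator. That is slower than promoting anyway, but it avoids two primaries accepting writes at the same time.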

Good failover design includes timers, ownership rules, client behavior, rollback expectations, and regular drills that cover ugly failures, not only graceful restarts.
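
Two of those items, timers and client behavior, can be made concrete with a little arithmetic. The Python sketch below rests on simplifying assumptions: the function names, parameters, and numbers are hypothetical, the window model simply adds the stages and assumes clients cut over at the slower of DNS expiry and pool reconnect, and the retry policy shown is capped exponential backoff with full jitter, which is one common choice rather than the only one.

```python
import random


def worst_case_window(check_interval_s, failures_needed, fence_timeout_s,
                      promote_s, dns_ttl_s, pool_reconnect_s):
    """Rough upper bound, in seconds, on how long clients may be without a primary.

    Assumed model: detection needs `failures_needed` consecutive failed checks,
    then fencing and promotion run one after the other, then clients cut over
    at the slower of DNS expiry and pool reconnect.
    """
    detection = check_interval_s * failures_needed
    client_cutover = max(dns_ttl_s, pool_reconnect_s)
    return detection + fence_timeout_s + promote_s + client_cutover


def backoff_delay(attempt, base_s=0.5, cap_s=30.0, rng=random):
    """Capped exponential backoff with full jitter.

    The ceiling doubles with each attempt up to `cap_s`; the actual delay is
    drawn uniformly below it, so clients do not all retry at the same moment.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Hypothetical numbers: 5 s checks, 3 failures to trip, 10 s fence timeout,
# 20 s promotion, 60 s DNS TTL, 5 s pool reconnect.
window = worst_case_window(5, 3, 10, 20, 60, 5)  # 15 + 10 + 20 + 60 = 105
```

With these numbers the DNS TTL accounts for more than half of the window, which is the kind of result a drill should confirm or contradict.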