Glossary for high-availability and production engineering
Plain-English definitions for terms that appear in incident reviews, cluster design, observability work, and high-load architecture decisions.
SPOF
A single point of failure (SPOF) is one part of a system that can take the whole service down when it fails.
Fencing
Fencing is the act of cutting a broken or uncertain node off before another node takes over its work.
STONITH
STONITH (short for "Shoot The Other Node In The Head") is a blunt cluster safety mechanism that powers off or isolates a suspect node.
Split-brain
Split-brain happens when parts of a system lose contact and more than one side thinks it is in charge.
Quorum
Quorum is the rule that says a cluster may act only when enough trusted members agree.
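
A minimal sketch of a quorum check, assuming "enough" means a strict majority of the configured cluster size. Real clusters may use weighted votes or a tie-breaker, and the function name `has_quorum` is hypothetical, chosen only for illustration.

```python
def has_quorum(votes_received: int, cluster_size: int) -> bool:
    """Return True when a strict majority of the cluster agrees.

    Assumption: every member carries one vote and cluster_size is the
    configured membership, not the number of nodes currently reachable.
    """
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    return votes_received > cluster_size // 2


# A five-node cluster partitioned 3/2: only the larger side may act,
# so the two sides cannot both believe they are in charge.
larger_side_may_act = has_quorum(3, 5)
smaller_side_may_act = has_quorum(2, 5)
```

Under a strict-majority rule, an even split (for example 2 of 4) leaves neither side with quorum, which is one reason odd cluster sizes are commonly preferred.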
Failover
Failover is moving service from a failed or unhealthy component to a prepared replacement.
Load shedding
Load shedding is intentionally refusing or delaying lower-value work so the important path stays usable.
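
One way load shedding could look in an admission check is sketched below. The `Request` type, the priority scale, and the 80% threshold are all assumptions made for illustration; a production system would tune these values and might queue or delay work instead of rejecting it outright.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    priority: int  # hypothetical scale: 0 = critical, larger = less important


def admit(request: Request, in_flight: int, capacity: int,
          shed_threshold: float = 0.8) -> bool:
    """Decide whether to accept a request given the current load.

    Below the threshold everything is admitted. Between the threshold
    and full capacity only critical work is admitted. At or above full
    capacity nothing is admitted.
    """
    if in_flight >= capacity:
        return False
    utilisation = in_flight / capacity
    if utilisation >= shed_threshold:
        return request.priority == 0
    return True


# At 90% utilisation the critical request is kept and the optional one is shed.
keep_checkout = admit(Request("checkout", 0), in_flight=90, capacity=100)
keep_recommendations = admit(Request("recommendations", 2), in_flight=90, capacity=100)
```

The point of the threshold is to start refusing low-value work before the service is saturated, so the important path still has headroom.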
Recording rule
A recording rule saves the result of a metrics query so dashboards and alerts can read it cheaply later.