What should be checked before PostgreSQL failover?

Check replication lag, candidate standby health, fencing or demotion guarantees, client routing, backups, rollback boundaries, and the exact proof that the old primary cannot accept writes.

Is switchover safer than failover?

A planned switchover is usually safer because both sides are reachable and controlled, but it still needs written stop conditions and verification of routing, replication, and data safety.

Does high availability replace PostgreSQL backups?

No. HA protects service continuity for some failures; backups and PITR protect recovery from data corruption, human mistakes, and failures that replicate bad state.

PostgreSQL failover checklist before maintenance

A PostgreSQL failover checklist is most useful before maintenance, not during the five minutes when everyone is staring at replication lag. The point is not to memorize a tool command. The point is to make the failure contract explicit: which node is allowed to be primary, which clients should move, how much data loss is acceptable, when to stop, and how the team will prove that the new state is safe.

1. Name the maintenance goal. Are you patching the primary host, replacing storage, upgrading PostgreSQL, testing a standby, moving traffic to another zone, or proving a disaster-recovery path? Different goals have different acceptable risks. A maintenance window that only needs a restart should not accidentally become a topology migration. Write the intended end state and the rollback state before touching the cluster.

2. Check replication health before the window. Confirm every candidate standby is streaming, current enough for the planned RPO, and not hiding apply delay behind a pleasant dashboard. Look at WAL receiver status, replay lag, replication slots, disk usage for retained WAL, and the timeline history. If a standby is already behind, maintenance is not starting from a safe baseline.
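
One way to turn "current enough for the planned RPO" into a number is to compare the primary's current WAL position with each standby's replay position. The sketch below assumes the rows were already fetched from pg_stat_replication (application_name, state, replay_lsn) and that the head position came from pg_current_wal_lsn() on the primary. The names StandbyRow and standbys_over_budget, and any byte budget you pass in, are illustrative rather than part of any tool.

```python
from dataclasses import dataclass
from typing import Optional


def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass
class StandbyRow:
    # Mirrors a few pg_stat_replication columns; fetching them is out of scope.
    application_name: str
    state: str
    replay_lsn: Optional[str]


def standbys_over_budget(current_lsn, rows, max_lag_bytes):
    """Return (name, reason) for every standby that is not a safe candidate."""
    problems = []
    head = lsn_to_int(current_lsn)
    for row in rows:
        if row.state != "streaming":
            problems.append((row.application_name, f"state is {row.state}"))
            continue
        if row.replay_lsn is None:
            problems.append((row.application_name, "no replay position reported"))
            continue
        lag = head - lsn_to_int(row.replay_lsn)
        if lag > max_lag_bytes:
            problems.append((row.application_name, f"replay lag {lag} bytes"))
    return problems
```

An empty result means every listed standby is streaming and within the budget. It says nothing about standbys that are missing from the view entirely, so compare the row count with the expected topology as well.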

3. Confirm the authority model. In Pacemaker this means quorum, resource constraints, failed actions, location preferences, and STONITH. In Patroni it means DCS health, leader lock ownership, tags, synchronous mode, and REST health checks. In a managed service it means the provider failover contract. The team should know which system is authoritative before any human runs a promote command.

4. Verify fencing or demotion assumptions. A PostgreSQL primary that is merely unreachable is not the same thing as a primary that is safely stopped. If two systems can accept writes, the maintenance window has become a data-integrity incident. Check the fence path, cloud API permissions, watchdog behavior, shutdown command, or provider guarantee that prevents competing primaries.
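
The difference between "unreachable" and "safely stopped" can be made explicit with a three-state model. This is a minimal sketch: in_recovery stands for the result of SELECT pg_is_in_recovery() on the old primary, with None meaning the node did not answer, and fence_confirmed is a hypothetical flag the operator sets only after the fence mechanism itself reported success.

```python
from enum import Enum
from typing import Optional


class WriterState(Enum):
    WRITABLE = "writable"    # answered, pg_is_in_recovery() returned false
    READ_ONLY = "read_only"  # answered, pg_is_in_recovery() returned true
    UNKNOWN = "unknown"      # no answer; this is NOT proof that it is stopped


def classify(in_recovery: Optional[bool]) -> WriterState:
    if in_recovery is None:
        return WriterState.UNKNOWN
    return WriterState.READ_ONLY if in_recovery else WriterState.WRITABLE


def safe_to_promote(old_primary_in_recovery: Optional[bool],
                    fence_confirmed: bool) -> bool:
    """Decide whether promoting a standby risks two writable primaries."""
    state = classify(old_primary_in_recovery)
    if state is WriterState.WRITABLE:
        # The node is answering as a primary, so any claimed fence did not hold.
        return False
    if state is WriterState.UNKNOWN:
        # Silence alone proves nothing; require independent fence evidence.
        return fence_confirmed
    return True
```

The design choice is that silence never counts as evidence: an unreachable node is only treated as safe when something other than the failed connection confirms it.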

5. Freeze unrelated changes. A failover test should not overlap with a deploy, schema migration, cache change, worker rollout, or batch backfill unless that interaction is the thing being tested. Keep the blast radius small. If the failover behaves strangely, the team should not have to ask whether a separate product release changed the workload at the same time.

6. Prepare client routing. Know which clients use a virtual IP, DNS name, proxy, connection pooler, service discovery record, or hard-coded host. Check TTLs, pooler health checks, read/write split behavior, and application retry policy. Failover is not complete when PostgreSQL accepts writes; it is complete when the correct clients are writing to the correct place.
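
A routing inventory is easier to review when it is data rather than memory. The sketch below assumes the team maintains such an inventory by hand. ClientRoute, the kind labels, and max_cutover_seconds are illustrative names, not fields from any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClientRoute:
    client: str
    kind: str  # "vip", "dns", "proxy", "pooler", "discovery", or "hardcoded"
    ttl_seconds: Optional[int] = None  # only meaningful when kind == "dns"


def routing_risks(routes, max_cutover_seconds):
    """List clients that are unlikely to follow the new primary in time."""
    risks = []
    for route in routes:
        if route.kind == "hardcoded":
            risks.append(f"{route.client}: hard-coded host will not follow the new primary")
        elif route.kind == "dns":
            if route.ttl_seconds is None:
                risks.append(f"{route.client}: DNS TTL unknown")
            elif route.ttl_seconds > max_cutover_seconds:
                risks.append(
                    f"{route.client}: DNS TTL {route.ttl_seconds}s exceeds "
                    f"{max_cutover_seconds}s budget"
                )
    return risks
```

This only covers the two failure modes that can be read off an inventory. Pooler health checks and application retry policy still need to be exercised against a real cutover.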

7. Define stop conditions. Stop if replication lag exceeds the written budget, if the fence path fails, if the standby is missing required extensions, if backups are not current, if the operator cannot verify the current primary, or if a health check disagrees with direct database evidence. Stop conditions protect the team from turning a controlled maintenance task into improvisation.
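
Written stop conditions can also be checked mechanically. The following sketch encodes the conditions listed above as a single function. The Preflight fields and thresholds are assumptions chosen for illustration; the values themselves would come from whatever checks the team already runs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Preflight:
    replay_lag_bytes: int
    lag_budget_bytes: int
    fence_path_ok: bool
    backup_age_hours: float
    max_backup_age_hours: float
    primary_per_health_check: Optional[str]
    primary_per_database: Optional[str]  # None when it could not be verified
    missing_extensions: List[str] = field(default_factory=list)


def stop_reasons(p: Preflight) -> List[str]:
    """Return every reason to stop; an empty list means proceed."""
    reasons = []
    if p.replay_lag_bytes > p.lag_budget_bytes:
        reasons.append("replication lag exceeds the written budget")
    if not p.fence_path_ok:
        reasons.append("fence path failed")
    if p.missing_extensions:
        names = ", ".join(sorted(p.missing_extensions))
        reasons.append(f"standby is missing extensions: {names}")
    if p.backup_age_hours > p.max_backup_age_hours:
        reasons.append("backups are not current")
    if p.primary_per_database is None:
        reasons.append("current primary could not be verified")
    elif p.primary_per_health_check != p.primary_per_database:
        reasons.append("health check disagrees with direct database evidence")
    return reasons
```

Returning all reasons rather than the first one keeps the record complete when several conditions fail at once.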

8. Take a fresh backup or verify a recent one. High availability is not backup. A failover can preserve service while preserving the wrong data. Before planned maintenance, confirm PITR status, WAL archive continuity, backup age, restore credentials, and the location where restore would actually happen. If nobody has recently restored from the backup, it is a belief, not evidence.
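
WAL archive continuity can be spot-checked from file names alone. The sketch below assumes an uncompressed archive listing and the default 16 MiB wal_segment_size, under which each 8-digit "log" part of the name holds 256 segments. With a different segment size or with compressed file names the constants and the pattern would need adjusting.

```python
import re

# 4 GiB per log file id / 16 MiB per segment; valid only for the default size.
SEGMENTS_PER_LOG = 0x100
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def segment_number(name):
    """Return (timeline, sequential segment number) for a WAL file name."""
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return timeline, log * SEGMENTS_PER_LOG + seg


def find_gaps(names):
    """Return (timeline, first_missing, last_missing) for each hole found."""
    by_timeline = {}
    for name in names:
        if not WAL_NAME.match(name):
            continue  # skips .history, .backup, .partial and unrelated files
        timeline, number = segment_number(name)
        by_timeline.setdefault(timeline, []).append(number)
    gaps = []
    for timeline, numbers in sorted(by_timeline.items()):
        numbers.sort()
        for prev, cur in zip(numbers, numbers[1:]):
            if cur - prev > 1:
                gaps.append((timeline, prev + 1, cur - 1))
    return gaps
```

A clean result only shows that the names are contiguous per timeline. It does not replace an actual test restore.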

9. Run the failover command from the runbook, not from memory. The runbook should include exact commands, expected output, owner, rollback condition, and verification steps. If the command differs for planned switchover and emergency failover, make that distinction visible. Maintenance is where you discover whether the document still matches production.

10. Prove the new primary. Check timeline, read-write status, synchronous replication configuration, replication slots, extension availability, logical replication if used, and a safe application-level write. Then prove the old primary cannot still accept writes. This second proof is easy to skip and is often the difference between recovery and split-brain.
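
The two proofs in this step can be written as one check over direct observations. In this sketch, in_recovery is the result of SELECT pg_is_in_recovery() on each node (None when the node did not answer) and timeline is the timeline ID observed on it, for example from the first eight hex digits of pg_walfile_name(pg_current_wal_lsn()) on a primary. NodeObservation and promotion_failures are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeObservation:
    name: str
    in_recovery: Optional[bool]  # None when the node could not be reached
    timeline: Optional[int]


def promotion_failures(new_primary: NodeObservation,
                       old_primary: NodeObservation,
                       timeline_before: int) -> List[str]:
    """Return every unmet claim; an empty list means both proofs hold."""
    failures = []
    if new_primary.in_recovery is not False:
        failures.append(f"{new_primary.name} is not confirmed read-write")
    if new_primary.timeline is None or new_primary.timeline <= timeline_before:
        failures.append(
            f"{new_primary.name} timeline did not advance past {timeline_before}"
        )
    if old_primary.in_recovery is False:
        failures.append(f"{old_primary.name} still accepts writes")
    elif old_primary.in_recovery is None:
        failures.append(
            f"{old_primary.name} is unreachable; confirm the fence separately"
        )
    return failures
```

The timeline comparison relies on promotion starting a new timeline, so timeline_before should be captured from the old primary before the window opens.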

11. Watch the application, not only the database. After routing moves, inspect p95 and p99 latency, error rate, connection churn, pool saturation, queue age, and write success for the critical path. A database can look healthy while clients are reconnecting badly or while a pooler is pinning traffic to the old address.

12. Keep the rollback plan honest. Rollback may mean moving traffic back, rejoining the old primary as a replica, rebuilding it from backup, or explicitly not rolling back because the new timeline is authoritative. Do not decide this after the fact. Write the rollback boundary before the failover, especially when timelines, slots, or data writes make reversal expensive.

13. Record evidence. Capture timestamps, command output, lag before and after, routing changes, alerts fired, and recovery proof. This is not ceremony. It lets the next maintenance window start from knowledge rather than folklore. If the checklist found a surprise, open follow-up work while the evidence is still fresh.
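
Evidence capture does not need special tooling. A minimal sketch, assuming one JSON object per line is an acceptable log format for the team: record is a hypothetical helper that writes a timestamped entry to any writable text stream, such as an open file.

```python
import json
from datetime import datetime, timezone


def record(stream, step, command, output, now=None):
    """Append one timestamped evidence entry as a JSON line."""
    entry = {
        "at": (now or datetime.now(timezone.utc)).isoformat(),
        "step": step,
        "command": command,
        "output": output,
    }
    stream.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry
```

Usage could look like record(log_file, "lag before", "SELECT ... FROM pg_stat_replication", rows_as_text), called once before and once after the switch so the two entries can be compared later.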

A good PostgreSQL failover checklist is boring. It reduces a tense operation to a sequence of observable claims: this node is safe, this standby is current, this authority is active, these clients moved, this old writer is fenced, this recovery signal proves success. Boring is the goal. Boring maintenance is how high availability stays high availability instead of becoming a midnight story.

14. Check application read behavior after the switch. Many PostgreSQL deployments route reads to replicas, and that read path can stay wrong even when writes move correctly. Confirm which upstream each replica follows, lag limits, read-only endpoints, pooler rules, and any reporting jobs that assume a specific host. A failover that restores writes but serves stale reads can still damage the product, especially when users expect immediate confirmation after a transaction.
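
A read-path check can be sketched as a filter over replica status. The inputs are assumed to be collected elsewhere: upstream_host could come from sender_host in pg_stat_wal_receiver (PostgreSQL 11 or later), and replay_delay_seconds from now() minus pg_last_xact_replay_timestamp(), which also grows when the primary is simply idle. ReplicaStatus and readable_replicas are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicaStatus:
    name: str
    upstream_host: Optional[str]          # None when no WAL receiver is active
    replay_delay_seconds: Optional[float]


def readable_replicas(replicas, new_primary_host, max_delay_seconds):
    """Split replicas into those safe for reads and those rejected, with reasons."""
    accepted, rejected = [], []
    for replica in replicas:
        if replica.upstream_host != new_primary_host:
            rejected.append((replica.name, "not following the new primary"))
        elif replica.replay_delay_seconds is None:
            rejected.append((replica.name, "replay delay unknown"))
        elif replica.replay_delay_seconds > max_delay_seconds:
            rejected.append((replica.name, "replay delay over limit"))
        else:
            accepted.append(replica.name)
    return accepted, rejected
```

If the accepted list is empty, the read endpoint needs a deliberate decision: serve reads from the new primary for a while, or accept stale reads knowingly.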

15. Verify scheduled jobs and maintenance tasks. Cron jobs, backup jobs, vacuum tuning, logical replication workers, exports, and analytics pipelines may be pinned to the old primary or to a hostname that no longer means what it did. After failover, list the jobs that should run once, the jobs that should pause, and the jobs that must be moved deliberately. Duplicate background work is a common post-failover surprise.
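
Finding jobs pinned to the old host can start with a plain text scan. This sketch assumes the job definitions are available as crontab-style text. It only finds literal host names, so jobs that reach the database through environment variables or service files need a separate look.

```python
import re


def jobs_pinned_to(host, crontab_text):
    """Return (line_number, line) for active entries that mention the host."""
    # Avoid matching 'db1' inside 'db10' or 'olddb1'.
    pattern = re.compile(r"(?<![\w-])" + re.escape(host) + r"(?![\w-])")
    hits = []
    for number, line in enumerate(crontab_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are not active jobs
        if pattern.search(stripped):
            hits.append((number, stripped))
    return hits
```

Each hit then goes into one of the three lists named above: run once, pause, or move deliberately.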

16. Watch the old primary rejoin path. The riskiest moment after a planned failover is often when the former primary returns. It may need pg_rewind, rebuild from backup, manual slot cleanup, or explicit demotion. Do not let automation reintroduce it as a writer because it has the familiar hostname. Rejoin should be a separate checklist with its own proof of safety.
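
The choice between rejoin paths can be written down as a small decision function. This is a sketch under stated assumptions: pg_rewind needs the old primary to have run with wal_log_hints enabled or with data checksums, and with full_page_writes on. Whether the old primary diverged (wrote WAL the new primary never received) is an input the operator must establish. The returned strings are labels for the runbook, not commands.

```python
def rejoin_plan(stopped: bool,
                diverged: bool,
                wal_log_hints: bool,
                data_checksums: bool,
                full_page_writes: bool) -> str:
    """Pick a rejoin path for the former primary from observed facts."""
    if not stopped:
        # Never rewire a node that may still be accepting writes.
        return "stop and fence the old primary first"
    if not diverged:
        return "reconfigure as standby of the new primary"
    if full_page_writes and (wal_log_hints or data_checksums):
        return "pg_rewind, then start as standby"
    return "rebuild from a fresh base backup"
```

Keeping the "stop first" branch at the top reflects the point of this step: no rejoin path is evaluated until the old writer is known to be down.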

17. Close the loop with one customer-level transaction. Synthetic database checks are necessary but not sufficient. Run a safe application transaction through the normal routing layer, verify it reaches the new primary, verify the user-facing read path sees it correctly, and verify monitoring names the new topology. The system is recovered only when the product path agrees with the database evidence.

18. Check monitoring names after the topology changes. Alerts, dashboards, service ownership labels, and runbook links often assume the old primary hostname. If the new primary is healthy but the alert still points responders to the former node, the next incident will start with confusion. Update the operational view during the same window, not weeks later.

19. Decide who can call the window complete. The database operator, application owner, and incident lead may need different evidence. Completion should require database health, application routing, customer path verification, and a written note about anything deferred. Without a named closer, teams drift away from the window while hidden follow-up remains unfinished.

FAQ