02 - Service

Pacemaker Corosync consultant for honest failover

A Pacemaker Corosync consultant helps when a cluster exists but the team no longer trusts what it will do under stress. MSMSoft reviews high-availability designs for revenue-critical Linux services, with attention to fencing, quorum, health checks, failover timing, and the operator procedures that decide whether the standby is useful when the primary is not.

We work from the real failure contract: how long uncertainty is allowed, which component is authorized to call a node dead, what must be fenced, and how customers experience the transition. The output is not a decorative architecture diagram; it is a tested failover path and a list of assumptions that deserve operational respect.

Start an engagement

When you need a Pacemaker Corosync consultant

Pacemaker and Corosync are installed, but nobody wants to run a failover drill during business hours.
A node can be unhealthy while still responding to shallow health checks, so traffic moves too late or not at all.
Fencing works in demos but is slow, unreachable, or ambiguous during network partitions.
Split-brain prevention depends on tribal knowledge rather than quorum, resource constraints, and documented procedures.
Maintenance windows produce surprise alerts, manual commands, or one-off exceptions that never make it back into the cluster design.

How we work

Read the current cluster configuration, resource constraints, quorum policy, fence devices, service dependencies, and operator runbooks.
Map realistic failure modes: process hang, storage stall, partial network isolation, overloaded standby, bad health check, and repeated operator action.
Tune the smallest necessary set of checks, timers, ordering rules, colocations, and fencing behavior.
Run controlled drills that include ugly cases, not just the clean demo where everyone is watching.
Document what the cluster guarantees, what it deliberately does not guarantee, and how to recognize ambiguous state.

Selected work

2024

Cluster failover, 14-month clean run

A revenue-critical service had intermittent primary-node failures during maintenance windows.

Reworked health checks, fencing, and failover timing so traffic moved before user-visible failure.

Related field notes

HA: A failover that waited 480 ms longer than planned (6 min)
HA: Pacemaker vs Patroni for PostgreSQL high availability (18 min)
HA: PostgreSQL failover checklist before maintenance (9 min)

Pacemaker Corosync consultant work starts with a simple question: when the primary is bad, who is allowed to believe it? Many clusters fail not because the software is weak, but because the failure contract is vague. Timers are conservative because someone once feared false positives. Health checks are shallow because deep checks caused noise. Fencing is configured but rarely timed. The result is a system that looks highly available in inventory and hesitant in production.

We focus on the dangerous window between failure and agreement. In that window, clients retry, databases accumulate half-open work, operators wonder whether they should intervene, and the standby may already be under load. Our job is to make that window short, explicit, and tested. Sometimes this means changing cluster configuration. Sometimes it means changing service readiness, storage dependencies, monitoring, or the maintenance procedure around the cluster.
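
As a rough illustration of that window, the sketch below adds up three sequential phases: agreeing that the node is gone, confirming the fence, and starting the service on the standby. The `FailoverTimers` type, the phase names, and every number are hypothetical and chosen only for illustration. Nothing here is read from Pacemaker or Corosync, and a real cluster may overlap, retry, or repeat phases in ways this simple sum ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverTimers:
    """Hypothetical worst-case phase durations in milliseconds."""
    detect_ms: int  # time until the cluster agrees the node is gone
    fence_ms: int   # time until the fence action is confirmed
    start_ms: int   # time until the service is started and verified on the standby


def worst_case_window_ms(t: FailoverTimers) -> int:
    """Sum the phases, assuming they run strictly one after another."""
    return t.detect_ms + t.fence_ms + t.start_ms


def overrun_ms(t: FailoverTimers, budget_ms: int) -> int:
    """Return how far the worst case exceeds the budget (0 if it fits)."""
    return max(0, worst_case_window_ms(t) - budget_ms)


def slowest_phase(t: FailoverTimers) -> tuple[str, int]:
    """Name the phase that contributes most to the window."""
    phases = {"detect_ms": t.detect_ms, "fence_ms": t.fence_ms, "start_ms": t.start_ms}
    name = max(phases, key=phases.get)
    return name, phases[name]


if __name__ == "__main__":
    # Illustrative values only, not measurements from any cluster.
    timers = FailoverTimers(detect_ms=3000, fence_ms=10000, start_ms=5000)
    print(worst_case_window_ms(timers))          # 18000
    print(overrun_ms(timers, budget_ms=15000))   # 3000
    print(slowest_phase(timers))                 # ('fence_ms', 10000)
```

Under these assumed numbers the fence confirmation dominates the window, which is the kind of finding that suggests timing the fence device before touching any other timer.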

The work is deliberately conservative. We do not rewrite a working cluster for fashion, replace Pacemaker because a newer orchestrator is popular, or pretend that every outage is solved by another node. We inspect what exists, find the assumptions that are no longer true, and make changes that your operators can understand at 03:00. A cluster nobody understands is not highly available; it is a delayed incident.

The best deliverable is confidence with boundaries. The team should know which failures are handled automatically, which failures require a human, what command sequence is safe, and which graphs prove that a failover completed inside budget. High availability is not magic uptime. It is the removal of ambiguous time.
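
One way to back the "inside budget" claim with evidence is to record timestamps during a drill and compare them with the agreed budget. The sketch below assumes three hypothetical event names (`failure_injected`, `node_fenced`, `service_healthy`) captured by whatever logging or monitoring the team already uses; the names, the helper functions, and the sample times are illustrative assumptions, not part of any Pacemaker tooling.

```python
from datetime import datetime, timedelta

# Hypothetical event names, in the order a clean drill should produce them.
REQUIRED_EVENTS = ("failure_injected", "node_fenced", "service_healthy")


def failover_phases(events: dict[str, datetime]) -> dict[str, timedelta]:
    """Split a drill into phases; refuse to guess if the record is ambiguous."""
    missing = [name for name in REQUIRED_EVENTS if name not in events]
    if missing:
        raise ValueError(f"missing drill events: {missing}")
    times = [events[name] for name in REQUIRED_EVENTS]
    if times != sorted(times):
        raise ValueError("drill events are out of order; treat the result as ambiguous")
    return {
        "until_fenced": events["node_fenced"] - events["failure_injected"],
        "until_healthy": events["service_healthy"] - events["node_fenced"],
        "total": events["service_healthy"] - events["failure_injected"],
    }


def within_budget(events: dict[str, datetime], budget: timedelta) -> bool:
    """True if the whole failover finished inside the agreed budget."""
    return failover_phases(events)["total"] <= budget


if __name__ == "__main__":
    # Made-up drill record for illustration.
    t0 = datetime(2000, 1, 1, 3, 0, 0)
    drill = {
        "failure_injected": t0,
        "node_fenced": t0 + timedelta(seconds=8),
        "service_healthy": t0 + timedelta(seconds=13),
    }
    print(failover_phases(drill)["total"].total_seconds())  # 13.0
    print(within_budget(drill, timedelta(seconds=15)))      # True
```

Raising an error on missing or out-of-order events is a deliberate choice in this sketch: a drill whose record cannot be ordered is itself an ambiguous state and should not be counted as a pass.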