Pacemaker and Patroni are often compared as if they were two brands of the same appliance. That framing creates bad decisions. Pacemaker is a general cluster manager that can supervise many kinds of resources, enforce ordering, use quorum, and coordinate fencing. Patroni is a PostgreSQL high availability controller that uses a distributed consensus store to decide which database instance should be leader. Both can keep PostgreSQL available. They arrive there through different operational contracts.

The useful question is not which tool is modern. The useful question is which failure model your team can actually operate at 03:00. A database outage is not only a software event. It is a sequence of detection, decision, fencing or demotion, promotion, client routing, application retry behavior, data-loss policy, and post-failover cleanup. The right tool is the one that makes those steps explicit enough for your environment.

Pacemaker grew from the world of Linux clusters. It expects resources, constraints, agents, monitors, colocation rules, ordering rules, and often STONITH fencing. That makes it broad and sometimes intimidating. It can manage a virtual IP, a filesystem, a PostgreSQL instance, a replication slot helper, and a service dependency as one graph. The cost is that somebody must understand the graph.

Patroni grew from the PostgreSQL world. It knows about PostgreSQL replication, leader locks, promotion, demotion, timelines, rewind, synchronous mode, tags, callbacks, and REST health endpoints. It delegates consensus to etcd, Consul, ZooKeeper, or Kubernetes. That makes it feel more database-native. The cost is that your high availability story now includes the consensus store, client routing layer, and the exact PostgreSQL behavior Patroni is orchestrating.

If the cluster protects more than PostgreSQL, Pacemaker may be the more natural fit. Some environments need to move an IP address with the database, mount storage in a strict order, start a local proxy, and prevent a second node from touching shared disks. Pacemaker was built for that orchestration. Patroni can call scripts and integrate with load balancers, but it is not a general-purpose resource manager.

If PostgreSQL is the central resource and the rest of the stack can discover the current leader through a proxy, service discovery, or Kubernetes service, Patroni may be simpler. Operators can reason in database terms: leader, replica, synchronous standby, replication lag, rewind, timeline, and failover candidate. The mental model is closer to what PostgreSQL already exposes.

The biggest Pacemaker advantage is explicit fencing. In a split-brain-sensitive environment, being able to say that a node was cut off from storage, power, or the network before another node takes over is valuable. Pacemaker can be configured badly, but when fencing is treated seriously, the cluster has a hard safety boundary. Many uncomfortable failovers become easier to discuss because the cluster is not merely hoping the old primary stepped aside.

The biggest Patroni advantage is PostgreSQL-specific intent. It understands promotion and demotion workflows better than a generic resource agent can by itself. It exposes API state that proxies can query. It has patterns for reinitializing replicas, using pg_rewind, handling replication lag, and controlling synchronous replication. Teams that live inside PostgreSQL every day often find this easier to audit than a broad cluster configuration.

Quorum looks different in the two systems. Pacemaker and Corosync have their own cluster membership and quorum semantics. Patroni relies on a distributed configuration store for leader election. Neither removes the need to design for partitions. You still need to know what happens when a database node can reach clients but not peers, when the consensus layer is slow, or when only the old primary can see a storage volume.
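As a rough illustration of why member count matters in either system, the arithmetic of a simple majority quorum can be sketched as follows. This assumes plain majority voting with one vote per member, which is how Raft-based stores such as etcd behave and how Corosync behaves by default; two-node modes, quorum devices, and weighted votes change the numbers. The function names are illustrative and come from neither tool.

```python
def majority_quorum(members: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still exists."""
    return members - majority_quorum(members)
```

Under these assumptions a two-member cluster tolerates no failures, and a four-member cluster tolerates only one, the same as three members. That is the usual argument for odd member counts.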

Client routing is usually where comparisons become practical. Pacemaker deployments often move a virtual IP or manage a local proxy as a clustered resource. Applications keep connecting to the same endpoint. Patroni deployments often pair with HAProxy, PgBouncer, Envoy, DNS, Kubernetes services, or cloud load balancers that check Patroni's REST API. The second model can be cleaner in dynamic infrastructure, but it adds another component whose health semantics matter.
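To make the "health semantics" point concrete, here is a minimal sketch of how a routing layer might turn Patroni health checks into a pool decision. It relies on the documented behavior that `GET /primary` answers 200 only on the node holding the leader lock and `GET /replica` answers 200 only on a running replica. The function name, the pool labels, and the rule for contradictory answers are assumptions made for illustration; the function works on status codes that some other component has already collected.

```python
from typing import Optional


def pool_for_node(primary_status: Optional[int],
                  replica_status: Optional[int]) -> str:
    """Map the status codes of GET /primary and GET /replica to a pool.

    None means the check timed out or the connection failed.
    Returns "write", "read", or "out" (hypothetical pool labels).
    """
    if primary_status == 200 and replica_status == 200:
        # Contradictory answers: safer to route nothing to this node.
        return "out"
    if primary_status == 200:
        return "write"
    if replica_status == 200:
        return "read"
    # Any other status, or no answer at all, keeps the node out of rotation.
    return "out"
```

A 200 from the REST API says what Patroni believes about the node's role. It does not prove that a client write would succeed, so a check like this is a floor, not a guarantee.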

Operational familiarity matters more than fashion. A team with years of Pacemaker practice, tested fence devices, and written failover drills should not migrate just because Patroni sounds newer. A team with PostgreSQL expertise, Kubernetes primitives, and little appetite for generic cluster constraints should not force Pacemaker into the design because an old runbook says database HA equals floating IP plus STONITH.

Pacemaker can become a maze when every exception is encoded as another constraint. The cluster may be correct and still hard to change safely. Patroni can become fragile when the consensus store is treated as magic, the proxy checks are too shallow, or replication lag policy is copied from an example. In both cases, the real risk is not the tool; it is an unowned failure contract.

Data loss policy should be decided before tool choice. If the business requires near-zero loss, synchronous replication and quorum choices become central. Patroni has direct controls for synchronous modes, but those modes have availability tradeoffs. Pacemaker can supervise PostgreSQL setups that use synchronous replication too, but the policy lives partly in PostgreSQL and partly in cluster behavior. Write the RPO and RTO first, then map the tool to it.
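One way to write the RTO down first is as a budget over the phases of a failover. The sketch below simply adds up per-phase estimates and reports the remaining margin. The phase names and any numbers fed into it are placeholders for your own measurements, not defaults of Pacemaker or Patroni.

```python
from typing import Dict, Tuple


def rto_budget(phase_seconds: Dict[str, float],
               rto_seconds: float) -> Tuple[float, float]:
    """Return (total failover time, margin left against the RTO).

    A negative margin means the design cannot meet the stated RTO
    even when every phase behaves as estimated.
    """
    total = sum(phase_seconds.values())
    return total, rto_seconds - total
```

For example, with hypothetical estimates of 30 seconds for detection, 5 for the decision, 10 for promotion, and 15 for client rerouting, a 120 second RTO leaves 60 seconds of margin for application retries and everything that was underestimated.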

Recovery after failover deserves the same attention as failover itself. Patroni's PostgreSQL-aware workflows can make former primary handling cleaner, especially with pg_rewind. Pacemaker environments can be safe too, but they need careful resource agents, fencing, and operator procedure. The painful incidents often happen after traffic has moved, when someone tries to bring the old node back and discovers that the cluster state, replication state, and human story disagree.

Testing should include ugly failures, not only the happy demo. Kill the PostgreSQL process. Freeze disk writes. Partition the database from the consensus store. Delay the proxy checks. Make the primary responsive to SSH but not useful to PostgreSQL clients. Restart the standby under load. Run the operator command twice. The comparison between Pacemaker and Patroni becomes much clearer when both are asked to handle the failures you actually fear.

For small teams, Patroni often wins when PostgreSQL is the only clustered thing and a managed consensus layer or Kubernetes already exists. It gives database-shaped behavior with less generic cluster vocabulary. For infrastructure teams already operating Linux clusters, Pacemaker can still be the honest choice, especially when fencing, shared resources, or multi-resource ordering are non-negotiable.

For regulated or conservative environments, the answer may be whichever design can be evidenced. Auditors and risk owners do not need a trendy diagram. They need proof of failover testing, split-brain prevention, backup and restore validation, access control, monitoring, and rollback procedure. Pacemaker can produce that evidence. Patroni can produce that evidence. The weaker design is the one whose evidence depends on one engineer's memory.

A useful decision record should say: what PostgreSQL version is in scope, what infrastructure hosts it, what consensus or cluster membership mechanism exists, how clients find the leader, what fencing or demotion prevents split brain, what replication mode protects data, what failures were tested, and what operators do when automation refuses to decide. Without that record, the tool name is trivia.
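The record described above can be kept as structured data so that gaps are visible. The following is a minimal sketch; the class and field names are invented for illustration and mirror the questions in the paragraph one to one.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class HADecisionRecord:
    """One field per question the decision record should answer."""
    postgresql_version: str = ""
    hosting_infrastructure: str = ""
    membership_mechanism: str = ""     # consensus store or cluster membership
    client_routing: str = ""           # how clients find the leader
    split_brain_prevention: str = ""   # fencing or demotion
    replication_mode: str = ""
    failures_tested: str = ""
    manual_procedure: str = ""         # what operators do when automation refuses

    def unanswered(self) -> List[str]:
        """Names of the fields that are still empty or whitespace-only."""
        return [f.name for f in fields(self)
                if not getattr(self, f.name).strip()]
```

A design review could refuse to proceed while `unanswered()` returns anything, which turns "the tool name is trivia" into a checkable condition.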

The MSMSoft bias is toward boring, testable contracts. Choose Pacemaker when you need a broad Linux cluster manager with strong fencing and resource orchestration your team understands. Choose Patroni when you want PostgreSQL-native leadership and your routing and consensus layers are first-class citizens. In both cases, spend less energy winning the tool argument and more energy removing the ambiguous minutes around failure.

If you are unsure, run a design review before migration. Build two thin prototypes and score them against your actual incident scenarios. Count components, operator actions, recovery commands, monitoring signals, and rollback paths. The result may surprise both camps. The best high availability design is not the one with the most impressive architecture diagram. It is the one your team can safely operate when the primary is lying.
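Scoring the two prototypes can be as simple as a weighted sum over incident scenarios. The sketch below assumes a team-defined scale where a higher result means the prototype handled the scenario better and a higher weight means the scenario is feared more. The scale, the scenario names, and the function itself are hypothetical.

```python
from typing import Dict


def weighted_score(weights: Dict[str, int], results: Dict[str, int]) -> int:
    """Sum of weight * result over every scenario in `weights`.

    Raises ValueError if a weighted scenario was never tested, so an
    untested failure cannot silently count as zero.
    """
    missing = set(weights) - set(results)
    if missing:
        raise ValueError(f"untested scenarios: {sorted(missing)}")
    return sum(weights[name] * results[name] for name in weights)
```

Raising on untested scenarios is a deliberate choice: a prototype should not score well because nobody tried the failure it handles worst.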

Separately, assess who owns changes. Pacemaker usually requires a confident Linux operator who understands Corosync, resource agents, fencing topology, ordering constraints, and the consequences of manual cleanup. Patroni usually requires a confident PostgreSQL operator who understands WAL, timelines, replication slots, synchronous commit, pg_rewind, and DCS state. If those competencies live in different teams, the tool can become a line of conflict rather than a high availability solution.

Operating cost differs too. Pacemaker more often demands discipline around nodes, fence devices, agent versions, and resource documentation. Patroni more often demands discipline around etcd or Consul, health-check contracts, proxies, network policies, and client behavior during a leader change. Do not compare only the number of configuration lines. Compare the number of components that must be alive, understood, and tested during an incident.

An important practical criterion is how the system declines to make an automatic decision. A good HA design must be able to say: the state is ambiguous, automatic failover is unsafe, a human is needed. Pacemaker does this through quorum, fencing, and resource state. Patroni does this through the leader lock, DCS availability, candidate tags, and PostgreSQL constraints. In both cases, the refusal of automation must be visible in monitoring and described in the runbook.
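The idea of automation that can refuse to decide can be sketched as a function with three outcomes instead of two. This is an illustrative policy under stated assumptions, not how Pacemaker or Patroni decide internally; the inputs, the names, and the thresholds are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    WAIT = "wait"
    NEEDS_HUMAN = "needs_human"


def decide(has_quorum: bool,
           old_primary_stopped: bool,
           candidate_lag_bytes: int,
           max_lag_bytes: int) -> Decision:
    """Return what the automation should do about a suspected primary failure."""
    if not has_quorum:
        # A minority side must not promote; it waits for membership to recover.
        return Decision.WAIT
    if not old_primary_stopped:
        # No proof of fencing or demotion: promoting now risks split brain.
        return Decision.NEEDS_HUMAN
    if candidate_lag_bytes > max_lag_bytes:
        # Promotion would lose more data than the policy allows.
        return Decision.NEEDS_HUMAN
    return Decision.PROMOTE
```

The useful property is that `WAIT` and `NEEDS_HUMAN` are distinct, named outcomes that monitoring can alert on, rather than side effects of a failover that silently did not happen.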

Do not forget read traffic. Many PostgreSQL clusters serve not only writes on the leader but also reads from replicas. Patroni exposes convenient role and lag signals that a proxy can use for routing. Pacemaker can manage separate endpoints but often requires more explicit integration. If replica reads matter to the product, include lag, stale reads, and reporting behavior in the comparison, not only the primary-failure scenario.
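A lag-aware read policy can be sketched in a few lines. The `Replica` type, its fields, and the threshold are assumptions for illustration; in practice the lag and health values would come from whatever the routing layer already collects.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Replica:
    name: str
    lag_bytes: int
    healthy: bool


def readable_replicas(replicas: Iterable[Replica],
                      max_lag_bytes: int) -> List[str]:
    """Names of replicas that are healthy and within the lag limit."""
    return [r.name for r in replicas
            if r.healthy and r.lag_bytes <= max_lag_bytes]
```

If the result is empty, the product has to choose between sending reads to the leader and failing them, and that choice belongs in the comparison as well.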

Backup and restore must be part of the choice. High availability does not replace data recovery. Patroni can cleanly reinitialize replicas, but it does not save you from a mistaken DELETE, a corrupted backup, or a wrong retention policy. Pacemaker can bring the service back quickly, but it is not responsible for the meaning of the data either. Before choosing, verify PITR, restore time, WAL archive availability, and the restore procedure in a separate environment.

Network partitions are the main exam. A node can see clients but not the DCS. A node can see its Corosync peers but have a broken path to storage. A proxy can see the REST API without verifying that writes actually work. A good comparison table should include these failures, because they show where the tool makes a decision, where it waits, and where it shifts risk to the operator.

Migrating between the approaches needs its own plan. You cannot simply replace Pacemaker with Patroni, or the reverse, without changing monitoring, client routing, runbooks, maintenance procedures, and on-call training. Start with a parallel model on staging, then test recovery of the old primary, application switchover, rollback, and the behavior of backup processes. The most dangerous period is not the new steady state but the transition, when two mental models exist at once.

There is also a third option that belongs in the comparison: a managed PostgreSQL service. Amazon RDS and Aurora, Google Cloud SQL, Azure Database for PostgreSQL, Crunchy Bridge, and similar platforms do not remove failover design. They move part of the design into a provider contract. That can be the right answer when the business needs a service-level commitment more than it needs control over every cluster mechanism.

A managed service changes the shape of the operational question. You still need RPO, RTO, backup, restore, read routing, maintenance windows, extension policy, connection pooling, and incident communication. What you may not need is ownership of Corosync, STONITH devices, etcd quorum, replica bootstrap scripts, or the exact promotion command. That reduction is valuable if the team is small or if PostgreSQL availability is important but not a differentiating competency.

The tradeoff is that provider behavior becomes part of the failure model. You need to know what the platform does during zone loss, storage impairment, maintenance, certificate rotation, major version upgrade, replica lag, and control-plane degradation. A managed service is not magic; it is an external runbook with contractual boundaries. Read those boundaries before comparing it with Pacemaker or Patroni.

A useful decision record should therefore have three columns, not two. Pacemaker is strongest when explicit Linux-cluster control, fencing topology, virtual IPs, and multi-resource orchestration are central. Patroni is strongest when PostgreSQL-native leader control, DCS-backed promotion, and database-team ownership are central. Managed PostgreSQL is strongest when the organization wants to buy down operational surface area and can accept provider-specific constraints, cost, and black-box behavior.

The final recommendation should be boring and verifiable. If you choose Pacemaker, write down the fence path, quorum policy, managed resources, cleanup commands, and the conditions for manual intervention. If you choose Patroni, write down the DCS SLA, proxy health checks, synchronous mode, failback policy, and candidate restrictions. Then run a drill and update the decision based on the results. Without that cycle, the comparison remains an opinion, not an engineering decision.

Verify observability before production. Pacemaker needs alerts on quorum, failed actions, fencing latency, resource stickiness, and unexpected location. Patroni needs alerts on DCS latency, leader changes, replication lag, failed rewind, unhealthy replicas, and proxies that disagree with the current role. If monitoring cannot tell a safe pause from dangerous uncertainty, the on-call engineer will learn about the problem from the application, not from the HA layer.

The comparison should also cover routine maintenance. How are node patching, PostgreSQL minor upgrades, certificate rotation, fence device replacement, etcd upgrades, VIP moves, and health-check changes carried out? A tool that looks simple during an incident can turn out to be heavy in month-to-month operation. And a tool with a more complex model can be more reliable if every routine operation is rehearsed and written down.

Do not try to run both approaches at once without a clear boundary of responsibility. A Pacemaker that manages the PostgreSQL resource and a Patroni that simultaneously considers itself the owner of the leader easily create two competing authorities. If a transition period is needed, one system must be authoritative while the other only observes or serves an auxiliary layer. Two automatons that both have the right to promote are not reliability; they are split brain waiting for a convenient moment.