2026-04-12 | A failover that waited 480 ms longer than planned | The cluster worked. The service recovered. The ugly part was the half second nobody had budgeted for. | High availability | 6 min
2026-03-28 | When one Prometheus recording rule hid the regression | The dashboard was green because the query was polite. Production was not. | Observability | 7 min
2026-02-09 | When load shedding becomes the product behavior | Overload handling is not an implementation detail. Under pressure, it is the product your users actually get. | High load | 5 min
2025-11-18 | Reading service logs across hosts without panicking | During an incident, logs are not a novel. They are a crime scene with bad lighting. | Linux | 8 min
2026-05-20 | Pacemaker vs Patroni for PostgreSQL high availability | A practical comparison of Pacemaker and Patroni for PostgreSQL failover, with tradeoffs around fencing, consensus, routing, operations, and recovery. | High availability | 18 min
2026-05-20 | Anatomy of an outage: how production incidents really unfold | A long-form field guide to outage mechanics: detection, triage, blast radius, mitigation, communication, postmortems, and the contracts that keep incidents small. | Production recovery | 35 min
2026-05-21 | PostgreSQL failover checklist before maintenance | A practical PostgreSQL failover checklist for maintenance windows: replication lag, fencing assumptions, client routing, rollback conditions, and proof of recovery. | High availability | 9 min
2026-05-21 | How to read p99 latency during a partial outage | A concrete p99 latency troubleshooting sequence for partial outages where averages look calm and real users are still waiting. | Observability | 9 min
2026-05-21 | Load shedding design patterns for APIs | A practical guide to API load shedding design: admission control, queue separation, cheap rejection, retry hints, brownout modes, and observability. | High load | 10 min