Production recovery consultant for urgent systems work

A production recovery consultant is useful when production is unstable, the room is noisy, and the team needs an experienced operator to help make the next safe decision. MSMSoft joins incidents or post-incident recovery work for Linux, high-availability, high-load, observability, cloud, and dependency failures where the priority is to stabilize service without making the damage larger.

We work calmly from evidence: what changed, what customers see, which dependencies are healthy, what can be rolled back, and what must be isolated. The engagement can start during an active incident or immediately after service returns, with a focus on recovery actions, root-cause clarity, and preventing the same failure from becoming routine.

When you need a production recovery consultant

An incident keeps reopening because rollback, failover, or capacity relief helped only temporarily.
Teams disagree about root cause while customers are still seeing timeouts, partial failures, or stale data.
A dependency must be isolated quickly, but nobody is sure what else will break when it is removed.
Logs, metrics, and alerts are noisy enough that operators are following the loudest symptom instead of the first cause.
You need outside incident discipline without handing control of production to someone who does not understand the risk.

How we work

Establish a short incident timeline, current customer impact, recent changes, active mitigations, and unsafe actions to avoid.
Triage dependencies, capacity, health checks, deployment state, queues, data stores, and network paths for the fastest stabilizing move.
Prefer reversible actions: rollback, traffic shift, rate limit, queue pause, feature disable, isolation, or failover with explicit verification.
After stabilization, reconstruct root cause with evidence and identify the monitoring or runbook gap that delayed recovery.
Create a recovery report with permanent fixes, owner handoff, and drills for the failure path that just occurred.

Selected work

2025

Quote latency, tail cut from 4 ms to 0.6 ms

A trading platform was losing time in the host path after a kernel update. The NIC was not the bottleneck.

Pinned IRQs, corrected queue affinity, and removed a misleading autoscaling rule from the incident path.

2024

Cluster failover, 14-month clean run

A revenue-critical service had intermittent primary-node failures during maintenance windows.

Reworked health checks, fencing, and failover timing so traffic moved before user-visible failure.

Related field notes

HA: A failover that waited 480 ms longer than planned (6 min)
observability: When one Prometheus recording rule hid the regression (7 min)
linux: Reading service logs across hosts without panicking (8 min)

Production recovery consultant work is different from normal advisory work because the clock is real. The first job is not to be clever. It is to reduce harm, make the system smaller, and help the team choose actions that can be verified quickly. In a live incident, the wrong expert can become another source of pressure. We keep the work grounded: one timeline, one current impact statement, one next action, and one rollback path.

Stabilization may mean moving traffic, disabling an expensive feature, pausing a queue, restoring a previous configuration, isolating a dependency, changing a timeout, or proving that a suspected layer is not involved. We avoid irreversible changes unless the team explicitly chooses that risk. When production is loud, reversibility is a feature.
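
As a rough illustration of what "reversible with explicit verification" can look like in tooling, here is a minimal Python sketch. Everything in it is hypothetical: the `ReversibleAction` type, the `run_with_verification` helper, and the in-memory `state` dictionary stand in for whatever real controls and checks a team already has. It is not a description of any specific client's tooling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReversibleAction:
    """A mitigation that can be undone. All names here are hypothetical."""
    name: str
    apply: Callable[[], None]      # make the change
    verify: Callable[[], bool]     # did the change help?
    rollback: Callable[[], None]   # undo the change


def run_with_verification(action: ReversibleAction) -> bool:
    """Apply the action and keep it only if verification passes."""
    action.apply()
    try:
        ok = action.verify()
    except Exception:
        ok = False  # a check that crashes counts as "not verified"
    if ok:
        return True
    action.rollback()
    return False


# Toy stand-in for production: an in-memory flag and a made-up error rate.
state = {"queue_paused": False, "error_rate": 0.12}


def pause_queue() -> None:
    state["queue_paused"] = True
    state["error_rate"] = 0.01  # assumption: the pause relieves pressure


def resume_queue() -> None:
    state["queue_paused"] = False


pause = ReversibleAction(
    name="pause-queue",
    apply=pause_queue,
    verify=lambda: state["error_rate"] < 0.05,
    rollback=resume_queue,
)

kept = run_with_verification(pause)
print(f"{pause.name}: {'kept' if kept else 'rolled back'}")
```

The design choice worth noting is that the rollback path is defined before the action runs, so nobody has to invent it under pressure.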

After recovery, the temptation is to write a clean story too quickly. Incidents often have more than one contributing factor: a deploy exposed a latent storage issue, a cache miss pattern triggered database pressure, a health check passed while the service was useless, or an alert fired only after customers were already affected. We reconstruct the chain with evidence and separate trigger, cause, amplifier, and delayed detection.
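
One way to keep trigger, cause, amplifier, and delayed detection separate is to tag each timeline entry with its role and the evidence behind it. The Python sketch below is illustrative only: the `Role` and `Event` types, the helper functions, and every timestamp are made up for the example, and the sample entries simply reuse the kinds of factors named above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Role(Enum):
    TRIGGER = "trigger"
    CAUSE = "cause"
    AMPLIFIER = "amplifier"
    IMPACT = "impact"
    DETECTION = "detection"


@dataclass(frozen=True)
class Event:
    at: datetime
    role: Role
    note: str
    evidence: str  # where the claim can be checked


def by_role(events):
    """Group events by role, each group in chronological order."""
    grouped = {role: [] for role in Role}
    for event in sorted(events, key=lambda e: e.at):
        grouped[event.role].append(event)
    return grouped


def detection_delay(events) -> timedelta:
    """Time between first customer impact and first detection.

    Assumes the timeline holds at least one IMPACT and one DETECTION event.
    """
    grouped = by_role(events)
    first_impact = grouped[Role.IMPACT][0].at
    first_detection = grouped[Role.DETECTION][0].at
    return first_detection - first_impact


t0 = datetime(2025, 1, 1, 10, 0)  # made-up timestamps for illustration
timeline = [
    Event(t0, Role.TRIGGER, "deploy rolled out", "deploy log"),
    Event(t0 + timedelta(minutes=2), Role.CAUSE,
          "latent storage issue exposed", "storage error log"),
    Event(t0 + timedelta(minutes=3), Role.AMPLIFIER,
          "cache misses push load to the database", "cache hit-rate graph"),
    Event(t0 + timedelta(minutes=4), Role.IMPACT,
          "customers see timeouts", "support tickets"),
    Event(t0 + timedelta(minutes=13), Role.DETECTION,
          "first alert fires", "alert history"),
]

delay = detection_delay(timeline)
print(f"detection lagged customer impact by {delay}")
```

Requiring an `evidence` field on every entry is deliberate: an entry with nothing to point at is a hypothesis, not part of the timeline yet.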

The final value is not a blame-free paragraph nobody reads. It is a tighter way of operating on the next bad day: better rollback notes, clearer ownership, safer dependency isolation, a monitoring change that would have caught the symptom earlier, and a drill that proves the team can recover without improvising every step.