What kind of work is the best fit for MSMSoft?

MSMSoft is best suited for production systems, urgent diagnostics, architecture recovery, Linux systems engineering, high availability, observability, and high-load infrastructure work.

What kind of work is not a fit?

MSMSoft is not a good fit for generic web builds, commodity staff augmentation, or routine cloud migrations that do not require deep production-systems diagnosis.

How quickly does MSMSoft reply to engagement requests?

MSMSoft replies within one business day for normal engagement requests and provides same-hour triage for active production incidents when available.

How does an MSMSoft engagement usually work?

MSMSoft engagements usually start with a focused diagnostic call, move into direct engineering work with written findings, and end with concrete fixes, runbooks, or architecture decisions.

01 - Practice

Advanced engineering for production systems that have to stay up.

We design, tune, and recover the infrastructure under your product. Linux, HA clusters, high-load systems, cloud platform choices, observability, and the work that runs at 3 AM when the dashboards go red.

02 - Practice

What we work on.

Nine areas of work. We are not a 40-person agency with a long menu. These are the systems where experienced diagnosis changes the outcome.

Linux & systems engineering

Kernel parameters, storage paths, scheduler behavior, systemd units, and the host-level details that decide whether a product stays predictable.

High availability

Pacemaker and Corosync clusters, fencing, failover drills, split-brain prevention, and recovery procedures that have actually been exercised.

Performance & diagnostics

Latency budgets, profiling, packet captures, flame graphs, IRQ affinity, and the measurement discipline needed before changing production.

Cloud platform selection

AWS, GCP, Azure, DigitalOcean, bare metal, or hybrid: selection based on failure modes, cost shape, latency, operations maturity, and lock-in risk.

SPOF removal & system design

Architecture reviews that identify single points of failure, narrow blast radius, isolate dependencies, and turn failover behavior into an explicit design.

High-load architecture

Capacity planning, queue behavior, cache pressure, database hot paths, and load-shedding designs for systems that must stay boring under traffic.

Production recovery

Incident stabilization, root-cause analysis, rollback paths, dependency isolation, and the operational decisions that get a service back under control.

Observability & log systems

Monitoring, alerting, log management, metrics, traces, and dashboards built around failure detection, incident response, and operational decisions.

Automation & platform ops

Operational automation, deployment mechanics, observability, runbooks, and the small platform decisions that remove repeated human work.

03 - Engagements

Selected work, anonymized.

2025

Quote latency, tail cut from 4 ms to 0.6 ms

A trading platform was losing time in the host path after a kernel update. The NIC was not the bottleneck.

Pinned IRQs, corrected queue affinity, and removed a misleading autoscaling rule from the incident path.

linux
performance
kernel

2024

Cluster failover, 14-month clean run

A revenue-critical service had intermittent primary-node failures during maintenance windows.

Reworked health checks, fencing, and failover timing so traffic moved before user-visible failure.

ha
clusters
failover

2024

High-load API path made predictable

A customer-facing API had unpredictable tail latency whenever batch jobs and live traffic overlapped.

Separated queues, capped expensive work, documented overload behavior, and reduced manual intervention.

high-load
api
reliability
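
As a rough illustration of what "separated queues" and "capped expensive work" can look like, here is a minimal sketch. It is not the code from this engagement: the class name, the queue names, and the limits are all hypothetical, and it assumes a single-process worker that drains jobs in strict priority order.

```python
import queue


class AdmissionController:
    """Keeps live and batch work in separate bounded queues.

    Hypothetical names and limits, chosen only for illustration.
    """

    def __init__(self, live_limit: int = 100, batch_limit: int = 10):
        # A small batch limit caps how much expensive work can pile up.
        self.queues = {
            "live": queue.Queue(maxsize=live_limit),
            "batch": queue.Queue(maxsize=batch_limit),
        }

    def submit(self, kind: str, job) -> bool:
        """Enqueue a job; return False (a cheap rejection) if its queue is full."""
        try:
            self.queues[kind].put_nowait(job)
            return True
        except queue.Full:
            return False

    def next_job(self):
        """Serve live traffic first; take batch work only when live is empty."""
        for kind in ("live", "batch"):
            try:
                return kind, self.queues[kind].get_nowait()
            except queue.Empty:
                continue
        return None
```

Strict priority keeps live latency predictable but can starve batch jobs under sustained load, so a real design would usually document that overload behavior or reserve some capacity for batch work.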

04 - On call

We answer the page.

Deep systems work is not a slide deck. We stay close to the logs, alerts, rollout mechanics, and failover paths that decide whether a system recovers cleanly.

Median ack: 2:14 (retainer incidents)
Observed SLA: 99.95% (availability design target)
Stack depth: 8 (kernel to protocol)
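
To make the availability figure concrete, the sketch below converts a target into a downtime budget. It assumes the 99.95 figure is a percentage measured over a fixed window; the function name and the window lengths are illustrative and not taken from any contract.

```python
def downtime_budget_minutes(availability_pct: float, window_days: float) -> float:
    """Minutes of downtime permitted in the window at the given availability."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    # Assumed windows: a 30-day month and a 365-day year.
    for label, days in (("30 days", 30), ("365 days", 365)):
        print(f"{label}: {downtime_budget_minutes(99.95, days):.1f} min")
```

Under these assumptions, 99.95% allows roughly 21.6 minutes of downtime per 30 days, or about 262.8 minutes per year.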

05 - Writing

Field notes and post-mortems.

We publish what we can: logs, failure patterns, and operational notes from systems that made people think twice.

2026-04-12 - A failover that waited 480 ms longer than planned
The cluster worked. The service recovered. The ugly part was the half second nobody had budgeted for.
High availability, 6 min

2026-03-28 - When one Prometheus recording rule hid the regression
The dashboard was green because the query was polite. Production was not.
Observability, 7 min

2026-02-09 - When load shedding becomes the product behavior
Overload handling is not an implementation detail. Under pressure, it is the product your users actually get.
High load, 5 min

2025-11-18 - Reading service logs across hosts without panicking
During an incident, logs are not a novel. They are a crime scene with bad lighting.
Linux, 8 min

2026-05-20 - Pacemaker vs Patroni for PostgreSQL high availability
A practical comparison of Pacemaker and Patroni for PostgreSQL failover, with tradeoffs around fencing, consensus, routing, operations, and recovery.
High availability, 18 min

2026-05-20 - Anatomy of an outage: how production incidents really unfold
A long-form field guide to outage mechanics: detection, triage, blast radius, mitigation, communication, postmortems, and the contracts that keep incidents small.
Production recovery, 35 min

2026-05-21 - PostgreSQL failover checklist before maintenance
A practical PostgreSQL failover checklist for maintenance windows: replication lag, fencing assumptions, client routing, rollback conditions, and proof of recovery.
High availability, 9 min

2026-05-21 - How to read p99 latency during a partial outage
A concrete p99 latency troubleshooting sequence for partial outages where averages look calm and real users are still waiting.
Observability, 9 min

2026-05-21 - Load shedding design patterns for APIs
A practical guide to API load shedding design: admission control, queue separation, cheap rejection, retry hints, brownout modes, and observability.
High load, 10 min

06 - Contact

Tell us what is broken.

A short email is enough. Include what changed, what failed, what you tried, and the first log line that made you stop.

To:      [email protected]
Subject: [P1] failover loop on cluster-02
PGP key: 9F4A 88E1 6C2D 41B7 0FA3
Status:  https://status.msmsoft.com

Engagement intake

Best fit: Production systems, urgent diagnostics, architecture recovery.
Not a fit: Generic web builds, staff augmentation, commodity cloud migration.
First reply: Within one business day; same-hour triage for active incidents.
