Linux & systems engineering
We design, tune, and recover the infrastructure under your product. Linux, HA clusters, high-load systems, cloud platform choices, observability, and the work that runs at 3 AM when the dashboards go red.
Nine areas of work. We are not a 40-person agency with a long menu. These are the systems where experienced diagnosis changes the outcome.
Kernel parameters, storage paths, scheduler behavior, systemd units, and the host-level details that decide whether a product stays predictable.
Pacemaker and Corosync clusters, fencing, failover drills, split-brain prevention, and recovery procedures that have actually been exercised.
Latency budgets, profiling, packet captures, flame graphs, IRQ affinity, and the measurement discipline needed before changing production.
AWS, GCP, Azure, DigitalOcean, bare metal, or hybrid: selection based on failure modes, cost shape, latency, operations maturity, and lock-in risk.
Architecture reviews that identify single points of failure, narrow blast radius, isolate dependencies, and turn failover behavior into an explicit design.
Capacity planning, queue behavior, cache pressure, database hot paths, and load-shedding designs for systems that must stay boring under traffic.
Incident stabilization, root-cause analysis, rollback paths, dependency isolation, and the operational decisions that get a service back under control.
Monitoring, alerting, log management, metrics, traces, and dashboards built around failure detection, incident response, and operational decisions.
Operational automation, deployment mechanics, observability, runbooks, and the small platform decisions that remove repeated human work.
A trading platform was losing time in the host path after a kernel update. The NIC was not the bottleneck.
Pinned IRQs, corrected queue affinity, and removed a misleading autoscaling rule from the incident path.
A revenue-critical service had intermittent primary-node failures during maintenance windows.
Reworked health checks, fencing, and failover timing so traffic moved before user-visible failure.
A customer-facing API had unpredictable tail latency whenever batch jobs and live traffic overlapped.
Separated queues, capped expensive work, documented overload behavior, and reduced manual intervention.
Deep systems work is not a slide deck. We stay close to the logs, alerts, rollout mechanics, and failover paths that decide whether a system recovers cleanly.
We publish what we can: logs, failure patterns, and operational notes from systems that made people think twice.
A short email is enough. Include what changed, what failed, what you tried, and the first log line that made you stop.
To: [email protected]
Subject: [P1] failover loop on cluster-02
PGP key: 9F4A 88E1 6C2D 41B7 0FA3
Status: https://status.msmsoft.com
Engagement intake