01 - Service

Linux performance consultant for production systems

A Linux performance consultant is useful when the application team has already optimized the obvious code path and production is still slow, noisy, or unpredictable. MSMSoft works with infrastructure teams that need host-level diagnosis across kernels, storage, networking, schedulers, cgroups, and systemd, then turns the findings into changes operators can safely own.

The engagement is practical: we start with evidence from the failing hosts, define the performance contract that matters to the business path, and separate symptoms from causes before changing production. You get a short list of safe fixes, risky experiments, rollback notes, and the measurements that prove whether the change helped.

When you need a Linux performance consultant

Tail latency moved after a kernel, hypervisor, or instance-family change, but CPU averages still look normal.
Network or disk interrupts are landing on the wrong cores and the application is blamed for scheduler noise.
systemd units restart cleanly in staging but hang, flap, or leave stale resources during production incidents.
Storage queues, page cache pressure, NUMA placement, or cgroup limits behave differently under live traffic than under benchmarks.
You need an outside operator to read perf, flame graphs, packet captures, journal logs, and host counters without turning the incident into a tool demo.
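
The first two symptoms above share a pattern: a host-wide CPU average can look normal while a single core is saturated, for example by interrupt handling. The following is a minimal Python sketch of that effect, assuming two snapshots of /proc/stat taken some interval apart. The snapshot values and the function names are hypothetical illustrations, not measurements from an engagement.

```python
# Sketch: per-CPU busy fraction between two /proc/stat snapshots.
# BEFORE and AFTER are made-up sample data; on a real host you would read
# /proc/stat twice, a few seconds apart.

BEFORE = """\
cpu  4000 0 2000 32000 400 0 400 0 0 0
cpu0 1000 0 500 8000 100 0 400 0 0 0
cpu1 1000 0 500 8000 100 0 0 0 0 0
cpu2 1000 0 500 8000 100 0 0 0 0 0
cpu3 1000 0 500 8000 100 0 0 0 0 0
"""

AFTER = """\
cpu  4035 0 2025 32270 400 0 470 0 0 0
cpu0 1020 0 510 8000 100 0 470 0 0 0
cpu1 1005 0 505 8090 100 0 0 0 0 0
cpu2 1005 0 505 8090 100 0 0 0 0 0
cpu3 1005 0 505 8090 100 0 0 0 0 0
"""


def parse_cpu_lines(stat_text):
    """Return {cpu_name: (total_ticks, idle_ticks)} for each per-CPU line."""
    result = {}
    for line in stat_text.splitlines():
        parts = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines such as "ctxt".
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        # Fields: user nice system idle iowait irq softirq steal.
        # guest and guest_nice are already counted inside user and nice.
        values = [int(v) for v in parts[1:9]]
        idle = values[3] + values[4]  # idle + iowait
        result[parts[0]] = (sum(values), idle)
    return result


def busy_fraction(before_text, after_text):
    """Fraction of non-idle ticks per CPU between the two snapshots."""
    before = parse_cpu_lines(before_text)
    after = parse_cpu_lines(after_text)
    fractions = {}
    for cpu, (total_after, idle_after) in after.items():
        total = total_after - before[cpu][0]
        idle = idle_after - before[cpu][1]
        fractions[cpu] = (total - idle) / total if total else 0.0
    return fractions


if __name__ == "__main__":
    fractions = busy_fraction(BEFORE, AFTER)
    average = sum(fractions.values()) / len(fractions)
    print(f"average busy: {average:.1%}")
    for cpu, value in sorted(fractions.items()):
        print(f"{cpu}: {value:.1%}")
```

In the sample data one core is fully busy while the other three are nearly idle, so the average alone would not flag a problem. Per-core views are usually the first thing to check before blaming the application.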

How we work

Build a timeline from host telemetry, deploy history, kernel and package changes, traffic shape, and first user-visible symptoms.
Measure the disputed path with the least invasive tools first: sar, perf, eBPF where appropriate, packet captures, scheduler data, and service logs.
Reduce the problem to testable claims: IRQ placement, run queue delay, lock contention, disk wait, allocator behavior, throttling, or dependency back pressure.
Make small changes with explicit rollback: affinity, queue sizing, kernel parameters, service limits, unit dependencies, or safer failure behavior.
Leave behind a runbook that says what was changed, what was rejected, and which graph or command should prove the next regression.
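
One of the testable claims above, IRQ placement, can often be checked from /proc/interrupts before reaching for heavier tools. The sketch below is one way to do that in Python, assuming the usual layout of a CPU header row followed by per-IRQ count rows. The device name eth0, the SAMPLE text, and the function name are hypothetical and only serve the illustration.

```python
# Sketch: sum interrupt counts per CPU for IRQs whose description mentions
# a given device. SAMPLE is made-up data shaped like /proc/interrupts.

SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  24:       1000          0          0          0   PCI-MSI 524288-edge      eth0-TxRx-0
  25:        900          5          0          0   PCI-MSI 524289-edge      eth0-TxRx-1
  26:          0          0         40         60   PCI-MSI 1048576-edge     nvme0q1
 NMI:          2          2          2          2   Non-maskable interrupts
 ERR:          0
"""


def irq_counts_by_cpu(interrupts_text, device_substring):
    """Return {cpu_name: total_count} for rows matching device_substring."""
    lines = interrupts_text.strip("\n").splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        _label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        # Leading numeric fields are the per-CPU counts for this IRQ.
        counts = []
        for field in fields[:len(cpus)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        # Rows such as ERR carry a single count and are skipped here.
        if len(counts) != len(cpus) or device_substring not in description:
            continue
        for cpu, count in zip(cpus, counts):
            totals[cpu] += count
    return totals


if __name__ == "__main__":
    # On a real host: irq_counts_by_cpu(open("/proc/interrupts").read(), "eth0")
    for cpu, count in irq_counts_by_cpu(SAMPLE, "eth0").items():
        print(cpu, count)
```

A heavily skewed result is a prompt for further measurement, not a verdict. Whether that placement is wrong depends on where the hottest application threads run.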

Selected work

2025

Quote latency, tail cut from 4 ms to 0.6 ms

A trading platform was losing time in the host path after a kernel update. The NIC was not the bottleneck.

Pinned IRQs, corrected queue affinity, and removed a misleading autoscaling rule from the incident path.

Related field notes

Linux: Reading service logs across hosts without panicking (8 min)

Linux performance consultant work is not a replacement for application ownership. It is the part of production engineering that asks whether the host is telling the truth. We look at interrupts, queues, caches, scheduler delay, kernel defaults, service supervision, and the boundary between bare metal, virtual machines, and containers. The goal is not to collect exotic commands. The goal is to decide which layer is allowed to be slow and which layer is only being accused.

Most teams call after a few reasonable attempts have failed. Someone changed an instance type, applied a kernel update, moved traffic to a new NIC, added a sidecar, or tightened container limits. The graphs are close enough to normal that the incident keeps reopening. We bring a disciplined outside view: reproduce the symptom where possible, protect production where necessary, and avoid global tuning until one failing path has been measured end to end.

A good outcome is often boring. An interrupt moves. A queue stops sharing a core with the hottest thread. A unit gets a real readiness dependency. A kernel parameter is removed because it was copied from an old post and no longer applies. The important part is that the team knows why. We will not leave you with a mystery sysctl, a dashboard nobody trusts, or a benchmark that only passes when customers are absent.

We also say no when Linux is not the problem. If the evidence points to a database plan, application lock, downstream timeout, or product overload policy, the report says that clearly. Host-level work is valuable because it narrows the search. Sometimes it fixes the incident. Sometimes it proves the host is innocent and gives the application team the detail they need to move.