08 - Service

Prometheus observability audits for production teams

A Prometheus observability audit is useful when dashboards are green, incidents are not, and the team suspects its metrics, logs, traces, or alerts no longer describe production. MSMSoft reviews Prometheus and adjacent observability systems for teams that need signal they can trust during incidents, not decorative panels for status meetings.

We inspect what is measured, what is aggregated away, which labels create cardinality risk, which alerts wake people, and whether logs can reconstruct a real failure timeline. The result is a focused remediation plan: safer recording rules, better alert contracts, lower-noise dashboards, and runbooks that connect telemetry to operator decisions.

Start an engagement

When you need a Prometheus observability audit

Prometheus is healthy, but recording rules or aggregates hide the slow path customers feel.
Cardinality growth threatens retention, query cost, or scrape reliability, and nobody knows which labels are worth keeping.
Alerts fire late, fire constantly, or describe symptoms without telling operators what decision is needed.
Logs are available but difficult to correlate across hosts, services, deploys, queues, and customer impact.
Dashboards answer what happened yesterday, but not what to do during a live incident.

How we work

Review Prometheus targets, rules, alert routing, dashboard assumptions, logs, trace coverage, retention, and recent incident history.
Trace two or three user-visible failure modes through the telemetry to see where detail is missing or misleading.
Audit labels, histogram buckets, recording rules, and expensive queries for cardinality and meaning, not only cost.
Rewrite alerts around user impact, actionable ownership, and clear thresholds with explicit quiet conditions.
Deliver an observability contract: what each key signal means, where raw truth lives, and how operators should use it.

Selected work

2025

Quote latency, tail cut from 4 ms to 0.6 ms

A trading platform was losing time in the host path after a kernel update. The NIC was not the bottleneck.

Pinned IRQs, corrected queue affinity, and removed a misleading autoscaling rule from the incident path.

2024

High-load API path made predictable

A customer-facing API had unpredictable tail latency whenever batch jobs and live traffic overlapped.

Separated queues, capped expensive work, documented overload behavior, and reduced manual intervention.

Related field notes

observability: When one Prometheus recording rule hid the regression (7 min)
linux: Reading service logs across hosts without panicking (8 min)
observability: How to read p99 latency during a partial outage (9 min)

A Prometheus observability audit begins with skepticism toward every clean graph. Aggregation is necessary, but it can become a lie when nobody owns what the number means. A dashboard may show stable p95 while a low-volume endpoint burns a small but important set of customers. A recording rule may make queries cheap while hiding the labels needed for diagnosis. An alert may be mathematically correct and operationally useless.
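To make the stable-p95 point concrete, here is a minimal Python sketch. The endpoint names and latencies are invented for illustration, and the nearest-rank percentile is a simplification, not how Prometheus estimates quantiles from histogram buckets.

```python
def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    # Integer ceiling of q * n / 100, avoiding floating-point rounding.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds; not taken from any real system.
latencies_ms = {
    "/search": [20] * 970,   # high-volume endpoint, healthy
    "/export": [900] * 30,   # low-volume endpoint, badly degraded
}

# The aggregate view: every request in one pool, as a service-wide panel would show it.
all_requests = [v for values in latencies_ms.values() for v in values]
aggregate_p95 = percentile(all_requests, 95)

# The per-endpoint view: the same data with the endpoint label kept.
per_endpoint_p95 = {ep: percentile(v, 95) for ep, v in latencies_ms.items()}

print("aggregate p95:", aggregate_p95)
print("per-endpoint p95:", per_endpoint_p95)
```

With these numbers the aggregate p95 is 20 ms, because the degraded endpoint carries only 3% of requests, while the per-endpoint p95 for /export is 900 ms. The aggregate is arithmetically right and still says nothing about the customers on the slow path.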

We look at telemetry as part of the production system. Prometheus scrape intervals, histogram buckets, relabeling, recording rules, retention, alert routing, log formats, trace sampling, and dashboard layouts all shape incident response. The goal is not more data. The goal is the right escape hatches: enough raw detail to investigate, enough aggregation to operate, and enough documentation to know when a panel can be trusted.

Cardinality is treated as both a cost problem and a meaning problem. Removing every label makes metrics cheap and blind. Keeping every label makes systems expensive and fragile. We identify labels that carry diagnostic value, labels that create unbounded growth, and places where exemplars, logs, or traces are a better home for detail than a metric series. The recommendation is practical because it is tied to incidents the team actually sees.
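One rough way to reason about the cost side is to count distinct values per label and see how many series remain if a label is dropped. The sketch below is an assumption-laden illustration in Python: the label sets are invented, and the helper names `label_cardinality` and `series_count_without` are hypothetical, not part of any Prometheus tooling.

```python
from collections import defaultdict

def label_cardinality(series):
    """Count distinct values seen for each label name."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}

def series_count_without(series, dropped):
    """Number of distinct series left if one label is aggregated away."""
    remaining = {
        tuple(sorted((k, v) for k, v in labels.items() if k != dropped))
        for labels in series
    }
    return len(remaining)

# Hypothetical label sets for one metric: 200 users x 2 methods = 400 series.
series = []
for user in range(200):
    for method in ("GET", "POST"):
        series.append({"method": method, "status": "200", "user_id": f"u{user}"})

print(label_cardinality(series))
print("series without user_id:", series_count_without(series, "user_id"))
print("series without method:", series_count_without(series, "method"))
```

In this made-up data, dropping `user_id` collapses 400 series to 2, while dropping `method` still leaves 200. The count only answers the cost half of the question; whether `user_id` carries diagnostic value, or belongs in logs or traces instead, is a judgment the numbers cannot make.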

We also review alerts with an operator’s patience in mind. A page should imply a decision: rollback, failover, capacity action, dependency escalation, or investigation with a specific starting point. If an alert only says something is odd, it may belong on a dashboard, not in the middle of the night. After the audit, teams should know which signals represent user impact, which signals represent causes, and which signals are merely consequences.
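The idea that a page should imply a decision can be checked mechanically. The Python sketch below assumes alert rules have been loaded as dictionaries with `labels` and `annotations`, as in a Prometheus rule file. The `severity: page` value, the `owner` label, and the `decision` and `runbook_url` annotations are hypothetical team conventions chosen for this example, not fields Prometheus requires.

```python
# Decisions a page is allowed to imply; an assumed vocabulary for this sketch.
ALLOWED_DECISIONS = {
    "rollback", "failover", "capacity", "escalate-dependency", "investigate",
}

def contract_violations(rule):
    """Return the reasons a paging alert fails the (hypothetical) contract."""
    labels = rule.get("labels", {})
    annotations = rule.get("annotations", {})
    if labels.get("severity") != "page":
        return []  # Non-paging alerts are out of scope for this check.
    problems = []
    if annotations.get("decision") not in ALLOWED_DECISIONS:
        problems.append("no recognised decision")
    if not labels.get("owner"):
        problems.append("no owner")
    if not annotations.get("runbook_url"):
        problems.append("no runbook")
    return problems

# Invented example rules.
good = {
    "alert": "CheckoutErrorBudgetBurn",
    "labels": {"severity": "page", "owner": "payments"},
    "annotations": {
        "decision": "rollback",
        "runbook_url": "https://example.invalid/runbooks/checkout",
    },
}
vague = {
    "alert": "SomethingOdd",
    "labels": {"severity": "page"},
    "annotations": {},
}

for rule in (good, vague):
    print(rule["alert"], contract_violations(rule))
```

A check like this cannot tell whether the named decision is the right one. It only catches pages that name none, which is usually where the review conversation starts.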