Why can p99 latency be high when averages look normal?

Averages mix fast and slow cohorts. A small but important path can be badly degraded while the service-wide average remains acceptable.

What should I split p99 latency by first?

Start with the user-visible route or endpoint, then split by caller, tenant, region, status, queue, and host depending on where the symptom appears.

Is p99 always reliable during an outage?

No. Percentiles depend on sample size, buckets, scrape windows, and whether canceled requests are recorded. Use p99 to find examples and then confirm with logs, traces, and saturation signals.

How to read p99 latency during a partial outage

P99 latency troubleshooting starts with an uncomfortable fact: a partial outage can be real while the headline dashboard looks normal. Average latency is calm. Error rate is low. The global p95 is acceptable. Support is still receiving screenshots from users who waited long enough to abandon the product. The dashboard is not lying; it is answering a question that is too broad.

The first move is to name the user-visible symptom. Which journey is slow: login, checkout, search, export, dashboard load, API write, webhook delivery? P99 for the whole service is a blunt instrument. P99 for the affected journey is evidence. If the slow cohort is under one percent of traffic, even the aggregate p99 can sit entirely on healthy requests, and an aggregate panel makes real pain look like noise.
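To make the dilution concrete, here is a minimal sketch with synthetic numbers. The traffic mix, the latencies, and the `percentile` helper are all assumptions chosen for illustration, not measurements from a real service. It compares the service-wide mean and p99 with the p99 of the affected journey alone.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q percent
    of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Assumed, synthetic traffic: 10,000 requests, 50 of them (0.5%) on a
# degraded checkout path.
healthy = [100] * 9950    # milliseconds
checkout = [4000] * 50    # milliseconds

everything = healthy + checkout
global_mean = sum(everything) / len(everything)
global_p99 = percentile(everything, 99)
checkout_p99 = percentile(checkout, 99)

print(global_mean, global_p99, checkout_p99)  # 119.5 100 4000
```

In this made-up mix the global p99 stays at 100 ms because the slow cohort is smaller than the one percent of samples that p99 ignores, while every checkout request takes four seconds.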

1. Split by endpoint or route class. Do not start with host CPU. Start with the product path. Compare latency for interactive endpoints, background callbacks, admin reports, and health checks. Health checks often stay fast because they avoid the slow dependency. A green health check next to a slow checkout path is not a contradiction; it is a clue.

2. Split by caller, tenant, region, and device shape. Partial outages often follow boundaries: one large tenant, one API client with unusual retries, one region, one mobile app version, one report type, or one customer using older data. If labels are too expensive for the primary dashboard, use logs or traces to sample the affected path directly.

3. Compare successful slow requests with failed requests. A timeout is not the only failure. Slow success can be worse because it consumes workers, holds locks, and encourages clients to retry while the server is still doing the original work. Track latency, timeout count, cancellation count, and client disconnects together. Otherwise the system appears successful while it is wasting capacity.

4. Look at queue age, not only queue depth. Tail latency often enters the system before the request handler starts. A short queue with old work means the system is stuck behind a small number of expensive tasks. A large queue with fresh work means admission is overwhelmed. Both can produce p99 pain, but they require different mitigation.
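The distinction between depth and age can be sketched as a small classifier. The `QueueSnapshot` type, the thresholds, and the labels are hypothetical; real limits depend on the queue and the work it carries.

```python
from dataclasses import dataclass

@dataclass
class QueueSnapshot:
    depth: int                  # number of items waiting
    oldest_age_seconds: float   # age of the oldest waiting item

def classify_queue(snapshot, depth_limit=1000, age_limit_seconds=30.0):
    """Label a queue by combining depth and age. Thresholds are assumed."""
    deep = snapshot.depth > depth_limit
    old = snapshot.oldest_age_seconds > age_limit_seconds
    if old and not deep:
        return "short queue, old work: stuck behind expensive tasks"
    if deep and not old:
        return "large queue, fresh work: admission overwhelmed"
    if deep and old:
        return "large queue, old work: backlog is both big and stale"
    return "healthy"

print(classify_queue(QueueSnapshot(depth=12, oldest_age_seconds=240.0)))
print(classify_queue(QueueSnapshot(depth=50000, oldest_age_seconds=2.0)))
```

A depth-only alert would stay silent on the first snapshot, which is exactly the case where a handful of expensive tasks is holding everything else up.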

5. Check retry amplification. During partial outages, clients and workers may multiply the slow path. Look for repeated request IDs, identical payloads, synchronized retry intervals, connection pool churn, and rising request volume without rising user traffic. If retries are amplifying load, the fastest mitigation may be backoff, shedding, or disabling the caller before optimizing the dependency.
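One rough way to quantify amplification is to count attempts per unique request. The sketch below assumes that clients reuse the same request ID (or idempotency key) across retries, which is not true of every system; the function name and the sample IDs are illustrative.

```python
from collections import Counter

def retry_amplification(request_ids):
    """Return (attempts per unique request, IDs seen more than once).

    Assumes a retried request carries the same ID as the original attempt.
    """
    counts = Counter(request_ids)
    if not counts:
        return 1.0, {}
    factor = len(request_ids) / len(counts)
    repeated = {rid: n for rid, n in counts.items() if n > 1}
    return factor, repeated

# Hypothetical request IDs pulled from logs for the incident window.
seen = ["a", "a", "a", "b", "c", "c"]
factor, repeated = retry_amplification(seen)
print(factor, repeated)  # 2.0 {'a': 3, 'c': 2}
```

A factor near 1.0 suggests the extra volume is real user traffic; a factor well above it suggests the system is doing the same work several times.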

6. Separate dependency latency from local saturation. A request may be slow because the database is slow, because the service is waiting for a connection, because the host is CPU throttled, because a lock is contended, or because the process is stuck behind garbage collection. P99 is the symptom. Saturation tells you where work waited.

7. Use raw histograms when recording rules are suspicious. Recording rules are useful until they flatten the boundary you need. If the aggregate series hides endpoint, status, queue, or caller labels, go back to raw buckets for the incident window. Compare the last known healthy period with the current period. The question is not only how high p99 is, but which distribution changed shape.
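As a sketch of what going back to raw buckets looks like, the function below estimates a quantile from cumulative histogram buckets using linear interpolation inside the bucket, a common convention for bucketed histograms. The bucket bounds and counts are invented for illustration, and the function assumes finite upper bounds sorted in ascending order.

```python
def bucket_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) buckets.

    Assumes buckets are sorted by upper bound, counts are cumulative,
    and the lowest bucket starts at zero.
    """
    total = buckets[-1][1]
    if total == 0:
        return None
    target = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= target:
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (target - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return buckets[-1][0]

# Hypothetical buckets (seconds, cumulative count) for one endpoint.
healthy = [(0.1, 900), (0.5, 990), (1.0, 1000), (5.0, 1000)]
incident = [(0.1, 900), (0.5, 950), (1.0, 960), (5.0, 1000)]

print(bucket_quantile(0.99, healthy))   # about 0.5
print(bucket_quantile(0.99, incident))  # about 4.0
```

In these invented numbers the fastest bucket is unchanged between the two periods. What moved is the mass between 0.5 and 5 seconds, which is the "distribution changed shape" signal. Note also that the estimate can never be more precise than the bucket boundaries allow.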

8. Align metrics with logs and traces on a narrow timeline. Pick a five- or ten-minute window around the first user-visible symptom. Pull example request IDs from logs. Check whether traces show application time, database time, queue wait, DNS, TLS, proxy wait, or downstream latency. If traces are missing for the affected path, that missing coverage is itself part of the diagnosis.
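Pulling example request IDs for a narrow window can be as simple as the filter below. The record fields (`request_id`, `timestamp`, `duration_ms`), the two-second slowness threshold, and the sample data are assumptions; a real log pipeline will have its own schema and query language.

```python
from datetime import datetime, timedelta

def slow_request_ids(records, first_symptom,
                     window=timedelta(minutes=5), slow_ms=2000):
    """Return IDs of slow requests within +/- window of the first symptom."""
    start, end = first_symptom - window, first_symptom + window
    return [
        r["request_id"]
        for r in records
        if start <= r["timestamp"] <= end and r["duration_ms"] >= slow_ms
    ]

# Hypothetical parsed log records.
symptom = datetime(2024, 1, 1, 12, 0)
records = [
    {"request_id": "a", "timestamp": datetime(2024, 1, 1, 11, 58), "duration_ms": 3500},
    {"request_id": "b", "timestamp": datetime(2024, 1, 1, 11, 59), "duration_ms": 120},
    {"request_id": "c", "timestamp": datetime(2024, 1, 1, 12, 30), "duration_ms": 4000},
    {"request_id": "d", "timestamp": datetime(2024, 1, 1, 12, 4), "duration_ms": 2000},
]
examples = slow_request_ids(records, symptom)
print(examples)  # ['a', 'd']
```

The resulting IDs are the ones worth looking up in the tracing system; request "c" is slow but falls outside the window, so it may belong to a different event.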

9. Avoid percentile theater. P99 is not a magic truth. It depends on sample size, bucket boundaries, scrape interval, client timeouts, and whether canceled requests are measured. During low traffic, p99 is jumpy: with fewer than about a hundred samples in a window, a single slow request can set the value. During very high traffic, a cohort smaller than one percent of requests can be hidden entirely. Use p99 as a pointer to examples, not as the only evidence.
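The sample-size dependence is easy to see by counting how many samples rank above the p99 position. This sketch assumes a nearest-rank percentile definition; interpolating estimators behave somewhat differently, but the intuition carries over.

```python
import math

def samples_above_p99(sample_count):
    """Number of samples ranked above the nearest-rank p99 position."""
    rank = max(1, math.ceil(99 * sample_count / 100))
    return sample_count - rank

for n in (50, 100, 1000, 100000):
    print(n, samples_above_p99(n))
# 50 0
# 100 1
# 1000 10
# 100000 1000
```

With 50 samples, nothing ranks above p99, so the reported value is simply the slowest request in the window. With 100,000 samples, up to 1,000 requests can be arbitrarily slow without moving p99 at all.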

10. Choose mitigation based on where waiting enters. If waiting is in a queue, pause low-priority producers or increase workers carefully. If waiting is in a database pool, reduce concurrency or isolate expensive work. If waiting is in a downstream API, add timeout discipline or degrade the feature. If waiting is caused by retry storms, reduce retries before adding capacity.

11. Verify recovery with cohort-specific panels. Do not declare success because the global p99 dropped. Check the affected endpoint, tenant, region, queue age, retry rate, and error budget. Recovery from a partial outage means the painful cohort is healthy, not merely diluted by normal traffic.

The practical habit is to treat p99 latency as a door, not a destination. Walk through it into endpoint splits, host splits, queue age, retries, saturation, logs, and traces. The incident becomes smaller when the team can say where the delay entered the system and which users were harmed. That is the difference between staring at a percentile and troubleshooting production.

12. Check the client timeout boundary. If clients give up at two seconds and the server records successful responses at three seconds, the server-side p99 will understate user pain. Canceled requests, broken pipes, and downstream work that continues after disconnect all matter. Tail latency should be read from both sides of the connection whenever possible.
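A small sketch of reading the same requests from the client side of the connection. The two-second timeout comes from the example above; the durations and the function name are hypothetical, and the model assumes the server keeps working after the client disconnects.

```python
def client_view(server_durations_s, client_timeout_s=2.0):
    """Summarize how server-recorded durations look to a client that gives up
    at client_timeout_s. Assumes the server does not cancel abandoned work."""
    abandoned = [d for d in server_durations_s if d > client_timeout_s]
    return {
        "abandoned": len(abandoned),
        "abandoned_ratio": len(abandoned) / len(server_durations_s),
        # Server time spent after the client had already given up.
        "wasted_server_seconds": sum(d - client_timeout_s for d in abandoned),
    }

# Hypothetical server-side durations, all recorded as successful responses.
summary = client_view([0.2, 0.3, 3.0, 0.4, 2.5])
print(summary)
# {'abandoned': 2, 'abandoned_ratio': 0.4, 'wasted_server_seconds': 1.5}
```

In this example the server would report five successes, while the client experienced two timeouts and the server spent 1.5 seconds on responses nobody was waiting for.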

13. Inspect host-level outliers. A partial outage can be one bad node behind a load balancer, one noisy neighbor, one hot shard, one container with CPU throttling, or one process with a large heap. Split latency by instance and compare request count, saturation, garbage collection, disk wait, and network retransmits. If one host is responsible for the tail, global service metrics will dilute the evidence.
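Splitting by instance can be sketched as grouping samples per host and flagging hosts whose p99 is far above the fleet's median p99. The host names, latencies, and the 3x ratio are assumptions for illustration; a real fleet needs a threshold tuned to its normal spread.

```python
import math
from collections import defaultdict
from statistics import median

def p99(values):
    """Nearest-rank p99 of a list of samples."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(99 * len(ordered) / 100)) - 1]

def outlier_hosts(samples, ratio=3.0):
    """Return {host: p99} for hosts whose p99 exceeds ratio x the median host p99."""
    by_host = defaultdict(list)
    for host, latency_ms in samples:
        by_host[host].append(latency_ms)
    host_p99 = {host: p99(values) for host, values in by_host.items()}
    baseline = median(host_p99.values())
    return {host: value for host, value in host_p99.items() if value > ratio * baseline}

# Hypothetical (host, latency in ms) samples.
samples = (
    [("host-a", 100)] * 10
    + [("host-b", 110)] * 10
    + [("host-c", 100)] * 9 + [("host-c", 2000)]
)
flagged = outlier_hosts(samples)
print(flagged)  # {'host-c': 2000}
```

Request count per host matters as much as the latency value: with only ten samples per host, as here, each host's p99 is just its slowest request, so a flagged host is a lead to investigate rather than a verdict.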

14. Look for lock and pool contention. Many p99 incidents are waiting problems disguised as compute problems. Database connection pools, thread pools, worker pools, mutexes, and per-tenant limits can all create a long tail while average utilization stays modest. The important question is not whether the service is busy; it is whether the affected work is waiting behind the wrong gate.

15. After mitigation, keep the widened view for a while. Partial outages often return when retries resume, caches expire, or background jobs catch up. Keep the cohort-specific panels open through recovery and compare them with the global graphs. If the original painful cohort is no longer visible in the dashboard, add that view permanently before the incident fades.

16. Be careful with dashboards that drop timeout samples. Some client libraries report only completed requests, while the worst user experiences disappear as cancellations. Compare server-side measurements with load balancer logs, browser timings, synthetic checks, and client telemetry when available. Tail latency work is about user waiting time, not only handler duration.

17. Preserve a few concrete examples for the postmortem. Keep request IDs, tenant IDs if safe, timestamps, trace links, log excerpts, and the before-and-after panels that proved the fix. Partial outages are easy to rewrite as vague slowness after the fact. Examples keep the team honest and turn the next p99 investigation into a shorter one.

The durable fix is usually a new view, not a prettier average. Add the endpoint, cohort, timeout, retry, and saturation panels that would have made the partial outage obvious in the first ten minutes. If the next responder has to rediscover the same split by hand, the incident produced knowledge but the system did not keep it.

FAQ