The first clue was not an alert. It was a person saying, "It feels slower after the deploy." That sentence is not evidence, but it is often where evidence starts.

The dashboard disagreed. The p95 looked stable. Error rate was boring. CPU had moved a little but nothing dramatic. The new release had passed its canary and the team was already looking at application code because the graphs said the platform was fine.

The problem was a recording rule. It had been added for a good reason: the raw query was expensive, people used it often, and Prometheus deserved a break. The rule precomputed latency by service and endpoint. Clean labels, fast panels, no drama.

Except the service had changed shape. A new path carried a smaller percentage of traffic but much more expensive work. The raw histogram still had the detail. The recording rule flattened it into a friendly service-level number. The regression was real, but the default dashboard had sanded off the edge that would have shown it.

This is the part where people like to blame tools. The tool was fine. The rule was doing what it said. The failure was that nobody owned the meaning of the number anymore. It had started as an optimization and slowly became truth.

We went back to the raw metrics and split the view by the labels that still mattered: endpoint class, queue, status, and caller. The slow path appeared immediately. It was not huge in volume, so it barely moved the service aggregate. But for the customers who hit it, the regression was obvious.

The fix was two-part. First, repair the product issue. In this case, a database lookup moved into a path where the cache miss rate was worse than expected. Second, repair the observability contract. The recording rule stayed, but it stopped pretending to answer questions it could not answer. The dashboard got a small set of raw escape hatches for tail work, and the alert moved closer to user-visible behavior.

The lesson is not "never aggregate." You have to aggregate or nobody will read the screen. The lesson is to keep a way back to the messy truth. A clean graph is useful only while it still has a path to the detail it hides.