The API was fast most of the day and confusing under pressure. Batch jobs, retries, and live user requests shared too many resources, so a harmless report could become part of the interactive latency path. Nothing crashed. The service simply became slow in a way that was hard to explain to customers.
We mapped classes of work, queue depth, database access patterns, retry behavior, and the alerts operators used during busy periods. The main issue was not raw capacity. It was missing priority. The system accepted too much work, then made live requests wait behind jobs that could have been delayed or rejected early.
We separated the queues by class of work, capped expensive paths, added cheap early rejection for non-critical work, and wrote down the overload behavior that product teams could support. Dashboards changed from host averages to policy signals: which queue was protected, what was being shed, and whether users were still inside the agreed latency budget.
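As a rough illustration of how those pieces fit together, here is a minimal sketch in Python. It is not the code from the engagement. The class names (`live`, `retry`, `batch`), the depth limits, the `Admission` type, and the choice to treat batch work as the capped "expensive path" are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical work classes; the real system's classes are not named in the text.
LIVE, RETRY, BATCH = "live", "retry", "batch"


@dataclass
class Admission:
    # Assumed per-class queue depth limits, for illustration only.
    limits: dict = field(default_factory=lambda: {LIVE: 100, RETRY: 20, BATCH: 10})
    # Assumed cap on concurrently running expensive (here: batch) jobs.
    expensive_cap: int = 2

    def __post_init__(self):
        # One queue per class, so batch backlog never sits in front of live work.
        self.queues = {kind: deque() for kind in self.limits}
        self.expensive_running = 0
        # Count of rejected jobs per class: a "what is being shed" signal.
        self.shed = {kind: 0 for kind in self.limits}

    def submit(self, kind, job):
        """Admit a job, or reject it cheaply before any real work is done."""
        queue = self.queues[kind]
        if len(queue) >= self.limits[kind]:
            self.shed[kind] += 1
            return False
        queue.append(job)
        return True

    def next_job(self):
        """Return (kind, job) in strict priority order, or None if nothing can run."""
        for kind in (LIVE, RETRY, BATCH):
            queue = self.queues[kind]
            if not queue:
                continue
            if kind == BATCH and self.expensive_running >= self.expensive_cap:
                continue  # expensive path is at its cap; leave the job queued
            job = queue.popleft()
            if kind == BATCH:
                self.expensive_running += 1
            return kind, job
        return None

    def done(self, kind):
        """Release an expensive slot when a batch job finishes."""
        if kind == BATCH and self.expensive_running > 0:
            self.expensive_running -= 1
```

In this sketch a full batch queue costs the caller one length check and a counter increment, which is the sense in which rejection is cheap. The `shed` counters and per-class queue lengths are the kind of values a policy dashboard could plot in place of host averages.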