Quote latency, tail cut from 4 ms to 0.6 ms

The team asked for a network review because quote tail latency had regressed after a routine kernel update. Packet loss was not visible, CPU averages looked comfortable, and the NIC counters were clean. The problem was in the host path: interrupt handling, queue placement, and a scheduling change had turned a predictable path into one with short but expensive stalls.

The engagement started with measurement rather than tuning. We compared per-queue latency, IRQ placement, softirq time, CPU isolation, and the exact route packets took through the service. The slow tail appeared only when a busy application thread and an unlucky interrupt queue shared a core. Autoscaling metrics hid the issue because they were host-level aggregates, and in aggregate the host looked healthy.
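The core-sharing check described above can be approximated with two snapshots of /proc/interrupts. What follows is a minimal sketch, not the tooling used in the engagement: the function names, the "eth0" name fragment in the usage note, and the idea of passing the application's CPU set by hand are all assumptions made for illustration. Counts in /proc/interrupts are cumulative since boot, so the sketch diffs two snapshots rather than reading one.

```python
def parse_irq_counts(text, name_fragment):
    """Map IRQ label -> {cpu_id: count} for rows whose text contains name_fragment.

    `text` is the content of /proc/interrupts; `name_fragment` is a
    hypothetical filter such as an interface name.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header row looks like "CPU0 CPU1 ..."; CPU ids may be non-contiguous.
    cpus = [int(h[3:]) for h in lines[0].split()]
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep or name_fragment not in rest:
            continue
        counts = rest.split()[:len(cpus)]
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue  # summary rows (for example ERR) carry fewer columns
        table[label.strip()] = dict(zip(cpus, map(int, counts)))
    return table


def irqs_on_app_cores(before, after, name_fragment, app_cpus):
    """Return {cpu: [irq labels]} for matching IRQs that fired on an
    application CPU between the two snapshots."""
    first = parse_irq_counts(before, name_fragment)
    second = parse_irq_counts(after, name_fragment)
    hits = {}
    for irq, counts in second.items():
        for cpu, count in counts.items():
            delta = count - first.get(irq, {}).get(cpu, 0)
            if delta > 0 and cpu in app_cpus:
                hits.setdefault(cpu, []).append(irq)
    return hits
```

In use, one would read /proc/interrupts twice a few seconds apart under load and call something like irqs_on_app_cores(before, after, "eth0", {2, 3}), where {2, 3} stands in for whichever cores the service threads are pinned to. A non-empty result marks cores where receive interrupts and application work compete, which is the condition under which the slow tail appeared.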

The fix was small and documented. IRQs were pinned to the right cores, receive queues were matched to the service layout, and the autoscaling rule that pointed responders at the wrong layer was removed from the runbook. The platform kept the kernel update and regained the expected latency envelope.
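For readers who want to see what pinning looks like mechanically, here is a hedged sketch. Linux exposes per-IRQ affinity through /proc/irq/<n>/smp_affinity_list; everything else below is an assumption for illustration, including the function names, the round-robin assignment, and the IRQ and CPU numbers in the usage note. The write-up does not state which cores were chosen or how the mapping was derived.

```python
from pathlib import Path


def plan_irq_affinity(queue_irqs, irq_cpus):
    """Build a list of (path, value) writes that assign each queue IRQ
    to one CPU from irq_cpus, round-robin (an assumed policy)."""
    if not irq_cpus:
        raise ValueError("need at least one CPU reserved for interrupts")
    plan = []
    for i, irq in enumerate(sorted(queue_irqs)):
        cpu = irq_cpus[i % len(irq_cpus)]
        plan.append((f"/proc/irq/{irq}/smp_affinity_list", str(cpu)))
    return plan


def apply_plan(plan, dry_run=True):
    """Print the plan, or write it when dry_run is False (requires root)."""
    for path, value in plan:
        if dry_run:
            print(f"would write {value} to {path}")
        else:
            Path(path).write_text(value)
```

Calling apply_plan(plan_irq_affinity([24, 25, 26], [0, 1])) prints the intended writes without touching the system. On hosts that run irqbalance, the daemon may later rewrite these values unless it is configured to leave the IRQs alone, so any such change belongs in the same runbook as the rest of the fix.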