Load shedding design starts with a blunt rule: an overloaded API will reject work somehow. It can reject work deliberately with a cheap, clear response, or accidentally with timeouts, memory pressure, connection pool exhaustion, and retry storms. The product difference is enormous. Intentional load shedding protects the important path. Accidental load shedding lets the slowest bottleneck decide who suffers.
The first pattern is admission control at the edge. Decide whether the system should accept a request before it allocates expensive resources. That decision can use global concurrency, per-tenant budgets, endpoint class, queue age, downstream health, or current error budget. Admission control should be cheaper than the work it rejects. If rejection requires the same database lookup as success, it is too late.
The second pattern is separating classes of work. Interactive requests, background jobs, exports, webhooks, retries, and operator actions should not share one undifferentiated queue. Under pressure, a low-value report should not block login. A webhook replay should not starve checkout. Queue separation makes overload policy visible and gives operators a lever that is more precise than restarting the service.
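One way to picture queue separation is a bounded queue per work class, drained in an explicit priority order. The sketch below is an assumption-laden illustration: the `ClassQueues` name, the class labels, and the per-class limits are hypothetical.

```python
from collections import deque


class ClassQueues:
    """One bounded FIFO queue per work class."""

    def __init__(self, limits):
        # limits maps a class name to its maximum queue length (hypothetical values).
        self._limits = dict(limits)
        self._queues = {name: deque() for name in self._limits}

    def offer(self, work_class, item):
        """Enqueue the item, or return False if that class is already full."""
        queue = self._queues[work_class]
        if len(queue) >= self._limits[work_class]:
            return False
        queue.append(item)
        return True

    def take(self, priority_order):
        """Pop from the first non-empty class in the given order, else None."""
        for name in priority_order:
            queue = self._queues[name]
            if queue:
                return name, queue.popleft()
        return None
```

Because each class has its own bound, a full export queue rejects further exports without affecting whether a login request can be queued. The drain order then becomes a lever an operator can change during an incident.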
The third pattern is cheap rejection. A good 503 is fast, explicit, and retry-aware. It avoids expensive dependencies, includes a `Retry-After` hint when retry is useful, and has a response body that clients can classify. A bad 503 arrives after thirty seconds of work, omits retry guidance, and causes every client to retry immediately. That is not load shedding; it is load multiplication.
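A cheap rejection can be as small as the sketch below, which builds the response from constants and touches no dependency. The function name `overload_response` and the JSON field names are assumptions for illustration. `Retry-After` is given in whole seconds, which is one of the forms the header allows.

```python
import json
from typing import Optional


def overload_response(reason: str, retry_after_s: Optional[int] = None):
    """Build a 503 as (status, headers, body) without calling any dependency."""
    headers = {"Content-Type": "application/json"}
    if retry_after_s is not None:
        # Only advertise a retry hint when retrying is actually useful.
        headers["Retry-After"] = str(retry_after_s)
    body = json.dumps({
        "error": "overloaded",                  # stable code clients can classify
        "reason": reason,                       # e.g. which policy fired
        "retryable": retry_after_s is not None,
    })
    return 503, headers, body
```

Passing `retry_after_s=None` produces a 503 that tells well-behaved clients not to retry automatically, which suits requests whose result would no longer be useful later.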
Good 503 behavior also respects idempotency. Retrying a GET or an idempotent PUT may be reasonable with backoff. Retrying a payment, account mutation, or job submission without an idempotency key can create duplicate work or data damage. The API should document which failures are safe to retry and clients should treat that contract as part of the interface, not as an implementation detail.
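A client-side version of that contract might look like the following sketch. The method set matches the HTTP methods defined as idempotent in the HTTP semantics specification. The `is_safe_to_retry` helper and its `has_idempotency_key` flag are hypothetical names, and a real API may publish stricter or looser rules per endpoint.

```python
# HTTP methods whose semantics are defined as idempotent.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"})


def is_safe_to_retry(method: str, has_idempotency_key: bool = False) -> bool:
    """Retry automatically only when a repeat cannot duplicate work.

    A POST (payment, job submission) qualifies only if the request carries an
    idempotency key that the server is documented to honor.
    """
    return method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
```

For example, `is_safe_to_retry("POST")` is false, while the same call with `has_idempotency_key=True` is true, assuming the server deduplicates on that key.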
The fourth pattern is brownout mode. Instead of rejecting whole requests, the product can disable expensive optional work: recommendations, exports, analytics panels, preview generation, non-critical enrichment, or synchronous notifications. Brownout mode should be visible to users and operators. Silent feature disappearance creates support noise; explicit degradation preserves trust.
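The sketch below illustrates one possible shape for a brownout: the core response is still served, the optional call is skipped, and the response says so explicitly. The function `build_product_page`, the `degraded_features` field, and the recommendations example are assumptions chosen for illustration.

```python
from typing import Callable


def build_product_page(
    product: dict,
    brownout: bool,
    fetch_recommendations: Callable[[dict], list],
) -> dict:
    """Serve the core page; skip optional enrichment while in brownout."""
    page = {"product": product, "degraded_features": []}
    if brownout:
        # Make the degradation explicit so the UI and dashboards can show it.
        page["degraded_features"].append("recommendations")
    else:
        page["recommendations"] = fetch_recommendations(product)
    return page
```

The `degraded_features` list is what keeps the brownout from being silent: a client can render a notice, and an operator can count how many responses were served in degraded form.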
The fifth pattern is protecting downstream dependencies. If the database, search cluster, payment provider, or internal service is impaired, the API should not keep sending it unlimited work just because the front end is still receiving traffic. Circuit breakers, concurrency limits, bulkheads, and queue age thresholds are ways to keep one slow dependency from consuming the whole process.
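Of the mechanisms listed above, a circuit breaker is the easiest to sketch. The version below is deliberately simplified and rests on stated assumptions: the class name, the consecutive-failure threshold, and the fixed cool-down are hypothetical, and once the cool-down expires it lets all callers probe rather than exactly one.

```python
import time


class CircuitBreaker:
    """Stops calling a dependency after repeated consecutive failures."""

    def __init__(self, failure_threshold, reset_after_s, clock=time.monotonic):
        self._threshold = failure_threshold  # hypothetical tuning value
        self._reset_after_s = reset_after_s  # hypothetical cool-down in seconds
        self._clock = clock                  # injectable so tests can fake time
        self._failures = 0
        self._opened_at = None

    def allow(self):
        """Return False while the breaker is open and still cooling down."""
        if self._opened_at is None:
            return True
        # After the cool-down, let calls through again as probes.
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self._threshold:
            # Open the breaker, or restart the cool-down after a failed probe.
            self._opened_at = self._clock()
```

When `allow()` returns false, the handler can answer with a fast rejection or a brownout response instead of holding a worker while the dependency times out.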
The sixth pattern is retry control. Clients should use jittered exponential backoff, respect `Retry-After`, cap attempts, and stop retrying when the user-visible operation is no longer useful. Servers should avoid retrying non-idempotent downstream calls blindly. During overload, bad retries often create more traffic than real users do. Shedding without retry control is only half a design.
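The client-side rules above can be sketched as a single delay function. This is an illustration under assumptions: the name `next_retry_delay`, the default base, cap, and attempt limit are hypothetical, and it uses the "full jitter" variant of exponential backoff, adding the jitter on top of any server-provided `Retry-After` value.

```python
import random


def next_retry_delay(attempt, retry_after_s=None, base_s=0.5, cap_s=30.0,
                     max_attempts=5, rng=random):
    """Return seconds to wait before the next retry, or None to give up.

    `attempt` is the number of attempts that have already failed.
    """
    if attempt >= max_attempts:
        return None  # cap attempts instead of retrying forever
    # Exponential ceiling, capped, with a uniformly random delay below it.
    ceiling = min(cap_s, base_s * (2 ** attempt))
    jitter = rng.uniform(0.0, ceiling)
    if retry_after_s is not None:
        # Never retry earlier than the server asked; keep jitter on top so
        # clients that received the same hint do not all return at once.
        return retry_after_s + jitter
    return jitter
```

A caller should also stop when the user-visible operation is no longer useful, for instance by checking a deadline before sleeping. That check is left out here to keep the sketch short.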
The seventh pattern is fairness. Without fairness, one tenant, job type, or integration can consume the entire shared capacity budget. Rate limits, tenant quotas, priority queues, and reserved capacity can keep the API useful for many users while one path is noisy. Fairness does not mean every request is equal. It means the overload policy matches the product promise.
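One common way to implement tenant quotas is a token bucket per tenant, sketched below. The class names, the uniform rate and burst for every tenant, and the injectable clock are assumptions for illustration. A real policy might give paying tiers larger buckets, which is where the product promise enters.

```python
import time


class TokenBucket:
    """Refills at `rate_per_s` tokens per second, up to `burst` tokens."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self._rate = rate_per_s
        self._burst = burst
        self._clock = clock
        self._tokens = float(burst)
        self._last = clock()

    def try_take(self, cost=1.0):
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self._burst, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class TenantLimiter:
    """Gives every tenant its own bucket so one noisy tenant drains only itself."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self._args = (rate_per_s, burst, clock)
        self._buckets = {}

    def try_take(self, tenant, cost=1.0):
        if tenant not in self._buckets:
            self._buckets[tenant] = TokenBucket(*self._args)
        return self._buckets[tenant].try_take(cost)
```

The `cost` parameter lets an expensive endpoint charge more than one token, which matters when request cost varies widely.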
The eighth pattern is observability of shed traffic. Count accepted, rejected, queued, timed out, canceled, retried, and brownout-served requests separately. Label them by endpoint class and tenant where safe. Operators should see which policy fired and what it protected. If load shedding works but nobody can see it, the next incident review will call it random behavior.
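A minimal in-process sketch of that accounting follows. It is only an illustration: the `ShedMetrics` class and its label names are hypothetical, and a production service would normally export these counts through its existing metrics library. Tenant labels are omitted here because they are only appropriate where cardinality and privacy allow.

```python
from collections import Counter


class ShedMetrics:
    """Counts request outcomes separately, labeled by endpoint class and policy."""

    OUTCOMES = frozenset({
        "accepted", "rejected", "queued", "timed_out",
        "canceled", "retried", "brownout",
    })

    def __init__(self):
        self._counts = Counter()

    def record(self, outcome, endpoint_class, policy="none"):
        if outcome not in self.OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        # `policy` names which rule fired, e.g. "tenant_quota" (hypothetical).
        self._counts[(outcome, endpoint_class, policy)] += 1

    def count(self, outcome, endpoint_class=None):
        """Total for an outcome, optionally restricted to one endpoint class."""
        return sum(
            n for (o, c, _policy), n in self._counts.items()
            if o == outcome and (endpoint_class is None or c == endpoint_class)
        )
```

Recording the policy name alongside the outcome is what lets an incident review say which rule fired and what it protected.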
Bad load shedding usually has the same smell: the system says yes until it collapses. It accepts work it cannot finish, queues it behind unrelated tasks, waits until clients give up, then keeps processing abandoned requests. The graphs show high latency and a low error rate because the server is still trying. Users experience a broken product while the API insists it is being patient.
Good load shedding feels boring. The API rejects early, explains when retry may help, preserves critical paths, keeps queue age bounded, and gives operators clear evidence. It may still disappoint users, but it disappoints them honestly and briefly instead of turning overload into a mystery.
A practical design review should ask: what work can be refused, what work can be delayed, what work must be protected, what response should callers see, how do clients retry, and which dashboard proves the policy fired? If those answers are not written down, the API already has a load shedding design. It is just accidental.
Under high load, performance is not only speed. It is choice. The system chooses who waits, who gets a clear no, which work remains safe, and how recovery begins. Load shedding design is the discipline of making those choices before the spike makes them for you.
Another useful pattern is a priority budget. Reserve some concurrency for login, checkout, health-critical reads, or operator repair actions so less important work cannot spend every slot. The reserve should be visible and tested. A hidden reserve that nobody understands will look like unfairness during an incident, while a documented reserve becomes a product decision the team can defend.
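A reserve like this can be sketched as a slot counter with two limits. The `PriorityBudget` name and its parameters are assumptions for illustration. In this version, ordinary work is admitted only while total usage is below the general limit, and critical work may use every slot.

```python
import threading


class PriorityBudget:
    """Concurrency slots with a documented reserve for critical work."""

    def __init__(self, total_slots, reserved_for_critical):
        if not 0 <= reserved_for_critical <= total_slots:
            raise ValueError("reserve must be between 0 and total_slots")
        self._total = total_slots
        # Ordinary work is admitted only while usage is below this limit.
        self._general_limit = total_slots - reserved_for_critical
        self._in_use = 0
        self._lock = threading.Lock()

    def try_acquire(self, critical):
        limit = self._total if critical else self._general_limit
        with self._lock:
            if self._in_use >= limit:
                return False
            self._in_use += 1
            return True

    def release(self):
        with self._lock:
            self._in_use = max(0, self._in_use - 1)
```

With `PriorityBudget(total_slots=3, reserved_for_critical=1)`, a third ordinary request is refused while a login or repair action can still take the last slot. Keeping the reserve as a named constructor argument makes it easy to surface in configuration and dashboards.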
Load shedding also needs a recovery policy. When pressure falls, the system should not immediately admit every delayed job and recreate the spike. Ramp queues gradually, keep jitter, and let downstream dependencies prove they are healthy before restoring full concurrency. Recovery storms are common when shedding only answers how to say no and never answers how to safely say yes again.
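A gradual ramp can be expressed as a small rule that runs after each health check. The sketch below is one possible policy, not a recommended tuning: the function name, the halving on an unhealthy check, the three-check requirement, and the 1.5x growth factor are all assumptions.

```python
def next_concurrency_limit(current, full, floor, consecutive_healthy,
                           required_healthy=3, growth=1.5):
    """Return the concurrency limit to use after the latest health check.

    consecutive_healthy is the number of health checks passed in a row;
    zero means the most recent check failed.
    """
    if consecutive_healthy == 0:
        # Dependency is still unhealthy: back off, but never below the floor.
        return max(floor, current // 2)
    if consecutive_healthy < required_healthy:
        # Not enough evidence yet: hold the limit steady.
        return current
    # Healthy for long enough: grow in steps, never past full capacity.
    return min(full, max(current + 1, int(current * growth)))
```

Starting from a limit of 10 with a full capacity of 100, three consecutive healthy checks move the limit to 15 rather than straight to 100, so delayed work is readmitted in steps. Jitter on when queued jobs are released would be applied separately.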
Client documentation is part of the design. API consumers should know which status codes are retryable, which require human action, which include a `Retry-After` value, and which operations need idempotency keys. If clients have to reverse-engineer overload behavior from production incidents, they will implement inconsistent retries and the server will inherit those mistakes during the next spike.
Test load shedding before the real event. Run a drill in staging or in a controlled slice of production where one dependency slows down, a queue ages, or a tenant sends excess work. Verify the policy rejects the intended class, protects the critical path, emits clear metrics, and recovers gradually. A load shedding design that has never fired is only a design proposal.
A design should also describe who is allowed to change thresholds. During a spike, raising limits can feel like relief while it pushes damage into a database, queue, or vendor API. Lowering limits can protect the system while angering customers. Put those choices in an operational policy with owners, dashboards, and rollback steps instead of leaving them to the loudest person in chat.
The final pattern is product copy: the wording users and API clients actually see. They should know whether work was rejected, delayed, or accepted for later processing. A vague timeout creates repeat clicks and duplicate jobs. A clear overload response can preserve trust even when the system says no. Under pressure, honest product behavior is part of capacity management.
Load shedding should be reviewed whenever the product adds a new expensive path. A new export, report, webhook, AI enrichment, or tenant tier can change which work deserves protection. Capacity rules age just like code. Revisit the policy before the next launch makes yesterday's safe threshold the next bottleneck.
That review should include real traffic examples, because overload behavior that looks fair in a spreadsheet can still punish the wrong users when request cost varies by endpoint, tenant, or payload size.