Rate Limit Impact Modeler

What this skill does

Given an API's rate limit and a workload's offered load, computes expected throughput, mean queue depth, and 95th-percentile wait time using M/M/1 (or M/M/c) approximations. Output is a one-page markdown brief with numbers and a recommendation.

Inputs

rate_limit_rps: the steady-state allowed request rate (requests per second).
burst_capacity: token-bucket burst size (defaults to rate_limit_rps).
offered_load_rps: the workload's request rate (mean).
Optional concurrency: number of parallel callers (defaults to 1, which gives M/M/1).
Optional mean_service_time_ms: per-request server time; defaults to 1000 / rate_limit_rps.
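The inputs and their defaults can be sketched as a small container. This is a minimal illustration, not the skill's actual code: the class name ModelInputs and the validation checks are assumptions, and only the field names and default rules come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelInputs:
    """Hypothetical container; field names follow the inputs list."""
    rate_limit_rps: float
    offered_load_rps: float
    burst_capacity: Optional[float] = None         # defaults to rate_limit_rps
    concurrency: int = 1                           # 1 means M/M/1
    mean_service_time_ms: Optional[float] = None   # defaults to 1000 / rate_limit_rps

    def __post_init__(self) -> None:
        # Validation rules are an assumption; the source does not state them.
        if self.rate_limit_rps <= 0:
            raise ValueError("rate_limit_rps must be positive")
        if self.offered_load_rps < 0:
            raise ValueError("offered_load_rps must be non-negative")
        if self.concurrency < 1:
            raise ValueError("concurrency must be at least 1")
        # Fill in the documented defaults.
        if self.burst_capacity is None:
            self.burst_capacity = self.rate_limit_rps
        if self.mean_service_time_ms is None:
            self.mean_service_time_ms = 1000.0 / self.rate_limit_rps
```

For example, ModelInputs(rate_limit_rps=10, offered_load_rps=6) resolves burst_capacity to 10 and mean_service_time_ms to 100.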

Steps

Compute utilization rho = offered_load_rps / (concurrency * rate_limit_rps).
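If rho >= 1, the queue grows without bound: emit a "saturation" brief that reports the time until the burst bucket is exhausted, burst_capacity / (offered_load_rps - rate_limit_rps). If offered_load_rps equals rate_limit_rps the denominator is zero; report the time as unbounded.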
Else apply M/M/1 (or M/M/c via Erlang-C):
- mean queue length L_q = rho^2 / (1 - rho) for M/M/1.
- mean wait time W_q = L_q / offered_load_rps.
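- p95 wait time approximation W_q95 = W_q * 3 (rule of thumb: the 95th percentile of an exponential distribution is ln(20), about 3.0, times its mean). The M/M/1 queueing delay is exponential only for requests that actually wait, so the exact value is ln(rho / 0.05) / (rate_limit_rps - offered_load_rps) when rho > 0.05, and 0 otherwise.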
Compute throughput: effective_throughput_rps = min(offered_load_rps, concurrency * rate_limit_rps).
Compute drop probability if a finite buffer is assumed (default buffer = burst_capacity): P_drop = (1 - rho) * rho^N / (1 - rho^(N+1)) where N = buffer. At rho = 1 the expression is 0/0; use its limit, P_drop = 1 / (N + 1).
Recommend: if rho > 0.8, suggest scaling concurrency or negotiating the limit; if 0.5 < rho <= 0.8, suggest jittered backoff; if rho <= 0.5, no change is recommended.
Compose a markdown report with input echo, computed values, the recommendation, and a small ASCII chart (rho vs. wait time).
Save to rate-limit-model.md.
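The numeric steps above can be sketched as one function. This is an illustrative sketch under stated assumptions, not the skill's implementation: the names rate_limit_model and erlang_c and the result keys are hypothetical, the M/M/c branch uses the standard Erlang-C formula (which reduces to L_q = rho^2 / (1 - rho) when concurrency is 1), and report composition and file output are left out.

```python
import math
from typing import Optional


def erlang_c(c: int, a: float) -> float:
    """Probability an arrival must wait in M/M/c (offered traffic a = lambda/mu, a < c)."""
    b = 1.0  # Erlang-B blocking probability with 0 servers
    for k in range(1, c + 1):
        b = a * b / (k + a * b)  # standard Erlang-B recurrence
    rho = a / c
    return b / (1.0 - rho * (1.0 - b))


def rate_limit_model(rate_limit_rps: float, offered_load_rps: float,
                     burst_capacity: Optional[float] = None,
                     concurrency: int = 1) -> dict:
    lam, mu, c = offered_load_rps, rate_limit_rps, concurrency
    n = rate_limit_rps if burst_capacity is None else burst_capacity
    rho = lam / (c * mu)
    out = {"rho": rho, "effective_throughput_rps": min(lam, c * mu)}

    # Finite-buffer drop probability with N = buffer = burst_capacity.
    if rho == 1.0:
        out["p_drop"] = 1.0 / (n + 1)  # limit of the formula as rho -> 1
    elif rho > 1.0:
        # Same formula divided through by rho^N, to avoid float overflow.
        out["p_drop"] = (rho - 1.0) / (rho - rho ** (-n))
    else:
        out["p_drop"] = (1.0 - rho) * rho ** n / (1.0 - rho ** (n + 1))

    if rho > 0.8:
        out["recommendation"] = "scale concurrency or negotiate the limit"
    elif rho > 0.5:
        out["recommendation"] = "add jittered backoff"
    else:
        out["recommendation"] = ""  # the steps name no action for rho <= 0.5

    out["saturated"] = rho >= 1.0
    if out["saturated"]:
        # Saturation brief: burst_capacity / (offered_load_rps - rate_limit_rps).
        excess = lam - mu
        out["burst_bucket_time_s"] = n / excess if excess > 0 else math.inf
        return out

    p_wait = erlang_c(c, lam / mu)        # equals rho when c == 1
    l_q = p_wait * rho / (1.0 - rho)      # mean queue length
    w_q = l_q / lam if lam > 0 else 0.0   # Little's law: W_q = L_q / lambda
    out.update(mean_queue_length=l_q, mean_wait_s=w_q,
               p95_wait_s=3.0 * w_q)      # rule-of-thumb p95 from the steps
    return out
```

Calling rate_limit_model(10, 8) gives rho = 0.8, a mean queue length of 3.2, and a mean wait of 0.4 s, which matches working the M/M/1 formulas by hand.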

Output

rate-limit-model.md with: an inputs block, a computed-results table (rho, L_q, W_q, p95, throughput, drop probability), and a recommendation paragraph. Stdout prints rho and effective throughput for quick consumption.

Verification

Plug the inputs into a Monte Carlo simulator (a 10-line Python script using random.expovariate) for 100k synthetic requests and verify that the simulated mean wait time is within 20% of the computed W_q. If the divergence exceeds 20%, the workload is likely non-Poisson; surface a caveat noting that the M/M/1 assumptions do not hold. Sanity-check that effective_throughput_rps <= concurrency * rate_limit_rps.
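A simulator of the kind described might look like the sketch below. It assumes a single-server FIFO queue with exponential inter-arrival and service times (the M/M/1 case only); the function name simulate_mean_wait and the fixed seed are illustrative choices, not part of the skill.

```python
import random


def simulate_mean_wait(offered_load_rps: float, rate_limit_rps: float,
                       n_requests: int = 100_000, seed: int = 0) -> float:
    """Mean time spent waiting in queue (seconds), excluding service time."""
    rng = random.Random(seed)
    arrival = 0.0          # arrival time of the current request
    server_free_at = 0.0   # time the server finishes the previous request
    total_wait = 0.0
    for _ in range(n_requests):
        arrival += rng.expovariate(offered_load_rps)   # Poisson arrivals
        start = max(arrival, server_free_at)           # wait if server is busy
        total_wait += start - arrival
        server_free_at = start + rng.expovariate(rate_limit_rps)
    return total_wait / n_requests


# Compare against the analytic M/M/1 value W_q = rho^2 / (1 - rho) / lambda.
lam, mu = 5.0, 10.0
rho = lam / mu
w_q = rho ** 2 / (1 - rho) / lam
simulated = simulate_mean_wait(lam, mu)
within_tolerance = abs(simulated - w_q) / w_q <= 0.20
print(f"analytic={w_q:.4f}s simulated={simulated:.4f}s ok={within_tolerance}")
```

Near saturation (rho close to 1) the simulated mean converges slowly, so a longer run or several seeds may be needed before treating a divergence as meaningful.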

Edge cases

offered_load_rps == 0: emit a trivial brief stating "no impact, queue empty".
Workloads with long-tail (heavy-tailed) service times: warn that M/M/1 can underestimate p95, potentially by an order of magnitude or more; recommend a simulator instead.
Token-bucket with refill independent of consumption: model as c = 1 server with capacity burst_capacity; document the simplification.
Rate limit applied per user rather than globally: the input should be the per-caller limit, and the report should not aggregate across users.

amitte/rate-limit-impact-modeler