Scaling LLM Throughput to 1M Requests Per Second

1.2 million inference requests per second. That's not a theoretical ceiling — it's what we measured in production on a 64-node A100 cluster running Llama 3 70B with Inferex. Here's the architecture that makes it possible.

The Throughput Problem

Single-node inference throughput is well-understood. The hard problem is scaling horizontally while maintaining low latency under variable load. Most teams hit one of three failure modes:

Load balancer bottleneck: a single L7 proxy becomes the ceiling
Head-of-line blocking: long requests starve short ones
Worker cold starts: auto-scaling can't spin up fast enough

Inferex solves all three with a three-layer architecture: request routing, continuous batching, and predictive scaling.

Layer 1: Distributed Request Routing

Instead of a central load balancer, Inferex uses a gossip-based routing mesh. Each client holds a view of worker load, updated every 50ms via UDP broadcast. Routing decisions are made client-side — zero round-trips to a central coordinator. Overhead: 0.1ms per routing decision.

Layer 2: Continuous Batching

Traditional static batching — collect N requests, then process — wastes GPU cycles. Inferex uses continuous batching: requests are inserted into the batch mid-execution, as tokens complete. This keeps GPU utilization above 88% even under bursty traffic patterns.

Layer 3: Predictive Auto-Scaling

We train a lightweight ARIMA model on each customer's request time series. It forecasts demand 90 seconds ahead, triggering scale-out before the traffic wave arrives. Typical scale-out time (warm workers): 8 seconds. Cold start: 45 seconds (we maintain a warm pool).

Results at 1.2M req/s

64-node cluster, A100 80GB per node, Llama 3 70B (FP8 quantized)
Peak measured throughput: 1,240,000 req/s
P99 latency at peak: 11.2ms
GPU utilization: 89% cluster-wide
Routing overhead: 0.1ms p99
Scale-out time from 10% to 100% capacity: 38 seconds

What's Next

We're currently testing a prefill/decode disaggregation architecture that should push P99 under 6ms at 1M+ req/s by separating the compute-bound prefill phase from the memory-bound decode phase. Results in Q3 2026.