1.2 million inference requests per second. That's not a theoretical ceiling — it's what we measured in production on a 64-node A100 cluster running Llama 3 70B with Inferex. Here's the architecture that makes it possible.
The Throughput Problem
Single-node inference throughput is well-understood. The hard problem is scaling horizontally while maintaining low latency under variable load. Most teams hit one of three failure modes:
- Load balancer bottleneck: a single L7 proxy becomes the ceiling
- Head-of-line blocking: long requests starve short ones
- Worker cold starts: auto-scaling can't spin up fast enough
Inferex solves all three with a three-layer architecture: request routing, continuous batching, and predictive scaling.
Layer 1: Distributed Request Routing
Instead of a central load balancer, Inferex uses a gossip-based routing mesh. Each client holds a view of worker load, updated every 50ms via UDP broadcast. Routing decisions are made client-side — zero round-trips to a central coordinator. Overhead: 0.1ms per routing decision.
Layer 2: Continuous Batching
Traditional static batching — collect N requests, then process — wastes GPU cycles. Inferex uses continuous batching: requests are inserted into the batch mid-execution, as tokens complete. This keeps GPU utilization above 88% even under bursty traffic patterns.
Layer 3: Predictive Auto-Scaling
We train a lightweight ARIMA model on each customer's request time series. It forecasts demand 90 seconds ahead, triggering scale-out before the traffic wave arrives. Typical scale-out time (warm workers): 8 seconds. Cold start: 45 seconds (we maintain a warm pool).
Results at 1.2M req/s
- 64-node cluster, A100 80GB per node, Llama 3 70B (FP8 quantized)
- Peak measured throughput: 1,240,000 req/s
- P99 latency at peak: 11.2ms
- GPU utilization: 89% cluster-wide
- Routing overhead: 0.1ms p99
- Scale-out time from 10% to 100% capacity: 38 seconds
What's Next
We're currently testing a prefill/decode disaggregation architecture that should push P99 under 6ms at 1M+ req/s by separating the compute-bound prefill phase from the memory-bound decode phase. Results in Q3 2026.