The inference serving landscape has fragmented dramatically in 2025. vLLM, TensorRT-LLM, OpenLLM, Triton Inference Server, and now Inferex — engineers are drowning in options with no independent benchmark data. We're publishing ours. Yes, we have an obvious interest in this benchmark. Read it critically.
Methodology
All tests run on a single NVIDIA A100 80GB SXM4 (Ubuntu 22.04, CUDA 12.3, driver 545.23). Models: Llama 3 8B and Llama 3 70B. Workload: synthetic requests with 512-token inputs and 256-token outputs, arriving as a Poisson process at the target request rate. Each benchmark run: 10-minute warmup, then a 30-minute measurement window. P99 latency is reported at 90% of the maximum request rate each system could sustain without unbounded queue growth.
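The arrival process and percentile math above can be sketched in a few lines. This is a minimal, self-contained illustration, not our actual harness: Poisson arrivals come from exponentially distributed inter-arrival gaps, and P99 is a nearest-rank percentile over recorded latencies.

```python
import random

def poisson_arrival_times(rate_rps: float, duration_s: float, seed: int = 0):
    """Arrival timestamps for a Poisson process: inter-arrival gaps
    are exponentially distributed with mean 1/rate."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_rps)
        if t >= duration_s:
            return times
        times.append(t)

def p99(latencies_ms):
    """Nearest-rank 99th-percentile latency."""
    ordered = sorted(latencies_ms)
    idx = max(0, int(0.99 * len(ordered)) - 1)
    return ordered[idx]

# At 420 req/s over 60 s we expect roughly 25,200 arrivals.
arrivals = poisson_arrival_times(rate_rps=420, duration_s=60)
print(len(arrivals), "arrivals generated")
```

In the real harness each timestamp drives one request against the serving endpoint; a run is "sustained" if the in-flight queue depth stays bounded over the 30-minute window.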
Llama 3 8B — Single A100 80GB
- vLLM v0.6.4 (FP16): Max 420 req/s, P99 22ms, GPU util 74%
- TensorRT-LLM v0.12 (FP8): Max 680 req/s, P99 14ms, GPU util 83%
- Inferex v2.4 (INT4/FP8 mixed): Max 940 req/s, P99 7.8ms, GPU util 91%
Llama 3 70B — Single A100 80GB
(FP16 weights for the 70B model do not fit on a single A100 80GB and would require tensor parallelism across multiple GPUs. All single-card benchmarks below therefore use INT4 quantization.)
- vLLM v0.6.4 (GPTQ INT4): Max 68 req/s, P99 87ms, GPU util 69%
- TensorRT-LLM v0.12 (INT4-AWQ): Max 112 req/s, P99 52ms, GPU util 79%
- Inferex v2.4 (INT4 GPTQ + kernel opt): Max 198 req/s, P99 21ms, GPU util 90%
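The relative gaps in the two tables above reduce to simple ratios. This snippet recomputes them from the reported throughput numbers (figures copied from the tables, not re-measured):

```python
# Max sustained throughput (req/s), copied from the tables above.
results = {
    "llama3-8b":  {"vLLM": 420, "TensorRT-LLM": 680, "Inferex": 940},
    "llama3-70b": {"vLLM": 68,  "TensorRT-LLM": 112, "Inferex": 198},
}

for model, rps in results.items():
    base = rps["vLLM"]  # vLLM as baseline
    for engine, value in rps.items():
        print(f"{model}: {engine} = {value / base:.2f}x vLLM throughput")
```

The headline numbers: Inferex sustains roughly 2.2x vLLM's throughput on the 8B model and roughly 2.9x on the 70B model, with TensorRT-LLM in between on both.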
Setup Complexity
- vLLM: pip install + 3 lines of Python. 15 minutes to production. No optimization configuration required.
- TensorRT-LLM: Docker container, model conversion step (20–90 minutes for large models), NVIDIA-specific toolchain. Production ready in 2–4 hours for a new model.
- Inferex: API key + 10 lines of Python (or REST API). Optimization applied automatically server-side. 10 minutes to production.
Where vLLM Wins
vLLM is the easiest entry point and has the best community support. If you're prototyping or running a low-traffic service under 100 req/s, vLLM is the right choice. The gap to Inferex at low load is small; the simplicity advantage is real.
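For reference, the "3 lines of Python" is vLLM's offline inference API. The model name and sampling settings below are illustrative; a CUDA GPU and `pip install vllm` are assumed, so treat this as a sketch rather than a copy-paste recipe:

```python
from vllm import LLM, SamplingParams

# Load the model; vLLM handles continuous batching and the paged KV cache.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
outputs = llm.generate(["Explain KV caching in one sentence."],
                       SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```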
Where TensorRT-LLM Wins
If your deployment is all NVIDIA GPUs, you need maximum hardware utilization, and you have the engineering bandwidth to maintain the toolchain, TensorRT-LLM is a strong choice. It significantly outperforms vLLM and approaches Inferex performance on some workloads.
Where Inferex Wins
Multi-hardware environments (GPU + CPU + edge). High-concurrency workloads where every millisecond counts. Teams without ML infrastructure expertise who want optimization without operational complexity. Observability and auto-scaling included rather than bolted on.