
Part 3 — Implementation/Engine-Level: Choosing the Runtime That Gives You These for Free

Wednesday, June 3, 2026, by Mehedi Hasan
Originally published on Towards AI.

You now know how to make the model fast (Part 1) and how to build a stable serving layer around it (Part 2). The final question is: which engine actually implements all of this without forcing you to write a custom scheduler from scratch?

The theme of this part: inference engines are not neutral wrappers. They bake in specific opinions about batching, KV cache memory layout, prefix caching, and kernel selection. Pick the engine that aligns with your pain points, and you get chunked prefill, continuous batching, and paged KV cache for free. Pick the wrong one, and you'll spend sprints reimplementing features the right engine already has.

Here is how the four major runtimes compare in 2026, with exact configs and the tradeoffs that matter for production.

vLLM: The Production Default

vLLM is the safest starting point for most teams. Its core innovation, PagedAttention, treats the KV cache like virtual memory with fixed-size blocks, cutting the share of KV cache memory lost to fragmentation from 60–80% in naive systems to under 4%. That translates directly into 2–4x higher concurrency on the same GPU.
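The virtual-memory analogy can be made concrete with a toy block allocator. This is a minimal sketch, not vLLM's implementation: the class name `PagedKVCache`, its methods, and the block size of 16 are all assumptions chosen for illustration. It shows why waste stays small: each sequence can leave at most one partially filled block.

```python
import math


class PagedKVCache:
    """Toy allocator: each sequence maps logical blocks to physical block ids."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))     # unused physical blocks
        self.block_tables: dict[str, list[int]] = {}   # seq id -> physical blocks
        self.seq_lens: dict[str, int] = {}             # seq id -> tokens stored

    def append_tokens(self, seq_id: str, num_tokens: int) -> None:
        """Grow a sequence, taking new blocks only when the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + num_tokens
        needed = math.ceil(new_len / self.block_size) - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("out of KV blocks; a real engine would preempt")
        for _ in range(needed):
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = new_len

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Token slots allocated but unused (internal fragmentation)."""
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return allocated - sum(self.seq_lens.values())


cache = PagedKVCache(num_blocks=8, block_size=16)
cache.append_tokens("a", 20)   # needs 2 blocks
cache.append_tokens("b", 5)    # needs 1 block
cache.append_tokens("a", 10)   # 30 tokens still fit in the same 2 blocks
print(cache.block_tables, cache.wasted_slots())
```

Because blocks need not be contiguous, a freed block can be handed to any other sequence, which is what removes the need to reserve one large contiguous region per request.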
What you get out of the box:

- Continuous batching (iteration-level scheduling): requests join and leave the running batch at every token step rather than waiting for a whole batch to finish
- Chunked prefill (v0.4+): long prompts are broken into chunks and interleaved with decode steps, so a 3K-token prefill doesn't starve short chat requests
- Automatic prefix caching (APC): the engine detects shared prompt prefixes and reuses their KV cache automatically
- Speculative decoding (EAGLE, Medusa, n-gram): 2–3x latency reduction for memory-bound decode
- Multi-LoRA serving: serve hundreds of fine-tuned adapters on one base model
- Broad quantization support: GPTQ, AWQ, FP8, INT8, INT4, AutoRound
- 200+ model architectures: Llama, Qwen, DeepSeek, Mixtral, MoE, VLMs, embedding models

The config that matters:

    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.3-70B-Instruct \
      --tensor-parallel-size 2 \
      --gpu-memory-utilization 0.85 \
      --max-model-len 8192 \
      --max-num-seqs 64 \
      --enable-prefix-caching \
      --enable-chunked-prefill \
      --quantization fp8

Key flags explained:

- --enable-prefix-caching: turns on automatic prefix caching for shared system prompts (massive for RAG)
- --enable-chunked-prefill: prevents a long prefill from monopolizing the GPU by interleaving prefill chunks with decode
- --gpu-memory-utilization 0.85: caps vLLM at 85% of each GPU's memory, leaving 15% headroom for CUDA graph capture and other allocations outside that budget; going to 0.95 often causes OOM during graph capture
- --max-num-seqs 64: caps concurrent sequences. Higher isn't always better: if you hit memory limits, the engine will evict blocks and thrash.

MRV2 (Model Runner V2): In v0.17.0+, set VLLM_USE_V2_MODEL_RUNNER=1 to enable a rewritten backend that delivers significant throughput gains, especially on newer architectures like GB200.
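How continuous batching and chunked prefill interact is easier to see in a toy scheduler. The sketch below is a simplification under stated assumptions, not vLLM's scheduler: `Request`, `schedule_step`, `run`, and the per-step `token_budget` are hypothetical names, and it ignores details such as the last prefill chunk producing the first output token.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_len: int        # prompt tokens that need prefill
    max_new_tokens: int    # tokens to generate after prefill
    prefilled: int = 0
    generated: int = 0

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_len

    @property
    def done(self) -> bool:
        return self.generated >= self.max_new_tokens


def schedule_step(running: list[Request], token_budget: int) -> dict[str, int]:
    """Plan one iteration: decodes first (1 token each), then prefill chunks."""
    plan: dict[str, int] = {}
    for req in running:
        if req.in_decode and token_budget > 0:
            plan[req.name] = 1
            token_budget -= 1
    for req in running:
        if not req.in_decode and token_budget > 0:
            # Chunked prefill: take only what is left of this step's budget.
            chunk = min(req.prompt_len - req.prefilled, token_budget)
            plan[req.name] = chunk
            token_budget -= chunk
    return plan


def run(waiting: deque[Request], max_num_seqs: int, token_budget: int) -> list[dict[str, int]]:
    """Simulate until every request finishes; return the per-step plans."""
    running: list[Request] = []
    history: list[dict[str, int]] = []
    while waiting or running:
        # Continuous batching: admit new requests at every step.
        while waiting and len(running) < max_num_seqs:
            running.append(waiting.popleft())
        plan = schedule_step(running, token_budget)
        for req in running:
            n = plan.get(req.name, 0)
            if req.in_decode:
                req.generated += n
            else:
                req.prefilled += n
        running = [r for r in running if not r.done]
        history.append(plan)
    return history


steps = run(deque([Request("chat", 8, 3), Request("long", 3000, 2)]),
            max_num_seqs=2, token_budget=512)
for i, plan in enumerate(steps, start=1):
    print(i, plan)
```

In this run the short chat request keeps decoding one token per step while the 3000-token prompt is prefilled in budget-sized chunks alongside it, which is the starvation-avoidance behavior the feature list describes.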
When to choose vLLM:

- You support many different models and need one engine to handle them all
- You run on heterogeneous hardware (NVIDIA, AMD, Intel Gaudi, AWS Trainium)
- Your team wants the largest community, best documentation, and fastest debugging
- You need to be online in under 90 seconds from a cold start

Limitation: Peak throughput on dedicated H100 clusters is ~29% lower than SGLang or LMDeploy in some benchmarks, primarily due to Python orchestration overhead. If you have a fixed model and a specialized team, you can squeeze more out of other engines. But for most teams, vLLM's breadth outweighs that gap.

SGLang: The Throughput Challenger with Automatic Prefix Caching

SGLang, developed by LMSYS (the team behind Chatbot Arena), is no longer a niche alternative. It powers xAI's Grok 3 and Microsoft Azure's DeepSeek R1 deployments, running on over 400,000 GPUs worldwide.

What differentiates it: RadixAttention. Instead of requiring you to configure prefix caches manually, SGLang builds a radix tree from request prefixes and automatically reuses KV cache across any requests that share token sequences. This is transformative for multi-turn chat, agent loops, and RAG pipelines where system prompts and retrieved contexts repeat.
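The prefix-matching idea behind RadixAttention can be sketched with a plain token-level trie. This is an illustration under assumptions, not SGLang's data structure: a real radix tree compresses single-child chains and tracks eviction, and the names `PrefixNode`, `PrefixCache`, and `serve` are hypothetical.

```python
class PrefixNode:
    """One cached token position; children are keyed by the next token id."""

    def __init__(self):
        self.children: dict[int, "PrefixNode"] = {}


class PrefixCache:
    """Token-level trie standing in for a radix tree over request prefixes."""

    def __init__(self):
        self.root = PrefixNode()

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record that KV entries now exist for every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())


def serve(cache: PrefixCache, tokens: list[int]) -> tuple[int, int]:
    """Return (tokens reused from cache, tokens that still need prefill)."""
    reused = cache.match_prefix(tokens)
    cache.insert(tokens)
    return reused, len(tokens) - reused


cache = PrefixCache()
system_prompt = list(range(100))                  # stand-in token ids
print(serve(cache, system_prompt + [1000, 1001])) # first request, cold cache
print(serve(cache, system_prompt + [2000]))       # shares the system prompt
```

No request ever names a cache key: sharing falls out of the token sequence itself, which is why multi-turn chat (each turn extends the previous one) benefits without any application changes.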
What you get out of the box:

- RadixAttention: automatic, dynamic prefix caching without manual key management
- Chunked prefill: same interleaving benefit as vLLM
- EAGLE/EAGLE3 speculative decoding: state-of-the-art draft-model speculation
- Prefill-decode disaggregation: run prefill and decode on separate GPU pools so each scales independently
- MLA-optimized kernels: specifically tuned for DeepSeek models
- Zero-overhead CPU scheduler: moves scheduling logic off the GPU thread

The config that matters:

    python -m sglang.launch_server \
      --model-path meta-llama/Llama-3.3-70B-Instruct \
      --tp 2 \
      --quantization fp8 \
      --context-length 8192 \
      --mem-fraction-static 0.92 \
      --enable-flashinfer-mla \
      --host 0.0.0.0 \
      --port 8000

Key flags explained:

- --tp 2: tensor parallelism across 2 GPUs
- --mem-fraction-static 0.92: SGLang's memory allocator is more aggressive than vLLM's; 0.92 is typically stable on H100
- --enable-flashinfer-mla: enables optimized Multi-Head Latent Attention kernels; this matters for DeepSeek-class models that use MLA

Performance reality check: In H100 benchmarks with unique prompts (no prefix sharing), SGLang achieves roughly 29% higher throughput than vLLM. However, the gap narrows or reverses on workloads with high memory pressure, where vLLM's PagedAttention is more mature. The real win is in shared-prefix workloads (multi-turn conversations, agent loops, and RAG with fixed retrievers), where RadixAttention provides gains no other engine matches automatically.

When to choose SGLang:

- Your workload is dominated by multi-turn conversations or shared system prompts
- You are serving DeepSeek models (MLA kernels are best-in-class)
- You have a dedicated inference team that can manage dependencies (FlashInfer can be finicky to install)
- You need prefill-decode disaggregation at scale

Limitation: Model coverage is narrower than vLLM. If you serve exotic architectures or need to swap models frequently, vLLM is safer.
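Both launch commands above pair a memory fraction with a context length, and a back-of-envelope estimate shows how those settings interact. The sketch below is rough arithmetic, not a measurement: the layer and head counts are Llama-3.3-70B's published configuration, while the 16-bit KV cache, the 2 x 80 GB GPUs, the ~70 GB of FP8 weights, and the assumption that the reserved fraction must hold both weights and the KV pool are labeled assumptions. The function names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_token_capacity(total_gpu_bytes: float, mem_fraction: float,
                      weight_bytes: float, per_token_bytes: int) -> int:
    """Tokens that fit once weights are subtracted from the reserved fraction."""
    pool = total_gpu_bytes * mem_fraction - weight_bytes
    return max(0, int(pool // per_token_bytes))


# Llama-3.3-70B config: 80 layers, 8 KV heads (GQA), head_dim 128.
# ASSUMPTION: KV cache stored in 16-bit precision (2 bytes per element).
per_token = kv_bytes_per_token(80, 8, 128, 2)

# ASSUMPTIONS: 2 x 80 GB GPUs, ~70 GB of FP8 weights, no other overhead.
capacity = kv_token_capacity(2 * 80e9, 0.92, 70e9, per_token)

print(per_token)          # bytes of KV cache per token
print(capacity)           # total tokens the pool can hold
print(capacity // 8192)   # how many full 8192-token sequences that is
```

Under these assumptions each token costs 320 KiB of KV cache, so the pool holds roughly 235,000 tokens, or about 28 sequences at the full 8192-token context. A concurrency cap of 64 is therefore only sustainable when most requests are much shorter than the maximum context, which is one reason raising the cap does not automatically raise throughput.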
TensorRT-LLM: The NVIDIA Optimizer (With a Catch)

TensorRT-LLM is NVIDIA's official inference SDK. It delivers the highest raw throughput and lowest TTFT on NVIDIA hardware when fully tuned. But it makes very specific tradeoffs.

The compiled engine tradeoff: Traditionally, TensorRT-LLM required compiling a model into a serialized engine, a process that takes ~28 minutes for a 70B model. This is a one-time cost per model version, but it breaks auto-scaling and blue-green deploys unless you precompile and cache engines.

The PyTorch backend (v1.0+): This changed the game. TensorRT-LLM now defaults to a PyTorch backend that loads HuggingFace weights directly, cutting cold start to ~60–90 seconds (comparable to vLLM). You lose some peak throughput compared to the compiled engine, but you […]