
Part 2 — Serve-Level Speed: System Design That Stabilizes P95/P99

Wednesday, June 3, 2026, by Mehedi Hasan
Last Updated on June 3, 2026 by Editorial Team. Author(s): Mehedi Hasan. Originally published on Towards AI.

You’ve quantized the model, switched to Flash Attention, and maybe even dropped to INT4. Your GPU kernels are now efficient. But users still complain that the app is “sometimes slow.” Welcome to serving hell, where the bottleneck is rarely the model and almost always the system around it.

The theme of this part: once the model is efficient, most production wins come from queueing discipline, traffic routing, and stability controls. P95 and P99 latency are not driven by tensor core utilization. They’re driven by queueing, noisy neighbors, short prompts stuck behind long ones, and slow clients holding onto GPU memory.

[Figure: System-level techniques to reduce LLM production latency]

The Real Enemy Is Queueing, Not Compute

Here is the counterintuitive truth of production LLM serving: most latency is waiting time, not compute time. A request that takes 50ms of actual GPU work can easily spend 800ms in a queue because the batcher decided to wait for one more request, or because a 4K-token RAG prompt monopolized the prefill slot.

P95 and P99 latency are almost always caused by:

- Queueing: Requests pile up behind a large batch, or wait because the KV cache pool is exhausted and no slots are free
- Noisy neighbors: One tenant submits a 10K-token prompt and stalls everyone else
- Long prompts: Prefill dominates the GPU, starving decode steps
- Slow clients: A streaming client on a 3G connection buffers tokens and pins GPU memory
- Cold starts: A freshly scaled replica hasn’t yet loaded weights or allocated KV cache

If you only optimize median latency, you miss the real user experience. Users remember the one time they waited three seconds for the first token. Your product metrics will look fine while your user trust erodes.

Measure the Right Things

Before you fix anything, you need metrics that split the problem correctly.
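
To make the gap between median and tail concrete, here is a minimal sketch with made-up numbers. The 50ms of compute and the 800ms queue wait echo the example above; the 94/6 split, the 5ms short wait, and the `percentile` helper (nearest-rank method) are illustrative assumptions, not measurements from any real system.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, so float rounding cannot shift the rank.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical sample: 100 requests, each needing 50 ms of GPU work.
# 94 wait 5 ms in the queue; 6 get stuck behind a long prefill for 800 ms.
COMPUTE_MS = 50
queue_wait_ms = [5] * 94 + [800] * 6
total_ms = [wait + COMPUTE_MS for wait in queue_wait_ms]

p50 = percentile(total_ms, 50)
p95 = percentile(total_ms, 95)
p99 = percentile(total_ms, 99)
print(f"P50={p50}ms  P95={p95}ms  P99={p99}ms")
```

In this toy sample the median sits at 55ms while P95 and P99 both land at 850ms, even though every request needed the same amount of compute. The whole difference is queue wait, which a single "total request time" number cannot show.
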
Most teams log “total request time” and call it a day. That is useless, because it cannot tell you whether the time went to queueing, prefill, or decode.

Log these on every request:

- Time-to-first-token (TTFT): the user’s perception of responsiveness
- Time Per Output Token (TPOT): the standard industry metric for decode speed (e.g., 20ms/token). It is the inverse of per-request tokens-per-second (20ms/token is 50 tokens/s), which makes SLAs easier to calculate
- Prompt tokens and output tokens: separates prefill cost from decode cost
- Queue wait time (KV cache starvation): time spent before the GPU starts work. Note: requests usually queue not because the GPU is at 100% compute, but because the engine has run out of free PagedAttention blocks in the KV cache
- Prefill time and decode time separately: tells you which phase to optimize
- P95 and P99 per lane: not global P99, but per traffic lane (interactive vs. batch, short vs. long)

Why per-lane matters: If you mix a 50-token chat query with a 4K-token legal document summary, your global P99 will be dominated by the long prompt. You’ll optimize the wrong thing. Split your metrics by lane and optimize each lane independently.

Practical implementation: Most teams pipe these into Prometheus or Datadog. The key is tagging every metric with lane, model, quantization, and gpu_type. If you can’t segment, you can’t diagnose.

Traffic Shaping: Separate Your Lanes

The single most effective serving optimization is also the simplest: don’t let different workloads fight over the same GPU.

Interactive vs. Batch

Interactive traffic (chat, streaming UI) needs low TTFT. Batch traffic (background summarization, embedding generation) needs high throughput. They want opposite things from the scheduler.

The rule: Run them on separate replicas, or at minimum, separate queues with different scheduling policies.
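
As one way to picture "separate queues with different scheduling policies", the sketch below keeps two FIFO queues in one process and always serves the interactive lane first. The class name `TwoLaneQueue`, the `Request` fields, and the caps of 4 and 16 are hypothetical choices for illustration; this is not how any particular inference engine implements its scheduler.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    lane: str  # "interactive" or "batch"


class TwoLaneQueue:
    """Two FIFO queues; the interactive lane is always drained first."""

    def __init__(self, interactive_cap: int = 4, batch_cap: int = 16):
        # Per-lane batch caps (illustrative values, tune for your workload).
        self.caps = {"interactive": interactive_cap, "batch": batch_cap}
        self.queues = {"interactive": deque(), "batch": deque()}

    def submit(self, request: Request) -> None:
        self.queues[request.lane].append(request)

    def next_batch(self) -> list:
        # Strict priority: batch work runs only when no interactive work waits.
        for lane in ("interactive", "batch"):
            queue = self.queues[lane]
            if queue:
                size = min(self.caps[lane], len(queue))
                return [queue.popleft() for _ in range(size)]
        return []


lanes = TwoLaneQueue()
for rid, lane in [("b1", "batch"), ("b2", "batch"), ("i1", "interactive"),
                  ("b3", "batch"), ("i2", "interactive")]:
    lanes.submit(Request(rid, lane))

first = [r.request_id for r in lanes.next_batch()]
second = [r.request_id for r in lanes.next_batch()]
print(first, second)
```

Here the two interactive requests are served first even though three batch requests arrived around them. Note the tradeoff of strict priority: under sustained interactive load the batch lane can starve, which is one reason separate replicas are the cleaner option when you can afford them.
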
In vLLM, you can approximate this with separate engine instances:

```bash
# Interactive lane: small max batch, prioritize TTFT
python -m vllm.entrypoints.api_server \
  --model your-model \
  --max-num-seqs 4 \
  --max-model-len 4096 \
  --port 8000

# Batch lane: larger batch, tolerate higher latency
python -m vllm.entrypoints.api_server \
  --model your-model \
  --max-num-seqs 16 \
  --max-model-len 8192 \
  --port 8001
```

In TGI, the scheduler is less configurable per instance, so the cleanest approach is separate deployments behind a router.

Prompt-Length Lanes

Even within interactive traffic, a 100-token prompt and a 3K-token RAG prompt should not share a queue. The long prompt will stall the short one during prefill.

- Fast lane: Prompts under ~512 tokens. Strict max wait time (5–10ms), small batch cap.
- Slow lane: Prompts of ~512 tokens or more. Longer wait time allowed (50–100ms), larger batch cap acceptable.

Router logic (sketch):

```python
def pick_lane(prompt_tokens: int, streaming: bool) -> str:
    # Short streaming prompts go to the fast lane; everything else goes slow.
    if prompt_tokens < 512 and streaming:
        return "fast"
    return "slow"
```

The threshold depends on your model and GPU. Measure where your prefill time starts to dominate TTFT and draw the line there.

The Modern Alternative: Chunked Prefill

While routing by prompt length is a great architectural defense, modern inference engines (like vLLM 0.4+ and TGI) now solve this at the scheduler level via Chunked Prefill. Instead of computing a 3,000-token prefill in one giant block (which starves all other requests of decode steps), the engine breaks the prefill into smaller chunks (e.g., 512 tokens). It computes one chunk, runs a decode step for active streams, computes the next chunk, and so on. (We’ll cover which engines support this in Part 3.)

Continuous Batching and SLA Caps

Static batching is dead. Today, engines use Continuous Batching (also called iteration-level scheduling). Instead of waiting for a batch to fill, the scheduler re-evaluates after every token step and greedily injects new requests as soon as a sequence finishes and its KV cache frees up.
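
A toy model can show why iteration-level scheduling helps. The sketch below counts decode iterations only, under simplifying assumptions: every running sequence emits one token per iteration, a fixed number of slots stands in for KV cache capacity, and prefill cost is ignored. The function names are illustrative and do not come from vLLM or TGI.

```python
from collections import deque


def static_batching(output_lens, max_slots):
    """Finish iteration per request when a batch must fully drain first."""
    finished_at = []
    start = 0
    for i in range(0, len(output_lens), max_slots):
        batch = output_lens[i:i + max_slots]
        finished_at.extend(start + n for n in batch)
        # The next batch cannot start until the longest sequence is done.
        start += max(batch)
    return finished_at


def continuous_batching(output_lens, max_slots):
    """Finish iteration per request when free slots are refilled every step."""
    waiting = deque(enumerate(output_lens))
    running = {}  # request index -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this iteration.
        while waiting and len(running) < max_slots:
            idx, length = waiting.popleft()
            running[idx] = length
        step += 1
        # One decode iteration: each running sequence emits one token.
        for idx in list(running):
            running[idx] -= 1
            if running[idx] <= 0:
                finished_at[idx] = step
                del running[idx]
    return [finished_at[i] for i in range(len(output_lens))]


# Hypothetical workload: one long generation mixed with three short ones.
lens = [2, 8, 2, 2]
static_result = static_batching(lens, max_slots=2)
continuous_result = continuous_batching(lens, max_slots=2)
print("static:    ", static_result)
print("continuous:", continuous_result)
```

With two slots, the static version finishes the last two short requests at iteration 10 because they wait for the 8-token sequence to drain. The continuous version finishes them at iterations 4 and 6, because each one takes over a slot as soon as a short request completes.
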
The danger of continuous batching is that greedy schedulers can ruin TTFT. Derive your maximum queue wait from your TTFT SLA. If your interactive SLA is 100ms, configure the router so no request waits more than 80ms in the queue, leaving 20ms for the actual prefill compute. (Note: this 80/20 math only applies to the fast lane with short prompts; long prompts will need dedicated SLA tracking.)

In vLLM, this is handled internally, but you control the tradeoff via --max-num-seqs (max concurrent sequences) and --max-model-len. For more explicit control, some teams run custom dispatchers:

```
# Dispatcher pseudocode
MAX_WAIT_MS = 10
MAX_BATCH = 8

while True:
    batch = queue.collect_until(
        max_items=MAX_BATCH,
        timeout_ms=MAX_WAIT_MS
[…]
```