
3-Part Series: LLM Latency in Production (Part 1)

Wednesday, June 3, 2026 | Mehedi Hasan
Last updated on June 3, 2026 by Editorial Team. Author(s): Mehedi Hasan. Originally published on Towards AI. Originally published at https://mhabir.substack.com.

Part 1. Model-Level Speed: Make the Model Fast on the GPU

If you're shipping LLMs to production, your first performance bottleneck usually isn't serving logic or network overhead; it's the work the GPU does for every token. Many teams spend weeks tuning their batching logic before realizing their model baseline is 3–4x slower than it should be. This part is about fixing that baseline.

Why LLM Inference Is Memory-Bandwidth Bound (Especially in Decode)

A common misconception is that LLM inference is always compute-bound. Decode is typically memory-bandwidth bound, while prefill is mixed (compute plus memory) and becomes kernel-sensitive, especially with long contexts.

Here's the intuition. A 7B-parameter model in FP16 needs 14 GB just for weights. For a single decode step at batch size 1, all 14 GB must be read from HBM (TB/s-class bandwidth) to do roughly 14 GFLOP of computation, at about 2 FLOPs per parameter. That's an arithmetic intensity of about 1 FLOP/byte, far below the roofline ridge point where compute becomes the limit. On modern data-center GPUs, you'd need well over 100 FLOP/byte to saturate the tensor cores (roughly 150 on A100-class and 300 on H100-class hardware at dense FP16). In practice, during decode, you're waiting on HBM reads, not matrix multiplications.

This has two consequences:

- Batching helps because one weight load is amortized across multiple sequences, which raises arithmetic intensity and improves effective memory bandwidth utilization.
- Quantization is a bandwidth win: INT4 weights are 4x smaller than FP16, so you move 4x less weight data per token. That translates directly into lower decode latency.

Caveat: this is an upper bound on per-token memory traffic. Effective traffic depends on batching, caching, and parallelization, so real workloads see less than this theoretical maximum.

Every LLM request has two phases with completely different performance characteristics.
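Both phases can be placed on the same roofline with a little arithmetic. The sketch below is a back-of-the-envelope model, not a benchmark: it assumes about 2 FLOPs per parameter per token, counts only weight traffic (KV-cache and activation reads are ignored), and uses an illustrative 2 TB/s HBM bandwidth. The function names are hypothetical.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Assumes ~2 FLOPs per parameter per token and that every weight is read
    from HBM exactly once per pass (KV-cache and activations are ignored).
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def min_decode_step_ms(n_params: float, bytes_per_param: float,
                       hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step: the time to stream the weights once."""
    return n_params * bytes_per_param / hbm_bytes_per_s * 1e3


N_PARAMS = 7e9   # 7B-parameter model, as in the text
HBM_BW = 2e12    # assumed 2 TB/s HBM bandwidth (illustrative, not measured)

# (label, tokens processed per weight load, bytes per parameter)
cases = [
    ("decode, batch 1, FP16", 1, 2.0),
    ("decode, batch 32, FP16", 32, 2.0),
    ("decode, batch 1, INT4", 1, 0.5),
    ("prefill, 4096 tokens, FP16", 4096, 2.0),
]

for label, tokens, bytes_per_param in cases:
    intensity = arithmetic_intensity(tokens, bytes_per_param)
    print(f"{label:28s} {intensity:8.1f} FLOP/byte")

for label, bytes_per_param in [("FP16", 2.0), ("INT4", 0.5)]:
    ms = min_decode_step_ms(N_PARAMS, bytes_per_param, HBM_BW)
    print(f"min decode step ({label}): {ms:.2f} ms")
```

Under these assumptions, batch-1 decode sits at about 1 FLOP/byte while a long prefill lands in the thousands, which is why the two phases respond to different optimizations.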
The Prefill-Decode Asymmetry

Prefill processes the entire prompt in one forward pass. It's compute-intensive and memory-heavy because you're building the KV cache. For a 4K-token prompt, you run attention over the full 4K sequence in parallel (not as 4K autoregressive steps), which creates an O(n²) attention score matrix unless the kernel avoids materializing it, and you store 4K × hidden_dim × num_layers × 2 (K and V) values. With grouped-query attention, hidden_dim is replaced by num_kv_heads × head_dim. This can be multiple GB per request on large models.

Decode generates tokens autoregressively. Each step processes one new token and reuses the KV cache. It's memory-bandwidth dominated because every step streams the model weights and the entire KV cache through HBM.

This asymmetry means your optimization strategy must be phase-aware. Faster prefill requires better attention kernels (Flash Attention). Faster decode requires better cache management (paged KV, quantization).

Quantization is usually the single most effective model-level optimization. It reduces memory footprint, improves bandwidth efficiency, and often comes with minimal quality loss.

INT8 (LLM.int8()) uses vector-wise quantization and keeps outlier features in higher precision. It's the safest starting point: most models typically show <0.1% perplexity degradation. Its main benefit is memory footprint; in practice its latency is often no better than FP16. Implementation is straightforward:

    # bitsandbytes INT8 inference
    pip install bitsandbytes

In your model loading code:

    from transformers import BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_threshold=6.0,  # outlier threshold
        llm_int8_has_fp16_weight=False,
    )

TGI and vLLM both expose a bitsandbytes option, though support varies by version, so check the documentation for your release:

    # TGI with bitsandbytes INT8
    text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes

    # vLLM with INT8 (via config)
    python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --quantization bitsandbytes

INT4 is where the real speedup lives. You can get up to a 4x reduction in weight memory relative to FP16 and often a 2–3x latency improvement, but expect measurable quality degradation. Always validate with your actual prompt distribution.
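The memory figures in this section can be reproduced with a few lines of arithmetic. The sketch below assumes Llama-2-7B-like dimensions (hidden_dim=4096, num_layers=32, standard multi-head attention), which the article does not state, and counts weights only at their nominal bit width, ignoring quantization scales and zero points. The helper names are hypothetical.

```python
def kv_cache_bytes(seq_len: int, hidden_dim: int, num_layers: int,
                   bytes_per_value: int = 2) -> int:
    """KV-cache size for one request: seq_len x hidden_dim x num_layers x 2 (K and V)."""
    return seq_len * hidden_dim * num_layers * 2 * bytes_per_value


def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Weight footprint at a nominal bit width (quantization metadata ignored)."""
    return n_params * bits_per_weight / 8


# Assumed model shape (Llama-2-7B-like); swap in your own model's config values.
SEQ_LEN, HIDDEN_DIM, NUM_LAYERS = 4096, 4096, 32
N_PARAMS = 7e9

kv = kv_cache_bytes(SEQ_LEN, HIDDEN_DIM, NUM_LAYERS)  # FP16 cache entries
print(f"KV cache for a 4K prompt: {kv / 2**30:.2f} GiB")

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name} weights: {weight_bytes(N_PARAMS, bits) / 1e9:.1f} GB")
```

With these assumed dimensions, a single 4K-token request holds about 2 GiB of FP16 KV cache, which is the same order of magnitude as the INT4 weights themselves.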
AWQ: Activation-Aware Weight Quantization

AWQ's key insight: not all weights are equally important. Activation magnitudes reveal which weight channels matter most. By scaling those channels based on activation statistics before quantizing, AWQ achieves better 4-bit accuracy than naive round-to-nearest quantization.

Installation & Usage:

    git clone https://github.com/mit-han-lab/llm-awq
    cd llm-awq
    pip install -e .
    cd awq/kernels && python setup.py install  # Build efficient CUDA kernels

Quantize a model:

    # Step 1: AWQ search (calibration)
    python -m awq.entry --model_path meta-llama/Llama-2-7b-hf \
        --w_bit 4 --q_group_size 128 \
        --run_awq --dump_awq llama-2-7b-w4-g128.pt

    # Step 2: Generate quantized weights
    python -m awq.entry --model_path meta-llama/Llama-2-7b-hf \
        --w_bit 4 --q_group_size 128 \
        --load_awq llama-2-7b-w4-g128.pt \
        --q_backend real --dump_quant llama-2-7b-w4-g128-awq.pt

In vLLM/TGI, use pre-quantized models:

    # vLLM with AWQ (supported in many recent versions)
    python -m vllm.entrypoints.api_server --model TheBloke/Llama-2-7B-AWQ --quantization awq

    # TGI with AWQ
    text-generation-launcher --model-id TheBloke/Llama-2-7B-AWQ --quantize awq

AWQ Configuration Details:

- q_group_size=128: Weights are quantized in groups of 128 that share one scale and zero point. Smaller groups improve accuracy but increase the storage overhead of that metadata.
- w_bit=4: 4-bit quantization. AWQ also supports 3-bit for extreme compression.
- version="GEMM" (an AutoAWQ quantization config option): Choose between GEMM (general matrix multiply) or GEMV (matrix-vector) kernels. GEMM is faster for batch sizes > 1.

GPTQ: Hessian-Based Post-Training Quantization

GPTQ uses second-order information (the Hessian of the layer-wise reconstruction error) rather than gradient-based training to minimize quantization error. It's slightly more computationally expensive to quantize but produces excellent 4-bit models.
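To make "second-order information" concrete, the sketch below computes the layer-wise Hessian H = 2·XᵀX from calibration activations and checks that the output error of a quantized linear layer equals the quadratic form ½·tr(ΔW H ΔWᵀ). This is only the objective GPTQ works with, not the GPTQ algorithm itself; the quantizer here is plain group-wise round-to-nearest, and the shapes, random data, and function name are illustrative assumptions.

```python
import numpy as np


def quantize_rtn(w: np.ndarray, bits: int = 4, group_size: int = 128) -> np.ndarray:
    """Group-wise asymmetric round-to-nearest; returns the dequantized weights."""
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    levels = 2 ** bits - 1
    scale = np.maximum(w_max - w_min, 1e-8) / levels  # one scale per group
    q = np.clip(np.round((groups - w_min) / scale), 0, levels)
    return (q * scale + w_min).reshape(w.shape)


rng = np.random.default_rng(0)
out_features, in_features, n_samples = 64, 256, 512
W = rng.normal(size=(out_features, in_features))   # toy linear-layer weights
X = rng.normal(size=(n_samples, in_features))      # toy calibration activations
X[:, :4] *= 20.0                                    # a few high-magnitude input channels

H = 2.0 * X.T @ X                # Hessian of the squared output error w.r.t. a weight row
delta = quantize_rtn(W) - W      # weight perturbation introduced by quantization

direct = np.sum((X @ delta.T) ** 2)                 # ||X dW^T||_F^2
via_hessian = 0.5 * np.trace(delta @ H @ delta.T)   # same quantity through H

# Diagonal of H: how strongly errors on each input channel are penalized.
outlier_ratio = np.diag(H)[:4].mean() / np.diag(H)[4:].mean()

print(f"output error (direct):      {direct:.4f}")
print(f"output error (via Hessian): {via_hessian:.4f}")
print(f"penalty ratio, outlier vs. normal channels: {outlier_ratio:.0f}x")
```

The diagonal of H shows why activation statistics matter for both GPTQ and AWQ: the same weight error costs far more on input channels that carry large activations.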
Installation:

    pip install auto-gptq --no-build-isolation

Quantization:

    from transformers import AutoTokenizer  # used to tokenize the calibration samples
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    model = AutoGPTQForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        quantize_config=BaseQuantizeConfig(
            bits=4,
            group_size=128,
            desc_act=False,  # False for speed, True for slight quality improvement
        ),
    )

    # Calibrate with ~128-256 samples from your domain
    examples = [...]  # List of tokenized samples
    model.quantize(examples)
    model.save_quantized("llama-2-7b-gptq")

GPTQ in Serving:

    # vLLM with GPTQ
    python -m vllm.entrypoints.api_server --model TheBloke/Llama-2-7B-GPTQ --quantization gptq

Key GPTQ Configs:

- desc_act=False: Disables activation reordering. This can be 2–3x faster at inference, depending on the kernel, with minimal quality loss. Set it to True only if perplexity degradation is > 2%.
- use_marlin=True: On Ampere or newer GPUs (A100, RTX 30xx, and Ada-generation RTX 40xx), Marlin kernels can be 30–50% faster than the default exllamav2 kernels.

bitsandbytes NF4/FP4: The No-Precompute Option

bitsandbytes 4-bit (used in QLoRA) quantizes on the fly during model loading. No calibration is needed, but inference is often slower than AWQ/GPTQ because weights are dequantized at runtime on every forward pass.

Config:

    import torch
    from transformers import BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",  # or "fp4"
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,  # Compresses quantization constants
    )

Performance note: because of that runtime dequantization overhead, use NF4 for development, not max-throughput serving.

GPU Acceleration: Kernels That Actually Matter

Quantization reduces memory traffic. These kernels make the traffic you do have more efficient.

Flash Attention: The Prefill King

Flash Attention eliminates the need to materialize the full N×N attention matrix.
Instead, it tiles the computation and keeps intermediate results in on-chip SRAM, which can reduce HBM reads/writes by 10–20x in theory; typical speedups in practice are 30–50% on long sequences.

Installation:

    pip install flash-attn --no-build-isolation

In Practice: FlashAttention-2 is integrated into most major inference engines. Depending on the engine, it is either bundled or needs to be installed before you build the engine. For custom PyTorch code, use the flash_attn […]