You Do Not Need 50 Diffusion Steps. Here Is What Nvidia Proved at GTC.
Author(s): Siddhant Nitin Patil. Originally published on Towards AI.

The video diffusion industry has had the same conversation for two years. Better model. More parameters. Higher resolution. Longer clips. Richer motion. And underneath all of it, the same silent constraint that nobody advertises: generating a single second of 720p video still takes long enough to make most real-time use cases a fantasy.

At GTC 2026 in San Jose, Nvidia’s Ziv Ilan from the AI Labs team in Paris gave a 20-minute talk that reframed the problem entirely. The title: “You Might Not Need 50 Diffusion Steps.” The argument was not about a new model. It was about what happens when you stop treating the step count as a fixed constraint and start treating it as an engineering variable.

Why Step Count Is the Real Bottleneck

Diffusion models generate images and videos through iterative denoising. Random noise gets progressively cleaned up across a series of steps, each step moving the output closer to the final result. Standard production models run 20 to 50 denoising steps. Each step is a full forward pass through a model that, in the case of modern video diffusion architectures, can have 20 to 40 billion parameters.

The math compounds fast. A single 1,328 x 1,328 image generated with Qwen-Image involves approximately 12,900 TFLOPs of computation, producing a latency of up to 127 seconds per image on an Nvidia H20 GPU. For video, where you need consistent quality across frames with temporal coherence, the compute demand grows faster than linearly with resolution and duration.

This is why Adobe’s Firefly video generation model, before optimization, was architecturally capable but commercially constrained. State-of-the-art image diffusion already took tens of seconds per image. Video diffusion with a 50-step process at production resolution was simply not viable for interactive or real-time applications.
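To make the step-count arithmetic concrete, here is a rough sketch of how latency scales when only the number of denoising steps changes. It assumes every step costs the same and that denoising dominates total time, ignoring text encoding and decoding overhead. The 127-second figure is the Qwen-Image number quoted above; the 50-step baseline is an assumption for illustration, since the article does not say how many steps that measurement used.

```python
def per_step_latency(total_seconds: float, steps: int) -> float:
    """Average seconds per denoising step, assuming every step costs the same."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    return total_seconds / steps


def estimated_latency(step_seconds: float, steps: int) -> float:
    """Estimated sampling time when only the step count changes."""
    return step_seconds * steps


# 127 s is the Qwen-Image figure quoted above; 50 steps is an ASSUMPTION,
# because the article does not state the step count behind that measurement.
BASELINE_SECONDS = 127.0
ASSUMED_BASELINE_STEPS = 50

step_cost = per_step_latency(BASELINE_SECONDS, ASSUMED_BASELINE_STEPS)

for steps in (50, 20, 8, 4):
    seconds = estimated_latency(step_cost, steps)
    print(f"{steps:>2} steps -> {seconds:6.1f} s")
```

Under that assumption each step costs roughly 2.5 seconds, so a 4-step run would land near 10 seconds. The model is deliberately linear: it shows why step count is such a large lever, not what any real deployment would measure.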
The path forward was not a bigger model. It was a smarter inference stack.

The Three-Technique Stack

Ilan’s talk organized the solution space into three composable techniques: quantization, caching, and distillation. Critically, these are not alternatives. They are stackable. You deploy them in combination, and each one adds a multiplier to the performance gains of the others.

Quantization: Making Each Step Cheaper

Quantization reduces the numerical precision of the model’s weights and activations from 16-bit or 32-bit floating point to lower-precision formats: INT8, FP8, or even FP4 in the latest research. For LLMs, the impact of quantization is well understood and well documented. Diffusion models present a more complex picture because they are attention-heavy in ways that LLMs are not.

The multi-head attention mechanisms in transformer-based diffusion architectures (DiT models) are more sensitive to precision loss than the feed-forward layers in autoregressive models. This means that naive quantization approaches developed for LLMs often produce measurable quality degradation in diffusion models even at INT8 precision.

The solution Nvidia has deployed in production, demonstrated through their collaboration with Black Forest Labs on Flux 2, uses dynamic quantization rather than static quantization. Static quantization pre-computes the activation range across a calibration dataset and applies fixed scaling factors at inference time. Dynamic quantization computes activation ranges on the fly per batch, adapting to the actual data distribution being processed. For diffusion models where the latent space evolves significantly across denoising steps, dynamic quantization maintains quality that static approaches cannot match.

The hardware layer amplifies this further.
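Before turning to the hardware, the static versus dynamic distinction described above can be illustrated with a toy symmetric INT8 quantizer. This is a minimal sketch, not Nvidia's or Black Forest Labs' implementation: the helper names are hypothetical, the activation values are made up, and the idea that the range shrinks between the two steps is an assumption chosen only to show what happens when the data drifts away from the calibration range.

```python
def scale_for(values, qmax=127):
    """Symmetric scale mapping the largest magnitude onto the INT8 limit."""
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0


def quantize_dequantize(values, scale, qmax=127):
    """Round to integer levels, clamp to the INT8 range, then map back to floats."""
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# HYPOTHETICAL activations: a wide range at one step, a narrow range at another.
wide_step = [8.0, -6.5, 3.2, -7.9, 5.1]
narrow_step = [0.05, -0.03, 0.02, -0.04, 0.01]

# Static: one scale fixed ahead of time from calibration data (here, the wide step).
static_scale = scale_for(wide_step)
# Dynamic: the scale is recomputed from the tensor actually being processed.
dynamic_scale = scale_for(narrow_step)

static_err = mean_abs_error(narrow_step, quantize_dequantize(narrow_step, static_scale))
dynamic_err = mean_abs_error(narrow_step, quantize_dequantize(narrow_step, dynamic_scale))

print(f"static error:  {static_err:.5f}")
print(f"dynamic error: {dynamic_err:.5f}")
```

With the fixed scale, most of the narrow-range values collapse onto zero or a single quantization level, while the recomputed scale spreads them across the full integer range. The cost of the dynamic approach is the extra range computation at inference time.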
Nvidia’s Blackwell architecture introduced NVFP4 support, a 4-bit floating point format that, combined with Blackwell’s dedicated FP4 tensor cores, delivers performance gains that dwarf what FP8 achieved on Hopper. In ComfyUI benchmarks, NVFP4 optimizations on RTX 50-series cards delivered up to 3x performance boosts over FP16 baselines. For Stable Diffusion 3.5 Large, FP8 quantization alone cuts the VRAM requirement from 18GB to 11GB, opening up mid-range 12GB GPUs for a model that previously called for a 24GB-class card.

The Adobe Firefly case is the most concrete enterprise data point. Using TensorRT with mixed FP8 and BF16 precision on Hopper GPUs via AWS EC2 P5 instances, Adobe reported a 60% latency reduction and a 40% reduction in total cost of ownership, serving more users with fewer GPUs. This is not a research result. It is a production deployment that is live today.

One important note from Ilan on diffusion-specific quantization considerations: because these models are more attention-heavy than LLMs, the memory savings from quantization are less dramatic than in the LLM world. The performance gains still matter, but the ratio of memory benefit to compute benefit is different. Quantization should be treated as the entry-point optimization, the lowest-friction gain available, rather than the primary strategy. Quantization gets you into the field. Caching and distillation win the game.

Caching: Skipping the Computation You Already Did

The second technique exploits a property of diffusion that is counterintuitive until you see it: adjacent denoising steps are highly redundant. When a diffusion model runs 50 steps to generate a video frame, the feature representations in the model’s internal layers do not change dramatically between step 23 and step 24. The high-level structure, the composition, and the semantic layout are largely determined in the early steps. The middle steps refine. The late steps clean up residual noise and adjust texture.
Large swaths of the computation happening in steps 24 through 48 are recalculating values that changed very little from the previous step.

This is the same insight that motivated KV caching in LLMs: if you have already computed something and it has not changed meaningfully, do not recompute it. In the autoregressive case, KV cache is straightforward because you are generating one token at a time and the previously computed keys and values are definitionally unchanged. In diffusion, the cache mechanics are more complex because you are denoising across a full latent space simultaneously, but the redundancy is real and measurable.

T-cache, the approach Ilan referenced in his talk, operates at the full pixel or latent space level. It computes a similarity metric between the current denoising step’s output and the previous step’s output. If the change […]
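The step-to-step comparison described above can be sketched as follows. The text is cut off before it states the decision rule, so everything past the similarity metric is an assumption: the relative L1 metric, the 0.05 threshold, and the function names are hypothetical and are not taken from T-cache itself. The sample latents are made up.

```python
def relative_change(current, previous):
    """Summed absolute difference, normalised by the previous output's magnitude."""
    diff = sum(abs(c - p) for c, p in zip(current, previous))
    norm = sum(abs(p) for p in previous)
    return diff / norm if norm > 0 else float("inf")


def should_reuse_cache(current, previous, threshold=0.05):
    """ASSUMED rule: treat the step as redundant when the change is below threshold."""
    return relative_change(current, previous) < threshold


# HYPOTHETICAL flattened latents from three consecutive denoising steps.
step_a = [1.00, -0.50, 0.25, 0.80]
step_b = [0.99, -0.51, 0.25, 0.79]  # barely moved from step_a
step_c = [0.60, -0.10, 0.70, 0.20]  # moved a lot from step_b

print(should_reuse_cache(step_b, step_a))
print(should_reuse_cache(step_c, step_b))
```

The sketch only covers the comparison. Which computation gets skipped once a step is judged redundant, and how the threshold trades speed against quality, would depend on details the text does not give.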
