OrbitQuant makes low-bit weight-and-activation quantization practical for large image and video models, reducing the memory and compute cost of generation without calibration data. We apply OrbitQuant to multiple image and video models, showing that these models can be hosted with a much smaller memory footprint, including meaningful image inference even with 2-bit weights.
Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, forcing prior methods to re-fit calibration data for every new checkpoint or modality. We present OrbitQuant, a data-agnostic weight-activation quantizer that bypasses range estimation by quantizing in a normalized, rotated basis. In this basis, a randomized permuted block-Hadamard (RPBH) rotation concentrates each coordinate around one fixed, known marginal regardless of the input, so a single Lloyd–Max codebook serves all timesteps, prompts, and layers of a given input dimension. We extend the same quantizer to weight rows offline, absorbing the rotation into the weights so that it cancels inside each linear layer and only a forward rotation on the activations remains at runtime. The same recipe transfers from image to video with no per-modality tuning. Across FLUX.1, Z-Image-Turbo, Wan 2.1, and CogVideoX, it sets the state of the art for PTQ at several low-bit settings. It also pushes PTQ of image diffusion transformers to 2-bit weights and 4-bit activations with usable generation quality.
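The rotation-folding step described above can be illustrated with a minimal NumPy sketch. The construction used here (random sign flips, then a random permutation, then a block-diagonal normalized Hadamard transform) is only an assumption about what an RPBH-style rotation might look like; `rpbh_like_rotation`, the block size, and the layer shapes are hypothetical and not taken from the paper. The sketch only checks the algebraic identity that lets the rotation cancel inside a linear layer before any quantization is applied.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def rpbh_like_rotation(d: int, block: int, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical RPBH-style orthogonal matrix: signs, then permutation, then block Hadamard."""
    if d % block:
        raise ValueError("block must divide d")
    signs = np.diag(rng.choice([-1.0, 1.0], size=d))    # random sign flips
    perm = np.eye(d)[rng.permutation(d)]                 # random permutation matrix
    h_block = hadamard(block) / np.sqrt(block)           # orthonormal Hadamard block
    block_diag = np.kron(np.eye(d // block), h_block)    # block-diagonal transform
    return block_diag @ perm @ signs

rng = np.random.default_rng(0)
d, out_features, block = 256, 64, 64                     # illustrative sizes only
Pi = rpbh_like_rotation(d, block, rng)

W = rng.standard_normal((out_features, d))
x = rng.standard_normal(d) * np.linspace(0.1, 10.0, d)   # channels with uneven scales

W_rot = W @ Pi.T   # done once, offline: rotation absorbed into the weights
x_rot = Pi @ x     # the only rotation left at runtime

max_err = float(np.max(np.abs(W_rot @ x_rot - W @ x)))
print(f"max |W_rot x_rot - W x| = {max_err:.2e}")
```

Because the matrix is orthogonal, `W_rot @ x_rot` equals `W @ x` up to floating-point error. Once both factors are quantized in the rotated basis, the equality becomes an approximation.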
Overview of OrbitQuant. (1) DiT activations drift across timesteps and CFG branches, so calibrated scales do not transfer. (2) The RPBH rotation Π_d maps raw activations to well-behaved coordinates. Folded into the weights, it cancels inside each layer (Ŵ′x̂′ ≈ Wx). (3) Rotated coordinates concentrate around one fixed marginal f_d ≈ N(0, 1/d), so a single Lloyd–Max codebook per dimension serves all layers, timesteps, prompts, and both image and video DiTs, with no calibration.
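The shared-codebook idea in step (3) can be sketched as follows. This is a simplified illustration under several assumptions, not the paper's implementation: the codebook is fit with plain Lloyd iterations on samples drawn from N(0, 1/d), vectors are normalized by their L2 norm, and a dense random orthogonal matrix stands in for the RPBH rotation. The helper names `lloyd_max`, `quantize`, and `dequantize` are hypothetical.

```python
import numpy as np

def lloyd_max(samples: np.ndarray, bits: int, iters: int = 50) -> np.ndarray:
    """Fit 2**bits scalar levels by Lloyd iterations (1-D k-means) on the samples."""
    n_levels = 2 ** bits
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2           # nearest-level cell boundaries
        idx = np.searchsorted(edges, samples)
        for k in range(n_levels):
            cell = samples[idx == k]
            if cell.size:
                levels[k] = cell.mean()                  # centroid update
    return levels

def quantize(x, Pi, levels):
    """Normalize, rotate, and map each coordinate to its nearest codebook level."""
    norm = np.linalg.norm(x)
    z = Pi @ (x / norm)
    edges = (levels[:-1] + levels[1:]) / 2
    return np.searchsorted(edges, z), norm

def dequantize(idx, norm, Pi, levels):
    """Look up the levels, undo the rotation, and restore the norm."""
    return norm * (Pi.T @ levels[idx])

rng = np.random.default_rng(0)
d, bits = 512, 4                                         # illustrative sizes only

# One codebook for dimension d, fit without looking at any real activation.
levels = lloyd_max(rng.standard_normal(200_000) / np.sqrt(d), bits)

# Stand-in rotation (assumption): a dense random orthogonal matrix from QR.
Q, R = np.linalg.qr(rng.standard_normal((d, d)))
Pi = Q * np.sign(np.diag(R))

inputs = {
    "gaussian": rng.standard_normal(d),
    "heavy_tailed": 100.0 * rng.standard_t(2, size=d),   # very different raw statistics
}
rel_errors = {}
for name, x in inputs.items():
    idx, norm = quantize(x, Pi, levels)
    x_hat = dequantize(idx, norm, Pi, levels)
    rel_errors[name] = float(np.linalg.norm(x - x_hat) / np.linalg.norm(x))
    print(f"{name}: relative error {rel_errors[name]:.3f}")
```

In this sketch the same codebook is reused for both inputs even though their raw statistics differ widely, because normalization and rotation map both to coordinates with roughly the same marginal.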
OrbitQuant matches or exceeds the best calibration-based baselines at W4A4 on GenEval, and is the only PTQ method that produces usable images at W2A4, where all prior methods collapse to near-zero scores. On Wan 2.1-1.3B, OrbitQuant leads all baselines at both W4A6 and W4A4 on Imaging, Aesthetic, Scene, and Overall Consistency. No calibration data is used at any bit-width or modality.
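To put the weight bit-widths above in perspective, the following back-of-the-envelope sketch estimates weight storage at different precisions. It assumes the nominal parameter counts implied by the model names (1.3B and 14B) and that every parameter is quantized. It ignores codebooks, per-vector norms, and any layers kept in higher precision, so the numbers are rough lower bounds rather than measured footprints.

```python
def weight_gib(num_params: float, bits: int) -> float:
    """Storage for num_params weights at the given bit-width, in GiB."""
    return num_params * bits / 8 / 2**30

# Nominal parameter counts read off the model names (assumption).
models = {"Wan 2.1-1.3B": 1.3e9, "Wan 2.1-14B": 14e9}

for name, params in models.items():
    sizes = {bits: weight_gib(params, bits) for bits in (16, 4, 2)}
    print(
        f"{name}: BF16 {sizes[16]:.2f} GiB, "
        f"4-bit {sizes[4]:.2f} GiB, 2-bit {sizes[2]:.2f} GiB"
    )
```

Under these assumptions, moving from BF16 to 4-bit weights cuts weight storage by 4×, and 2-bit weights cut it by 8×.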
BF16 vs OrbitQuant W4A4 on Wan 2.1-14B. Each row shows the same prompt under full precision (left) and 4-bit weight & activation quantization (right).
Color drop in water, ink swirling in water, colorful ink in water, abstraction fancy dream cloud of ink
A car moving slowly on an empty street, rainy evening
A turtle swimming in ocean
Across diverse prompts involving motion, lighting, and fine texture, OrbitQuant W4A4 remains visually aligned with BF16 outputs despite operating at substantially lower precision.
Wan 2.1-14B at W4A4 across 5 methods. Each row shows the same prompt across BF16, OrbitQuant, SmoothQuant, QuaRot, and ViDiT-Q.
A boat sailing leisurely along the Seine River with the Eiffel Tower in background by Vincent van Gogh
Color drop in water, ink swirling in water, colorful ink in water, abstraction fancy dream cloud of ink
A jellyfish floating through the ocean, with bioluminescent tentacles
A turtle swimming in ocean
A car moving slowly on an empty street, rainy evening
Origami dancers in white paper, 3D render, on white background, studio shot, dancing modern dance
Robot dancing in Times Square
The 6 prompts are shown in the same order across every quantization strength. The same 5 methods appear in the same column order on every row. Behavior across the three bit settings (W2A4, W3A3, W4A4) tells the qualitative story.
W2A4 denotes 2-bit weight quantization and 4-bit activation quantization. The same 6 prompts appear in the same order at every bit setting (6 prompts × 5 methods).
| Prompt | BF16 | OrbitQuant W2A4 (Ours) | QuaRot W2A4 | ViDiT-Q W2A4 | SmoothQuant W2A4 |
|---|---|---|---|---|---|
| An elderly fisherman with weathered skin, gray beard, wrinkles, piercing blue eyes, golden hour lighting, ultra realistic photography | | | | | |
| A golden retriever wearing a red bandana, close-up portrait, detailed fur texture, shallow depth of field | | | | | |
| A snowy owl perched on a branch, intricate feathers, winter atmosphere, ultra detailed nature photography | | | | | |
| A steaming bowl of ramen with soft-boiled egg, pork chashu, green onions, food photography | | | | | |
| A futuristic city street at night with neon advertisements, reflections on wet pavement, cyberpunk aesthetic | | | | | |
| A humanoid robot reading a newspaper in a crowded subway station, photorealistic, cinematic lighting | | | | | |
W3A3 denotes 3-bit weight quantization and 3-bit activation quantization. The same 6 prompts appear in the same order at every bit setting (6 prompts × 5 methods).
| Prompt | BF16 | OrbitQuant W3A3 (Ours) | QuaRot W3A3 | ViDiT-Q W3A3 | SmoothQuant W3A3 |
|---|---|---|---|---|---|
| An elderly fisherman with weathered skin, gray beard, wrinkles, piercing blue eyes, golden hour lighting, ultra realistic photography | | | | | |
| A golden retriever wearing a red bandana, close-up portrait, detailed fur texture, shallow depth of field | | | | | |
| A snowy owl perched on a branch, intricate feathers, winter atmosphere, ultra detailed nature photography | | | | | |
| A steaming bowl of ramen with soft-boiled egg, pork chashu, green onions, food photography | | | | | |
| A futuristic city street at night with neon advertisements, reflections on wet pavement, cyberpunk aesthetic | | | | | |
| A humanoid robot reading a newspaper in a crowded subway station, photorealistic, cinematic lighting | | | | | |
W4A4 denotes 4-bit weight quantization and 4-bit activation quantization. The same 6 prompts appear in the same order at every bit setting (6 prompts × 5 methods).
| Prompt | BF16 | OrbitQuant W4A4 (Ours) | QuaRot W4A4 | ViDiT-Q W4A4 | SmoothQuant W4A4 |
|---|---|---|---|---|---|
| An elderly fisherman with weathered skin, gray beard, wrinkles, piercing blue eyes, golden hour lighting, ultra realistic photography | | | | | |
| A golden retriever wearing a red bandana, close-up portrait, detailed fur texture, shallow depth of field | | | | | |
| A snowy owl perched on a branch, intricate feathers, winter atmosphere, ultra detailed nature photography | | | | | |
| A steaming bowl of ramen with soft-boiled egg, pork chashu, green onions, food photography | | | | | |
| A futuristic city street at night with neon advertisements, reflections on wet pavement, cyberpunk aesthetic | | | | | |
| A humanoid robot reading a newspaper in a crowded subway station, photorealistic, cinematic lighting | | | | | |
@misc{lee2026orbitquant,
title = {OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers},
author = {Lee, Donghyun and Chavan, Jitesh and Nguyen, Duy and Huang, Sam and Jiang, Liming and Panda, Priyadarshini and Mertens, Timo and Shukla, Saurabh},
year = {2026},
eprint = {2607.02461},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2607.02461}
}