OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers

¹Cantina Labs   ²University of Southern California   ³University of Illinois Urbana-Champaign
Impact

OrbitQuant makes low-bit weight-and-activation quantization practical for large image and video models, reducing the memory and compute cost of generation without calibration data. We apply OrbitQuant to multiple image and video models, showing that these models can be hosted with a much smaller memory footprint, including meaningful image inference even with 2-bit weights.
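
As a rough back-of-the-envelope illustration of the footprint claim, the sketch below estimates weight storage at several bit-widths. It assumes the 14 billion parameter count implied by the model name Wan 2.1-14B, counts weights only, and ignores codebooks, norms, activations, and any layers kept in higher precision. The helper name weight_gb is hypothetical.

```python
def weight_gb(num_params: float, bits: int) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return num_params * bits / 8 / 1e9

# Assumption: parameter count implied by the name "Wan 2.1-14B".
PARAMS = 14e9

# The per-parameter arithmetic is the same for any model; the page reports
# 2-bit weights for image models, so the W2 row is purely illustrative here.
for label, bits in [("BF16", 16), ("W4", 4), ("W2", 2)]:
    print(f"{label}: {weight_gb(PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink by 4× at 4 bits and 8× at 2 bits relative to BF16; real deployments will land somewhat above these figures because of the overheads ignored here.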

Figure: BF16 (left) versus OrbitQuant W4A4 (right), using the same recipe on every backbone. Prompts shown:

“An astronaut is riding a horse in the space in a photorealistic style”
“A cat eating food out of a bowl”
“Sunset time lapse at the beach with moving clouds and colors in the sky”

OrbitQuant — a single data-agnostic weight-activation quantizer that transfers across image and video diffusion transformers without per-modality calibration.

Calibration-free · Backbone-agnostic · W4A4 weights & activations

Abstract

Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, forcing prior methods to re-fit calibration data for every new checkpoint or modality. We present OrbitQuant, a data-agnostic weight-activation quantizer that bypasses range estimation by quantizing in a normalized, rotated basis. In this basis, a randomized permuted block-Hadamard (RPBH) rotation concentrates each coordinate around one fixed, known marginal regardless of the input, so a single Lloyd–Max codebook serves all timesteps, prompts, and layers of a given input dimension. We extend the same quantizer to weight rows offline, absorbing the rotation into the weights so that it cancels inside each linear layer and only a forward rotation on the activations remains at runtime. The same recipe transfers from image to video with no per-modality tuning. Across FLUX.1, Z-Image-Turbo, Wan 2.1, and CogVideoX, it sets the state of the art for PTQ at several low-bit settings. It also pushes PTQ of image diffusion transformers to 2-bit weights and 4-bit activations with usable generation quality.
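
The cancellation described in the abstract can be sketched in a few lines. The construction below is an assumption, not the paper's exact RPBH definition: it composes random sign flips, a random permutation, and an orthonormal Hadamard transform applied per block. Names such as make_rotation and fwht, and the sizes d = 64 and block = 16, are hypothetical. Because the resulting map is orthogonal, rotating the weight rows offline and the activations at runtime leaves the layer output unchanged before any quantization is applied.

```python
import math
import random

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in v]

def make_rotation(d, block, seed=0):
    """Hypothetical RPBH-style rotation: signs, permutation, block Hadamard."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    perm = list(range(d))
    rng.shuffle(perm)

    def rotate(x):
        y = [signs[p] * x[p] for p in perm]      # sign flip + permutation
        out = []
        for s in range(0, d, block):             # Hadamard on each block
            out.extend(fwht(y[s:s + block]))
        return out

    return rotate

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

d, block = 64, 16
rotate = make_rotation(d, block)
rng = random.Random(1)

x = [rng.gauss(0.0, 1.0) for _ in range(d)]
x[3] = 50.0                                      # one outlier channel
W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(4)]

W_rot = [rotate(row) for row in W]               # offline: rotate weight rows
x_rot = rotate(x)                                # runtime: rotate activations

y_ref = [dot(row, x) for row in W]
y_rot = [dot(row, x_rot) for row in W_rot]
max_err = max(abs(a - b) for a, b in zip(y_ref, y_rot))

# Normalizing before (or after) the rotation gives a unit-norm vector,
# since an orthogonal map preserves the Euclidean norm.
norm = math.sqrt(dot(x, x))
x_hat_rot = [v / norm for v in x_rot]

print("max |Wx - W'x'| =", max_err)
print("peak before:", max(abs(v) for v in x), "peak after:", max(abs(v) for v in x_rot))
```

In this sketch the outlier is spread across its Hadamard block, so the peak magnitude drops while the layer output is preserved up to floating-point error. In the quantized setting the equality becomes an approximation, since both the rotated weights and the rotated activations are rounded to a codebook.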

Method


Overview of OrbitQuant. (1) DiT activations drift across timesteps and CFG branches, so calibrated scales do not transfer. (2) The RPBH rotation Π_d maps raw activations to well-behaved coordinates. Folded into the weights, it cancels inside each layer (Ŵ′x̂′ ≈ Wx). (3) Rotated coordinates concentrate around one fixed marginal f_d ≈ N(0, 1/d), so a single Lloyd–Max codebook per dimension serves all layers, timesteps, prompts, and both image and video DiTs, with no calibration.
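
To make step (3) concrete, here is a minimal sketch of fitting a Lloyd–Max codebook to the stated marginal N(0, 1/d) without any data. It alternates between midpoint decision boundaries and the closed-form Gaussian centroid of each cell. The function names, the iteration count, and the example dimension d = 3072 are illustrative assumptions, and the paper's actual codebook construction may differ.

```python
import math

def pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lloyd_max_gaussian(bits, d, iters=200):
    """Lloyd-Max levels for N(0, 1/d), computed analytically (no samples)."""
    sigma = 1.0 / math.sqrt(d)
    n = 2 ** bits
    # Start from uniform levels in [-3, 3], in units of sigma.
    levels = [-3.0 + 6.0 * k / (n - 1) for k in range(n)]
    for _ in range(iters):
        # Decision boundaries are midpoints between neighbouring levels.
        mids = [(a + b) / 2.0 for a, b in zip(levels, levels[1:])]
        edges = [-math.inf] + mids + [math.inf]
        # Each level moves to the conditional mean of its cell.
        levels = [
            (pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo))
            for lo, hi in zip(edges, edges[1:])
        ]
    return [sigma * v for v in levels]

def quantize(x, codebook):
    """Index of the nearest codebook level."""
    return min(range(len(codebook)), key=lambda k: abs(x - codebook[k]))

d = 3072                                  # hypothetical input dimension
codebook = lloyd_max_gaussian(bits=4, d=d)

coord = 0.5 / math.sqrt(d)                # a rotated, normalized coordinate
idx = quantize(coord, codebook)
print(len(codebook), idx, codebook[idx])
```

Because the codebook depends only on the bit-width and d, it can be computed once and reused for every layer with that input dimension. As a sanity check, with 2 bits and unit variance this iteration converges to the classic Gaussian levels of about ±0.453 and ±1.510.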

Main Results

OrbitQuant matches or exceeds the best calibration-based baselines at W4A4 on GenEval, and it is the only PTQ method that produces usable images at W2A4, where all prior methods collapse to near-zero scores. On Wan 2.1-1.3B, OrbitQuant leads all baselines at both W4A6 and W4A4 on Imaging, Aesthetic, Scene, and Overall Consistency. No calibration data is used at any bit-width or for any modality.

Highlighted Video Results — Wan 2.1-14B

BF16 vs OrbitQuant W4A4 on Wan 2.1-14B. Each row shows the same prompt under full precision (left) and 4-bit weight & activation quantization (right).

Color drop in water, ink swirling in water, colorful ink in water, abstraction fancy dream cloud of ink

BF16
OrbitQuant W4A4 (Ours)

A car moving slowly on an empty street, rainy evening

BF16
OrbitQuant W4A4 (Ours)

A turtle swimming in ocean

BF16
OrbitQuant W4A4 (Ours)

Across diverse prompts involving motion, lighting, and fine texture, OrbitQuant W4A4 remains visually aligned with BF16 outputs despite operating at substantially lower precision.

Video Generation Results — Wan 2.1-14B

Wan 2.1-14B at W4A4 across five methods. Each row shows the same prompt across BF16, OrbitQuant, SmoothQuant, QuaRot, and ViDiT-Q.

A boat sailing leisurely along the Seine River with the Eiffel Tower in background by Vincent van Gogh

BF16
OrbitQuant W4A4 (Ours)
SmoothQuant
QuaRot
ViDiT-Q

Color drop in water, ink swirling in water, colorful ink in water, abstraction fancy dream cloud of ink

BF16
OrbitQuant W4A4 (Ours)
SmoothQuant
QuaRot
ViDiT-Q

A jellyfish floating through the ocean, with bioluminescent tentacles

BF16
OrbitQuant W4A4 (Ours)
SmoothQuant
QuaRot
ViDiT-Q

A turtle swimming in ocean

BF16
OrbitQuant W4A4 (Ours)
SmoothQuant
QuaRot
ViDiT-Q

A car moving slowly on an empty street, rainy evening

BF16
OrbitQuant W4A4 (Ours)
SmoothQuant
QuaRot
ViDiT-Q

Origami dancers in white paper, 3D render, on white background, studio shot, dancing modern dance

BF16
OrbitQuant W4A4 (Ours)
SmoothQuant
QuaRot
ViDiT-Q

Robot dancing in Times Square

BF16
OrbitQuant W4A4 (Ours)
SmoothQuant
QuaRot
ViDiT-Q

Image Generation Results — FLUX.1-dev

The 6 prompts are shown in the same order across every quantization strength. The same 5 methods appear in the same column order on every row. Behavior across the three bit settings (W2A4, W3A3, W4A4) tells the qualitative story.

W2A4

W2A4 denotes 2-bit weight quantization and 4-bit activation quantization.

Columns: Prompt, BF16, OrbitQuant W2A4 (Ours), QuaRot W2A4, ViDiT-Q W2A4, SmoothQuant W2A4

An elderly fisherman with weathered skin, gray beard, wrinkles, piercing blue eyes, golden hour lighting, ultra realistic photography

A golden retriever wearing a red bandana, close-up portrait, detailed fur texture, shallow depth of field

A snowy owl perched on a branch, intricate feathers, winter atmosphere, ultra detailed nature photography

A steaming bowl of ramen with soft-boiled egg, pork chashu, green onions, food photography

A futuristic city street at night with neon advertisements, reflections on wet pavement, cyberpunk aesthetic

A humanoid robot reading a newspaper in a crowded subway station, photorealistic, cinematic lighting

6 prompts × 5 methods.

W3A3

W3A3 denotes 3-bit weight quantization and 3-bit activation quantization.

Columns: Prompt, BF16, OrbitQuant W3A3 (Ours), QuaRot W3A3, ViDiT-Q W3A3, SmoothQuant W3A3

An elderly fisherman with weathered skin, gray beard, wrinkles, piercing blue eyes, golden hour lighting, ultra realistic photography

A golden retriever wearing a red bandana, close-up portrait, detailed fur texture, shallow depth of field

A snowy owl perched on a branch, intricate feathers, winter atmosphere, ultra detailed nature photography

A steaming bowl of ramen with soft-boiled egg, pork chashu, green onions, food photography

A futuristic city street at night with neon advertisements, reflections on wet pavement, cyberpunk aesthetic

A humanoid robot reading a newspaper in a crowded subway station, photorealistic, cinematic lighting

6 prompts × 5 methods.

W4A4

W4A4 denotes 4-bit weight quantization and 4-bit activation quantization.

Columns: Prompt, BF16, OrbitQuant W4A4 (Ours), QuaRot W4A4, ViDiT-Q W4A4, SmoothQuant W4A4

An elderly fisherman with weathered skin, gray beard, wrinkles, piercing blue eyes, golden hour lighting, ultra realistic photography

A golden retriever wearing a red bandana, close-up portrait, detailed fur texture, shallow depth of field

A snowy owl perched on a branch, intricate feathers, winter atmosphere, ultra detailed nature photography

A steaming bowl of ramen with soft-boiled egg, pork chashu, green onions, food photography

A futuristic city street at night with neon advertisements, reflections on wet pavement, cyberpunk aesthetic

A humanoid robot reading a newspaper in a crowded subway station, photorealistic, cinematic lighting

6 prompts × 5 methods.

BibTeX

@misc{lee2026orbitquant,
  title         = {OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers},
  author        = {Lee, Donghyun and Chavan, Jitesh and Nguyen, Duy and Huang, Sam and Jiang, Liming and Panda, Priyadarshini and Mertens, Timo and Shukla, Saurabh},
  year          = {2026},
  eprint        = {2607.02461},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2607.02461}
}

References

  1. Chen, Lei, et al. “Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers.” Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
  2. Xiao, Guangxuan, et al. “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.” International Conference on Machine Learning. PMLR, 2023.
  3. Ashkboos, Saleh, et al. “QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs.” Advances in Neural Information Processing Systems 37 (2024): 100213–100240.
  4. Zhao, Tianchen, et al. “ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation.” arXiv preprint arXiv:2406.02540 (2024).
  5. Li, Muyang, et al. “SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models.” arXiv preprint arXiv:2411.05007 (2024).
  6. Zhang, Shaoqiu, et al. “AdaTSQ: Pushing the Pareto Frontier of Diffusion Transformers via Temporal-Sensitivity Quantization.” arXiv preprint arXiv:2602.09883 (2026).