OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers

Donghyun Lee¹,²,†, Jitesh Chavan¹, Duy Nguyen¹,³, Sam Huang¹,
Liming Jiang¹, Priyadarshini Panda², Timo Mertens¹, Saurabh Shukla¹,†

¹Cantina Labs ²University of Southern California ³University of Illinois Urbana-Champaign

†Correspondence to: saurabh@cantina.ai, donghyun.lee.1@usc.edu


Impact

OrbitQuant makes low-bit weight-and-activation quantization practical for large image and video models, reducing the memory and compute cost of generation without calibration data. We apply OrbitQuant to multiple image and video models, showing that these models can be hosted with a much smaller memory footprint, including meaningful image inference even with 2-bit weights.

BF16 on the left, OrbitQuant W4A4 on the right. Same recipe on every backbone.

“An astronaut is riding a horse in the space in a photorealistic style”
“A cat eating food out of a bowl”
“Sunset time lapse at the beach with moving clouds and colors in the sky”

OrbitQuant — a single data-agnostic weight-activation quantizer that transfers across image and video diffusion transformers without per-modality calibration.

Calibration-free · Backbone-agnostic · W4A4 weights & activations

Abstract

Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, forcing prior methods to recalibrate for every new checkpoint or modality. We present OrbitQuant, a data-agnostic weight-activation quantizer that bypasses range estimation by quantizing in a normalized, rotated basis. The rotation, a randomized permuted block-Hadamard (RPBH) transform, concentrates each coordinate around one fixed, known marginal regardless of the input, so a single Lloyd–Max codebook serves all timesteps, prompts, and layers of a given input dimension. We extend the same quantizer to weight rows offline, absorbing the rotation into the weights so that it cancels inside each linear layer and only a forward rotation on the activations remains at runtime. The same recipe transfers from image to video with no per-modality tuning. Across FLUX.1, Z-Image-Turbo, Wan 2.1, and CogVideoX, it sets the state of the art for PTQ at several low-bit settings. It also pushes PTQ of image diffusion transformers to 2-bit weights and 4-bit activations with usable generation quality.
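To make the "single Lloyd–Max codebook" idea concrete, here is a minimal sketch of fitting a scalar codebook once for a known marginal. It assumes the marginal is the Gaussian N(0, 1/d) mentioned in the method overview and fits the levels to random samples with Lloyd's algorithm (alternating nearest-level assignment and centroid updates). The function names, the dimension d = 64, and the sample-based fitting are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def lloyd_max(samples, bits, iters=100):
    """Fit a 2**bits-level scalar codebook with Lloyd's algorithm (1-D k-means)."""
    xs = sorted(samples)
    k = 2 ** bits
    # Initialise the levels at evenly spaced quantiles of the samples.
    levels = [xs[(2 * i + 1) * len(xs) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        # Decision thresholds are the midpoints between neighbouring levels.
        edges = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        sums, counts = [0.0] * k, [0] * k
        j = 0
        for x in xs:  # xs is sorted, so the cell index only moves forward
            while j < k - 1 and x > edges[j]:
                j += 1
            sums[j] += x
            counts[j] += 1
        # Each level moves to the mean of the samples in its cell.
        levels = [s / c if c else old for s, c, old in zip(sums, counts, levels)]
    return levels

def quantize(x, levels):
    """Map x to the nearest codebook level."""
    return min(levels, key=lambda c: abs(x - c))

# Assumed setting: coordinates distributed as N(0, 1/d) with d = 64.
d = 64
sigma = 1 / math.sqrt(d)
rng = random.Random(0)
samples = [rng.gauss(0.0, sigma) for _ in range(10000)]

codebook = lloyd_max(samples, bits=4)  # fitted once, no model data involved
mse = sum((x - quantize(x, codebook)) ** 2 for x in samples) / len(samples)
print(len(codebook), mse / sigma ** 2)
```

Because the codebook depends only on the assumed marginal and the bit-width, it can be computed ahead of time and reused wherever the rotated coordinates follow that marginal.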

Method

Overview of OrbitQuant. (1) DiT activations drift across timesteps and CFG branches, so calibrated scales do not transfer. (2) The RPBH rotation Π_d maps raw activations to well-behaved coordinates. Folded into the weights, it cancels inside each layer (Ŵ′x̂′ ≈ Wx). (3) Rotated coordinates concentrate around one fixed marginal f_d ≈ N(0, 1/d), so a single Lloyd–Max codebook per dimension serves all layers, timesteps, prompts, and both image and video DiTs, with no calibration.
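The cancellation in step (2) follows from orthogonality: for any orthogonal Π, (WΠᵀ)(Πx) = Wx, so rotating each weight row offline and each activation at runtime leaves the layer output unchanged before quantization. The sketch below illustrates this with one plausible construction of a randomized permuted block-Hadamard rotation: a random permutation, random sign flips, then an orthonormal Hadamard transform on each block. The page does not spell out how Π_d is composed, so this construction, the block size, and all function names are assumptions made for illustration.

```python
import math
import random

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(len(v))
    return [x * scale for x in v]

def make_rotation(d, block, seed=0):
    """Draw the random parts once: a permutation and a +/-1 sign per coordinate."""
    rng = random.Random(seed)
    perm = list(range(d))
    rng.shuffle(perm)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    return perm, signs, block

def rotate(x, rot):
    """Permute, flip signs, then apply the Hadamard transform block by block."""
    perm, signs, block = rot
    y = [signs[i] * x[perm[i]] for i in range(len(x))]
    out = []
    for start in range(0, len(y), block):
        out.extend(fwht(y[start:start + block]))
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

d, block = 64, 16  # assumed sizes; d must be a multiple of block
rot = make_rotation(d, block)
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
x[3] = 50.0                                   # one outlier channel
w = [rng.gauss(0.0, 1.0) for _ in range(d)]   # one weight row

# Offline: rotate the weight row. Runtime: rotate the activation.
gap = abs(dot(rotate(w, rot), rotate(x, rot)) - dot(w, x))

# Normalise, then compare the largest coordinate before and after rotation.
norm = math.sqrt(dot(x, x))
x_hat = [v / norm for v in x]
peak_before = max(abs(v) for v in x_hat)
peak_after = max(abs(v) for v in rotate(x_hat, rot))
print(gap, peak_before, peak_after)
```

Without quantization the two dot products agree up to floating-point error; the ≈ in Ŵ′x̂′ ≈ Wx comes from quantizing the rotated weights and activations, which this sketch does not do. The outlier is spread over its block, so the largest normalized coordinate shrinks after rotation.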

Main Results

On FLUX.1-dev, OrbitQuant posts the highest overall GenEval score among W4A4 methods (0.633, against 0.618 for the strongest baseline), and it is the only PTQ method evaluated that produces usable images at W2A4: QuaRot, SmoothQuant, and ViDiT-Q all collapse to near-zero scores. On Wan 2.1-1.3B, OrbitQuant leads all baselines at both W4A6 and W4A4 on Imaging, Aesthetic, Scene, and Overall Consistency. No calibration data is used at any bit-width or modality.

Image Generation: FLUX.1-dev — GenEval

Method	Bits	Single	Two	Count	Color	Position	Color Attr.	Overall
FP16 reference	—	0.984	0.823	0.769	0.771	0.203	0.450	0.667
Q-DiT[1]	W4A4	0.047	0.000	0.009	0.024	0.000	0.003	0.014
SmoothQuant[2]	W4A4	0.003	0.000	0.003	0.011	0.000	0.000	0.007
QuaRot[3]	W4A4	0.634	0.106	0.294	0.346	0.025	0.050	0.243
ViDiT-Q[4]	W4A4	0.709	0.147	0.325	0.410	0.028	0.060	0.280
SVDQuant[5]	W4A4	0.981	0.710	0.610	0.698	0.140	0.300	0.573
AdaTSQ[6]	W4A4	0.981	0.770	0.640	0.708	0.260	0.350	0.618
OrbitQuant (Ours)	W4A4	0.988	0.768	0.691	0.755	0.178	0.420	0.633
QuaRot	W2A4	0.006	0.000	0.000	0.000	0.000	0.000	0.001
SmoothQuant	W2A4	0.000	0.000	0.000	0.000	0.000	0.000	0.000
ViDiT-Q	W2A4	0.006	0.000	0.000	0.000	0.000	0.000	0.001
OrbitQuant (Ours)	W2A4	0.956	0.424	0.481	0.678	0.110	0.203	0.475

Video Generation: Wan 2.1-1.3B — VBench

Method	Bits	Imaging ↑	Aesthetic ↑	Motion Smoothness ↑	Dynamic Degree ↑	Background Consistency ↑	Subject Consistency ↑	Scene ↑	Overall Consistency ↑
Full Prec.	—	64.30	58.21	97.37	70.28	95.94	93.84	28.05	24.67
SmoothQuant	W4A6	53.51	49.19	98.01	34.44	94.89	92.66	12.81	22.15
QuaRot	W4A6	56.92	50.36	96.94	54.17	95.36	91.65	14.88	22.65
ViDiT-Q	W4A6	56.24	50.18	94.81	52.43	89.67	82.53	13.45	19.58
SVDQuant	W4A6	58.16	51.27	97.05	49.44	93.74	91.71	14.18	23.26
OrbitQuant (Ours)	W4A6	61.25	56.08	97.76	59.78	95.51	94.23	24.88	24.35
SmoothQuant	W4A4	46.32	36.33	96.39	51.94	95.85	90.39	2.79	15.05
QuaRot	W4A4	51.42	40.49	96.21	52.78	95.76	88.80	5.31	17.98
ViDiT-Q	W4A4	44.51	36.43	96.16	58.06	95.92	89.59	1.85	13.11
SVDQuant	W4A4	57.57	46.30	94.21	72.22	93.16	77.96	12.73	21.91
OrbitQuant (Ours)	W4A4	58.58	53.41	97.42	53.89	95.30	92.98	18.81	23.86

Latency and Memory Analysis — FLUX.1-dev + Wan 2.1-1.3B

All methods evaluated under fake quantization on NVIDIA H100. OrbitQuant is the fastest W+A PTQ method on both image and video — SmoothQuant, QuaRot, and ViDiT-Q run 1.09×, 1.28×, and 1.40× slower on image. OrbitQuant matches BF16 peak memory on image; the small video overhead (20.3 vs 19.3 GB) is a fake-quant artifact.
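For readers unfamiliar with the term, fake quantization means values are snapped to the quantization levels but kept in a floating-point container, so accuracy reflects the low-bit setting while memory does not shrink. The sketch below contrasts the two; the 2-bit codebook, the activation values, and the helper names are hypothetical and chosen only for illustration.

```python
def fake_quantize(values, levels):
    """Quantize-dequantize: snap each value to its nearest level, keep it a float."""
    return [min(levels, key=lambda c: abs(v - c)) for v in values]

def packed_bytes(n_values, bits):
    """Bytes needed if the same values were stored as packed integer codes."""
    return (n_values * bits + 7) // 8

levels = [-1.5, -0.5, 0.5, 1.5]   # hypothetical 2-bit codebook
acts = [0.2, -1.7, 0.9, 1.1]      # hypothetical activations

simulated = fake_quantize(acts, levels)
print(simulated)                   # still floats, one per activation
print(packed_bytes(len(acts), 2))  # storage a packed low-bit kernel would need
```

Under fake quantization the simulated tensor occupies as much memory as the original floating-point tensor, which is consistent with peak memory staying at or slightly above the BF16 figure in these measurements.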

Latency and peak memory (lower-left is better). Left: image generation on FLUX.1-dev. Right: video generation on Wan 2.1-1.3B. OrbitQuant is the fastest PTQ method on both modalities while matching BF16 peak memory on image.

Highlighted Video Results — Wan 2.1-14B

BF16 vs OrbitQuant W4A4 on Wan 2.1-14B. Each row shows the same prompt under full precision (left) and 4-bit weight & activation quantization (right).

Color drop in water, ink swirling in water, colorful ink in water, abstraction fancy dream cloud of ink
A car moving slowly on an empty street, rainy evening
A turtle swimming in ocean

Across diverse prompts involving motion, lighting, and fine texture, OrbitQuant W4A4 remains visually aligned with BF16 outputs despite operating at substantially lower precision.

Video Generation Results — Wan 2.1-14B

Wan 2.1-14B at W4A4 across 5 methods. Each row shows the same prompt across BF16, OrbitQuant, SmoothQuant, QuaRot, and ViDiT-Q.

A boat sailing leisurely along the Seine River with the Eiffel Tower in background by Vincent van Gogh
Color drop in water, ink swirling in water, colorful ink in water, abstraction fancy dream cloud of ink
A jellyfish floating through the ocean, with bioluminescent tentacles
A turtle swimming in ocean
A car moving slowly on an empty street, rainy evening
Origami dancers in white paper, 3D render, on white background, studio shot, dancing modern dance
Robot dancing in Times Square

Image Generation Results — FLUX.1-dev

The 6 prompts are shown in the same order across every quantization strength. The same 5 methods appear in the same column order on every row. Behavior across the three bit settings (W2A4, W3A3, W4A4) tells the qualitative story.

W2A4

W2A4 denotes 2-bit weight quantization and 4-bit activation quantization. Same 6 prompts in the same order across every bit setting.

Prompt	BF16	OrbitQuant W2A4 (Ours)	QuaRot W2A4	ViDiT-Q W2A4	SmoothQuant W2A4
An elderly fisherman with weathered skin, gray beard, wrinkles, piercing blue eyes, golden hour lighting, ultra realistic photography
A golden retriever wearing a red bandana, close-up portrait, detailed fur texture, shallow depth of field
A snowy owl perched on a branch, intricate feathers, winter atmosphere, ultra detailed nature photography
A steaming bowl of ramen with soft-boiled egg, pork chashu, green onions, food photography
A futuristic city street at night with neon advertisements, reflections on wet pavement, cyberpunk aesthetic
A humanoid robot reading a newspaper in a crowded subway station, photorealistic, cinematic lighting

6 prompts × 5 methods.

W3A3

W3A3 denotes 3-bit weight quantization and 3-bit activation quantization. Same 6 prompts in the same order across every bit setting.

Prompt	BF16	OrbitQuant W3A3 (Ours)	QuaRot W3A3	ViDiT-Q W3A3	SmoothQuant W3A3
An elderly fisherman with weathered skin, gray beard, wrinkles, piercing blue eyes, golden hour lighting, ultra realistic photography
A golden retriever wearing a red bandana, close-up portrait, detailed fur texture, shallow depth of field
A snowy owl perched on a branch, intricate feathers, winter atmosphere, ultra detailed nature photography
A steaming bowl of ramen with soft-boiled egg, pork chashu, green onions, food photography
A futuristic city street at night with neon advertisements, reflections on wet pavement, cyberpunk aesthetic
A humanoid robot reading a newspaper in a crowded subway station, photorealistic, cinematic lighting

6 prompts × 5 methods.

W4A4

W4A4 denotes 4-bit weight quantization and 4-bit activation quantization. Same 6 prompts in the same order across every bit setting.

Prompt	BF16	OrbitQuant W4A4 (Ours)	QuaRot W4A4	ViDiT-Q W4A4	SmoothQuant W4A4
An elderly fisherman with weathered skin, gray beard, wrinkles, piercing blue eyes, golden hour lighting, ultra realistic photography
A golden retriever wearing a red bandana, close-up portrait, detailed fur texture, shallow depth of field
A snowy owl perched on a branch, intricate feathers, winter atmosphere, ultra detailed nature photography
A steaming bowl of ramen with soft-boiled egg, pork chashu, green onions, food photography
A futuristic city street at night with neon advertisements, reflections on wet pavement, cyberpunk aesthetic
A humanoid robot reading a newspaper in a crowded subway station, photorealistic, cinematic lighting

6 prompts × 5 methods.

BibTeX

@misc{lee2026orbitquant,
  title         = {OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers},
  author        = {Lee, Donghyun and Chavan, Jitesh and Nguyen, Duy and Huang, Sam and Jiang, Liming and Panda, Priyadarshini and Mertens, Timo and Shukla, Saurabh},
  year          = {2026},
  eprint        = {2607.02461},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2607.02461}
}

References

[1] Chen, Lei, et al. “Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers.” Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
[2] Xiao, Guangxuan, et al. “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.” International Conference on Machine Learning. PMLR, 2023.
[3] Ashkboos, Saleh, et al. “QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs.” Advances in Neural Information Processing Systems 37 (2024): 100213–100240.
[4] Zhao, Tianchen, et al. “ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation.” arXiv preprint arXiv:2406.02540 (2024).
[5] Li, Muyang, et al. “SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models.” arXiv preprint arXiv:2411.05007 (2024).
[6] Zhang, Shaoqiu, et al. “AdaTSQ: Pushing the Pareto Frontier of Diffusion Transformers via Temporal-Sensitivity Quantization.” arXiv preprint arXiv:2602.09883 (2026).