Accelerating Masked Image Generation by Learning Latent Controlled Dynamics

1Shanghai Jiao Tong University   2Shanghai Artificial Intelligence Laboratory
3Shanghai Innovation Institute   4Nankai University   5University of Science and Technology of China   6Nanjing University   7The University of Sydney   8Peking University   9INSAIT  
†: Corresponding Author

Abstract

Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the many sampling steps, each requiring a full pass of bi-directional attention. Their computation contains notable redundancy: when discrete tokens are sampled, the rich semantics carried by the continuous features are discarded. Existing works attempt to cache features to approximate future ones, but they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and their failure to account for sampling information. To fill this gap, we propose to learn a lightweight model that incorporates both the previous features and the sampled tokens, and regresses the average velocity field of the feature evolution. The model has moderate complexity: enough to capture the subtle dynamics, yet lightweight compared with the original base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4× acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation.

Motivation

Observation 1: Feature trajectory is smooth

[Figure: smoothness of the feature trajectory]

Each point in the trajectory is the feature averaged over all tokens at a step. Left: heatmap of pairwise cosine similarities; right: t-SNE visualization.
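The smoothness analysis above can be sketched as follows. This is a minimal illustration, not the paper's code: a toy trajectory of per-token features stands in for the base model's intermediate features, and we compute the pairwise cosine similarity of the step-averaged features as in the left panel.

```python
import numpy as np

def step_averaged_features(features):
    """Average the token features at each step: (T, N, D) -> (T, D)."""
    return features.mean(axis=1)

def pairwise_cosine_similarity(avg):
    """Pairwise cosine similarity between step-averaged features."""
    normed = avg / np.linalg.norm(avg, axis=1, keepdims=True)
    return normed @ normed.T

# Toy trajectory: 8 steps, 16 tokens, 4-dim features drifting smoothly.
rng = np.random.default_rng(0)
traj = np.cumsum(0.1 * rng.normal(size=(8, 16, 4)), axis=0) + 1.0
sim = pairwise_cosine_similarity(step_averaged_features(traj))
print(sim.shape)  # (8, 8); a smooth trajectory yields high off-diagonal values
```

A smooth trajectory shows up as a near-uniform, high-valued heatmap; abrupt feature jumps would appear as low-similarity bands.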

Assumption 1: It is feasible to predict the next feature from the current feature with light computation


Observation 2: Feature trajectory is controlled by sampled tokens

[Figures: feature trajectories under different sampling randomness]

PCA visualization of feature trajectories generated with the same prompt and initial random seed. Left: Using a MIGM, we first generate a trajectory (the dark one) and then change the random seed at intermediate steps to generate more samples (the light ones). The randomness in sampling tokens greatly affects the generation process. Right: In contrast, for continuous diffusion with ODE sampling, trajectories generated from the same starting point are always identical, with no randomness at intermediate steps.

Assumption 2: The information of the sampled tokens must be taken into account

Method

Formulation: State-space model

State transition

$$\boldsymbol{f}_{t_{i+1}} = \boldsymbol{f}_{t_{i}} + S_\theta(\boldsymbol{f}_{t_{i}}, \boldsymbol{x}_{t_{i}}, t_{i}) + \boldsymbol{\epsilon}$$

Observation (same as vanilla)

$$\boldsymbol{x}_{t_{i+1}} \sim K(\cdot | \boldsymbol{x}_{t_{i}}, \text{softmax}(H(\boldsymbol{f}_{t_{i+1}})), \gamma, t_{i}, t_{i+1})$$
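A toy sketch of one state-space step may clarify the two equations above. Everything here is a stand-in: `S_theta` (the learned shortcut model) and `H` (the output head) are random linear maps, and `sample_K` is a simplified observation kernel that re-samples every token from its categorical distribution, whereas the real kernel K unmasks only a subset of tokens according to the schedule γ.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, V = 5, 4, 6                      # tokens, feature dim, vocab size

W_s = 0.1 * rng.normal(size=(D, D))    # stand-in weights for S_theta
W_h = rng.normal(size=(D, V))          # stand-in weights for the head H

def S_theta(f, x, t):
    """Hypothetical lightweight shortcut model: regresses the feature
    increment from the current feature f, sampled tokens x, and time t.
    A linear map stands in for the learned network."""
    return f @ W_s

def H(f):
    """Output head mapping features to token logits."""
    return f @ W_h

def sample_K(x, probs):
    """Toy observation kernel: re-sample each token from its categorical
    distribution (the real K unmasks a subset per the schedule gamma)."""
    return np.array([rng.choice(V, p=p) for p in probs])

# One step: f_{t_{i+1}} = f_{t_i} + S_theta(f_{t_i}, x_{t_i}, t_i) (+ eps)
f = rng.normal(size=(N, D))
x = rng.integers(0, V, size=N)
f_next = f + S_theta(f, x, 0.5)        # noise term eps omitted at inference
logits = H(f_next)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
x_next = sample_K(x, probs)            # x_{t_{i+1}} ~ K(. | x_{t_i}, ...)
print(f_next.shape, x_next.shape)      # (5, 4) (5,)
```

The key point the sketch mirrors: the state transition is a cheap residual update of the features, while the observation step is left untouched from the vanilla sampler.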

MIGM-Shortcut

[Figure: MIGM-Shortcut framework]

Colored blocks and solid lines represent activated computation, while gray blocks and dashed lines represent suppressed computation. At a full step, inference is identical to the vanilla procedure. At a shortcut step, the lightweight shortcut model replaces the heavy base model.
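The alternation between full and shortcut steps can be sketched as a simple scheduling loop. This is a hedged toy, not the released implementation: `base_model`, `shortcut_model`, and `sample_tokens` are hypothetical stand-ins, and the set of full steps is chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STEPS, N_TOK, D, V = 8, 5, 4, 6

def base_model(x, i):
    """Stand-in for the heavy bidirectional-attention base model."""
    return rng.normal(size=(N_TOK, D))

def shortcut_model(f, x, i):
    """Stand-in for the lightweight shortcut model (predicts an increment)."""
    return 0.1 * rng.normal(size=f.shape)

def sample_tokens(f, x, i):
    """Stand-in for the MIGM unmasking/sampling kernel."""
    return rng.integers(0, V, size=N_TOK)

def generate(full_steps):
    """Run N_STEPS steps; invoke the base model only at `full_steps`."""
    f = np.zeros((N_TOK, D))
    x = np.full(N_TOK, V - 1)              # all tokens start masked
    n_full = 0
    for i in range(N_STEPS):
        if i in full_steps:
            f = base_model(x, i)           # full step: heavy recompute
            n_full += 1
        else:
            f = f + shortcut_model(f, x, i)  # shortcut step: cheap update
        x = sample_tokens(f, x, i)         # observation step (unchanged)
    return x, n_full

x, n_full = generate(full_steps={0, 3, 6})
print(n_full)  # 3 of the 8 steps invoke the base model
```

In the notation of the results tables, N is the total number of steps and B the number of full steps; the speedup comes from B ≪ N while the sampling kernel itself is untouched.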

Performance

On MaskGIT

[Figure: results on MaskGIT]

N denotes the number of total steps, and B denotes the number of full steps.


On Lumina-DiMOO

[Figure: results on Lumina-DiMOO]

* denotes methods re-implemented by us on Lumina-DiMOO.

[Figures: qualitative results on Lumina-DiMOO and human study]

Human study on the Rapidata platform. In nearly half the cases, DiMOO-Shortcut with B = 14 and 4.0× speedup is considered better. Even with B = 9 and 5.8× speedup, the win rate still approaches 40%.

Ablation Study


Importance of incorporating sampling information

[Figure: cross-attention ablation]

When the cross-attention module is replaced by a self-attention module, the shortcut model cannot attend to the sampled tokens, and it tends to output over-smoothed images, as it is forced to predict the expectation over all possible sampling outcomes.
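The over-smoothing argument has a one-line illustration. This toy is our own, not from the paper: under squared loss, a predictor that cannot observe which branch was sampled is optimal when it outputs the mean over all branches, which blurs the sharp individual outcomes.

```python
import numpy as np

# Two sharp, mutually exclusive sampling outcomes.
branches = np.array([[0.0, 1.0],
                     [1.0, 0.0]])

# The MSE-optimal prediction for a model blind to the sampled branch
# is the average over branches -- an "over-smoothed" output.
best_blind_prediction = branches.mean(axis=0)
print(best_blind_prediction)  # [0.5 0.5]
```

Conditioning on the sampled tokens (via cross-attention) removes this averaging and lets the shortcut model commit to the branch that was actually taken.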

Model complexity

[Figure: model-complexity ablation]

The complexity of the shortcut model must strike a balance. Too little complexity fails to accurately model the feature dynamics, as with existing caching-based methods, while too much wastes computation, as with the base model itself. Our default setting sits at a sweet spot of complexity.

BibTeX


@misc{migm-shortcut,
      title={Accelerating Masked Image Generation by Learning Latent Controlled Dynamics}, 
      author={Kaiwen Zhu and Quansheng Zeng and Yuandong Pu and Shuo Cao and Xiaohui Li and Yi Xin and Qi Qin and Jiayang Li and Yu Qiao and Jinjin Gu and Yihao Liu},
      year={2026},
      eprint={2602.23996},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.23996}, 
}