A point in the trajectory is the feature averaged over all tokens in a step. Left: heatmap of pairwise cosine similarity; Right: t-SNE visualization
Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the many steps of bi-directional attention. In fact, their computation contains notable redundancy: when discrete tokens are sampled, the rich semantics contained in the continuous features are lost. Some existing works cache features to approximate future ones, but they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and their failure to account for sampling information. To fill this gap, we propose to learn a lightweight model that incorporates both previous features and sampled tokens, and regresses the average velocity field of feature evolution. The model has moderate complexity: enough to capture the subtle dynamics, yet lightweight compared to the base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4× acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation.
PCA visualization of feature trajectories generated with the same prompt and initial random seed. Left: using a MIGM, we first generate a trajectory (dark) and then change the random seed at intermediate steps to generate more samples (light). The randomness of token sampling greatly affects the generation process. Right: in contrast, for continuous diffusion with ODE sampling, trajectories from the same starting point are always identical, with no randomness at intermediate steps.
$$\boldsymbol{f}_{t_{i+1}} = \boldsymbol{f}_{t_{i}} + S_\theta(\boldsymbol{f}_{t_{i}}, \boldsymbol{x}_{t_{i}}, t_{i}) + \boldsymbol{\epsilon}$$
$$\boldsymbol{x}_{t_{i+1}} \sim K(\cdot | \boldsymbol{x}_{t_{i}}, \text{softmax}(H(\boldsymbol{f}_{t_{i+1}})), \gamma, t_{i}, t_{i+1})$$
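The two equations above can be read as one shortcut step: the lightweight model $S_\theta$ advances the features, and the output head $H$ turns them into token logits for re-sampling. The sketch below illustrates this with toy linear stand-ins for $S_\theta$ and $H$ and a simplified per-position categorical draw in place of the transition kernel $K$; all names and shapes are hypothetical, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, L = 8, 16, 4          # feature dim, vocab size, sequence length (toy sizes)

# Toy stand-ins for the learned shortcut model S_theta and the output head H.
W_s = 0.1 * rng.standard_normal((D, D))
W_h = rng.standard_normal((D, V))

def S(f, x, t):
    # Lightweight shortcut model: predicts the feature increment.
    return f @ W_s

def H(f):
    # Output head: maps features to token logits.
    return f @ W_h

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def shortcut_step(f_t, x_t, t):
    """One shortcut step: Eq. (1) advances the features with the shortcut
    model plus a residual noise term; Eq. (2) re-samples tokens from the
    resulting distribution (here a plain categorical draw per position,
    a simplification of the transition kernel K)."""
    eps = 0.01 * rng.standard_normal(f_t.shape)
    f_next = f_t + S(f_t, x_t, t) + eps
    probs = softmax(H(f_next))
    x_next = np.array([rng.choice(V, p=p) for p in probs])
    return f_next, x_next

f = rng.standard_normal((L, D))
x = rng.integers(0, V, size=L)
f, x = shortcut_step(f, x, t=0.5)
print(f.shape, x.shape)     # (4, 8) (4,)
```

Note that, unlike an ODE step, the token re-sampling keeps the stochasticity of the original MIGM sampling process, which is exactly the information caching-based methods discard.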
Colored blocks and solid lines represent activated computation, while gray blocks and dashed lines represent suppressed computation. At a full step, inference is identical to the vanilla procedure; at a shortcut step, the lightweight shortcut model replaces the heavy base model.
N denotes the total number of steps, and B the number of full steps.
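Given N total steps and B full steps, one plausible schedule spreads the full steps evenly across the trajectory, with the remaining N − B steps handled by the shortcut model. The helper below is a hypothetical sketch of such a schedule (the paper's actual placement of full steps may differ).

```python
def step_schedule(N, B):
    """Return a list of N step types, interleaving B 'full' steps as evenly
    as possible among 'shortcut' steps. Assumes 1 <= B <= N; the first step
    is always a full step, since the shortcut model needs base features."""
    if B > 1:
        full = {round(i * (N - 1) / (B - 1)) for i in range(B)}
    else:
        full = {0}
    return ["full" if i in full else "shortcut" for i in range(N)]

print(step_schedule(8, 3))
# ['full', 'shortcut', 'shortcut', 'shortcut', 'full', 'shortcut', 'shortcut', 'full']
```

With this convention, the speedup is roughly N / (B + c·(N − B)), where c is the relative cost of one shortcut step versus one full step.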
* denotes methods re-implemented by us on Lumina-DiMOO.
Human study on the Rapidata platform. DiMOO-Shortcut with B = 14 (4.0× speedup) is judged better in nearly half the cases; even with B = 9 (5.8× speedup), the win rate still approaches 40%.
When the cross-attention module is replaced with self-attention, the shortcut model cannot attend to the sampled tokens and thus tends to output over-smoothed images, since it is forced to predict the expectation over all possible sampling outcomes.
The complexity of the shortcut model must strike a balance: too little capacity cannot accurately model the feature dynamics (as with existing caching-based methods), while too much wastes computation (as with the base model itself). Our default setting sits at a sweet spot.
@misc{migm-shortcut,
title={Accelerating Masked Image Generation by Learning Latent Controlled Dynamics},
author={Kaiwen Zhu and Quansheng Zeng and Yuandong Pu and Shuo Cao and Xiaohui Li and Yi Xin and Qi Qin and Jiayang Li and Yu Qiao and Jinjin Gu and Yihao Liu},
year={2026},
eprint={2602.23996},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.23996},
}