ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Abstract
ReFusion, a novel masked diffusion model, improves performance and efficiency through slot-based parallel decoding, far surpassing prior masked diffusion models and closing the performance gap to autoregressive models at a lower inference cost.
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative "plan-and-infill" decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs, with 34% performance gains and an over 18× speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33× average speedup.
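To make the decoding loop described in the abstract concrete, here is a minimal Python sketch of a generic plan-and-infill procedure. It is illustrative only: the names `plan_and_infill`, `plan_fn`, `infill_fn`, `MASK`, and `slots_per_step` are hypothetical stand-ins for the paper's diffusion-based planner and autoregressive infiller, and a real implementation would batch the slot decodes and reuse the KV cache of the shared context rather than loop over them.

```python
from typing import Callable, List, Optional

MASK = None  # placeholder for a masked token id


def plan_and_infill(
    prompt: List[int],
    num_slots: int,
    slot_len: int,
    plan_fn: Callable[[List[Optional[int]], List[int]], List[float]],
    infill_fn: Callable[[List[Optional[int]], int, int], List[int]],
    slots_per_step: int = 2,
) -> List[int]:
    """Iteratively (1) score the still-masked slots and pick a few to decode,
    then (2) fill those slots; repeat until no masked slots remain."""
    # The response region is a grid of `num_slots` fixed-length slots, all masked.
    seq: List[Optional[int]] = list(prompt) + [MASK] * (num_slots * slot_len)
    remaining = set(range(num_slots))

    while remaining:
        # Plan: score the masked slots given the partially filled sequence.
        # In the paper this role is played by a diffusion-based planner that
        # selects weakly dependent slots; here plan_fn is any user-supplied scorer.
        slots = sorted(remaining)
        scores = plan_fn(seq, slots)
        ranked = sorted(zip(slots, scores), key=lambda pair: -pair[1])
        chosen = [slot for slot, _ in ranked[:slots_per_step]]

        # Infill: decode each chosen slot left-to-right inside the slot.
        # The chosen slots are treated as independent, so a real system would
        # run these decodes in parallel with a shared KV cache.
        for slot in chosen:
            start = len(prompt) + slot * slot_len
            seq[start:start + slot_len] = infill_fn(seq, start, slot_len)
            remaining.discard(slot)

    return [tok for tok in seq if tok is not None]


# Toy usage with dummy planner/infiller (no real model involved):
output = plan_and_infill(
    prompt=[1, 2, 3],
    num_slots=4,
    slot_len=2,
    plan_fn=lambda seq, slots: [float(-s) for s in slots],  # prefer earlier slots
    infill_fn=lambda seq, start, n: [0] * n,                # emit dummy token ids
)
print(output)  # [1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0]
```

The sketch only captures the control flow: the planner decides which slots are safe to decode together, and the infiller fills each chosen slot autoregressively within its fixed length.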
Community
ReFusion is a masked diffusion model that achieves superior performance and efficiency, featuring full KV cache reuse while simultaneously supporting any-order generation.
ReFusion is a really elegant middle ground between ARMs and masked diffusion. Moving parallelism from tokens to slots neatly sidesteps two of the biggest MDM pain points: KV-cache incompatibility and incoherent dependency learning. The plan-and-infill split feels especially powerful — diffusion for global structure, AR for local precision. This looks less like an incremental speedup and more like a new decoding primitive for long-form generation. Excited to see how this scales to reasoning-heavy and multimodal settings.
Thanks for the kind words and the great summary of our work! We're excited about those directions too and plan to explore long reasoning tasks in our future work.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CDLM: Consistency Diffusion Language Models For Faster Sampling (2025)
- From Bits to Rounds: Parallel Decoding with Exploration for Diffusion Language Models (2025)
- From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs (2025)
- Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model (2025)
- Planned Diffusion (2025)
- Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models (2025)
- Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models (2025)
Models citing this paper 1
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper