ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Abstract
ReFusion, a novel masked diffusion model, improves performance and efficiency through slot-based parallel decoding, far surpassing prior masked diffusion models and closing the performance gap to autoregressive models at a lower inference cost.
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative "plan-and-infill" decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs, with 34% performance gains and an over 18× speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33× average speedup.
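To make the decoding loop described in the abstract concrete, here is a minimal Python sketch of a generic plan-and-infill procedure. It is illustrative only: the names `plan_and_infill`, `plan_fn`, `infill_fn`, `MASK`, and `slots_per_step` are hypothetical stand-ins for the paper's diffusion-based planner and autoregressive infiller, and a real implementation would batch the slot decodes and reuse the KV cache of the shared context rather than loop over them.

```python
from typing import Callable, List, Optional

MASK = None  # placeholder for a masked token id


def plan_and_infill(
    prompt: List[int],
    num_slots: int,
    slot_len: int,
    plan_fn: Callable[[List[Optional[int]], List[int]], List[float]],
    infill_fn: Callable[[List[Optional[int]], int, int], List[int]],
    slots_per_step: int = 2,
) -> List[int]:
    """Iteratively (1) score the still-masked slots and pick a few to decode,
    then (2) fill those slots; repeat until no masked slots remain."""
    # The response region is a grid of `num_slots` fixed-length slots, all masked.
    seq: List[Optional[int]] = list(prompt) + [MASK] * (num_slots * slot_len)
    remaining = set(range(num_slots))

    while remaining:
        # Plan: score the masked slots given the partially filled sequence.
        # In the paper this role is played by a diffusion-based planner that
        # selects weakly dependent slots; here plan_fn is any user-supplied scorer.
        slots = sorted(remaining)
        scores = plan_fn(seq, slots)
        ranked = sorted(zip(slots, scores), key=lambda pair: -pair[1])
        chosen = [slot for slot, _ in ranked[:slots_per_step]]

        # Infill: decode each chosen slot left-to-right inside the slot.
        # The chosen slots are treated as independent, so a real system would
        # run these decodes in parallel with a shared KV cache.
        for slot in chosen:
            start = len(prompt) + slot * slot_len
            seq[start:start + slot_len] = infill_fn(seq, start, slot_len)
            remaining.discard(slot)

    return [tok for tok in seq if tok is not None]


# Toy usage with dummy planner/infiller (no real model involved):
output = plan_and_infill(
    prompt=[1, 2, 3],
    num_slots=4,
    slot_len=2,
    plan_fn=lambda seq, slots: [float(-s) for s in slots],  # prefer earlier slots
    infill_fn=lambda seq, start, n: [0] * n,                # emit dummy token ids
)
print(output)  # [1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0]
```

The sketch only captures the control flow: the planner decides which slots are safe to decode together, and the infiller fills each chosen slot autoregressively within its fixed length.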
Community
ReFusion is a masked diffusion model that achieves superior performance and efficiency, featuring full KV cache reuse while simultaneously supporting any-order generation.
ReFusion is a really elegant middle ground between ARMs and masked diffusion. Moving parallelism from tokens to slots neatly sidesteps two of the biggest MDM pain points: KV-cache incompatibility and incoherent dependency learning. The plan-and-infill split feels especially powerful — diffusion for global structure, AR for local precision. This looks less like an incremental speedup and more like a new decoding primitive for long-form generation. Excited to see how this scales to reasoning-heavy and multimodal settings.
Thanks for the kind words and the great summary of our work! We're excited about those directions too and plan to explore long reasoning tasks in our future work.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CDLM: Consistency Diffusion Language Models For Faster Sampling (2025)
- From Bits to Rounds: Parallel Decoding with Exploration for Diffusion Language Models (2025)
- From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs (2025)
- Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model (2025)
- Planned Diffusion (2025)
- Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models (2025)
- Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models (2025)
Models citing this paper 1
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper