arXiv:2512.13586

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

Published on Dec 15 · Submitted by Jia-Nan Li on Dec 16
#2 Paper of the day

AI-generated summary

ReFusion, a masked diffusion model, improves performance and efficiency through slot-based parallel decoding, decisively surpassing prior masked diffusion models and closing the performance gap to strong autoregressive models while remaining faster than both.

Abstract

Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative "plan-and-infill" decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18× speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33× average speedup.
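
To make the plan-and-infill loop concrete, here is a minimal Python sketch of the decoding procedure the abstract describes. Everything in it is a hypothetical stand-in rather than the paper's actual API: plan_slots and infill_slot are stubs (a random sampler and a placeholder-token generator), whereas the real planner is a trained diffusion model that scores slot dependencies and the real infiller is an autoregressive decoder that reuses the shared KV cache.

```python
# Minimal sketch of the iterative "plan-and-infill" loop described in the
# abstract. All names here (plan_slots, infill_slot, the constants) are
# hypothetical stand-ins, not the paper's actual implementation.

import random

MASK = "<mask>"
SLOT_LEN = 4        # each slot is a fixed-length, contiguous sub-sequence
NUM_SLOTS = 6
SLOTS_PER_STEP = 2  # how many weakly dependent slots to decode per step

def plan_slots(sequence, masked_slots):
    """Planning step (stub). In ReFusion this is a diffusion-based step that
    identifies a set of weakly dependent slots; here we sample uniformly."""
    k = min(SLOTS_PER_STEP, len(masked_slots))
    return random.sample(sorted(masked_slots), k)

def infill_slot(sequence, slot_idx):
    """Infilling step (stub). In ReFusion this is an autoregressive decoder
    that fills one slot left-to-right while reusing the KV cache over the
    already-decoded context; here we emit placeholder tokens."""
    start = slot_idx * SLOT_LEN
    return [f"tok{start + i}" for i in range(SLOT_LEN)]

def refusion_decode():
    # Start from a fully masked sequence of NUM_SLOTS * SLOT_LEN tokens.
    sequence = [MASK] * (NUM_SLOTS * SLOT_LEN)
    masked = set(range(NUM_SLOTS))
    step = 0
    while masked:
        step += 1
        chosen = plan_slots(sequence, masked)         # plan
        for slot_idx in chosen:                       # infill; the chosen
            tokens = infill_slot(sequence, slot_idx)  # slots run in parallel
            start = slot_idx * SLOT_LEN               # in the real model
            sequence[start:start + SLOT_LEN] = tokens
            masked.discard(slot_idx)
        print(f"step {step}: decoded slots {sorted(chosen)}")
    return sequence

if __name__ == "__main__":
    random.seed(0)
    print(" ".join(refusion_decode()))
```

The slot abstraction is visible in the loop: the planner only decides which slots to open, so the model learns an ordering over NUM_SLOTS slots rather than dependencies over every token combination, and the infiller stays causal, which is what permits full KV cache reuse.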

Community

Paper author and submitter:

ReFusion is a masked diffusion model that achieves superior performance and efficiency, featuring full KV cache reuse while simultaneously supporting any-order generation.

ReFusion is a really elegant middle ground between ARMs and masked diffusion. Moving parallelism from tokens to slots neatly sidesteps two of the biggest MDM pain points: KV-cache incompatibility and incoherent dependency learning. The plan-and-infill split feels especially powerful — diffusion for global structure, AR for local precision. This looks less like an incremental speedup and more like a new decoding primitive for long-form generation. Excited to see how this scales to reasoning-heavy and multimodal settings.

Paper author:

Thanks for the kind words and the great summary of our work! We're excited about those directions too and plan to explore long reasoning tasks in our future work.

Models citing this paper: 1

Collections including this paper: 5