---
license: apache-2.0
tags:
- diffusion
- video-generation
- multi-scene
- autoregressive
- transformer
- computer-vision
- cvpr2025
model-index:
- name: Mask²DiT
  results: []
---
# Mask²DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation (CVPR 2025)
_**[Tianhao Qi*](https://tianhao-qi.github.io/), [Jianlong Yuan✝](https://scholar.google.com.tw/citations?user=vYe1uCQAAAAJ&hl=zh-CN), [Wanquan Feng](https://wanquanf.github.io/), [Shancheng Fang✉](https://scholar.google.com/citations?user=8Efply8AAAAJ&hl=zh-CN), [Jiawei Liu](https://scholar.google.com/citations?user=X21Fz-EAAAAJ&hl=en&authuser=1),
[SiYu Zhou](https://openreview.net/profile?id=~SiYu_Zhou3), [Qian He](https://scholar.google.com/citations?view_op=list_works&hl=zh-CN&authuser=1&user=9rWWCgUAAAAJ), [Hongtao Xie](https://imcc.ustc.edu.cn/_upload/tpl/0d/13/3347/template3347/xiehongtao.html), [Yongdong Zhang](https://scholar.google.com.hk/citations?user=hxGs4ukAAAAJ&hl=zh-CN)**_
(*Work done during an internship at ByteDance Intelligent Creation, ✝Project lead, ✉Corresponding author)
From the University of Science and Technology of China, ByteDance Intelligent Creation, and Yuanshi Inc.
## 🔆 Introduction
**TL;DR:** We present **Mask²DiT**, a dual-mask-based diffusion transformer designed for multi-scene long video generation. It supports both **synthesizing a fixed number of scenes** and **autoregressively extending a video with new scenes**, improving the scalability and continuity of long video synthesis.