---
license: apache-2.0
language:
- en
pipeline_tag: robotics
library_name: transformers
tags:
- Motus
- Vision-Language-Action
- World-Model
- Bimanual
- Manipulation
- Flowmatching
- Diffusion
- Latent-Action
- UniDiffuser
- MoT
---

# Motus: A Unified Latent Action World Model (Stage 2 Pretrained)

**Motus** is a **unified latent action world model** that leverages existing pretrained models and rich, shareable motion information. Motus introduces a **Mixture-of-Transformers (MoT)** architecture to integrate three experts (understanding, action, and video generation) and adopts a **UniDiffuser-style scheduler** to enable flexible switching between different modeling modes (World Models, Vision-Language-Action Models, Inverse Dynamics Models, Video Generation Models, and Video-Action Joint Prediction Models). Motus further leverages **optical flow** to learn **latent actions** and adopts a **three-phase training pipeline** and a **six-layer data pyramid**, thereby extracting pixel-level "delta actions" and enabling large-scale action pretraining.

This checkpoint contains the **Stage 2 pretrained** Motus model.

[**Homepage**](https://motus-robotics.github.io/motus) | [**GitHub**](https://github.com/thu-ml/Motus.git) | [**arXiv**](https://arxiv.org/abs/2512.13030) | [**Feishu**](https://motus-robotics.github.io/assets/motus/png/feishu.jpg) | [**WeChat**](https://motus-robotics.github.io/assets/motus/png/wechat.jpg)

---

## Table of Contents

- [Highlights](#highlights)
- [Model Details](#model-details)
- [Hardware & Software Requirements](#hardware--software-requirements)
- [Quickstart (Inference)](#quickstart-inference)
- [Citation](#citation)

---

## Highlights

- **87.02%** average success rate on RoboTwin 2.0 (+15% over X-VLA, +45% over π₀.₅)
- **Unified 5-in-1 Model**: VLA, World Model, IDM, VGM, and Video-Action Joint Prediction
- **Tri-model Joint Attention**: video, action, and understanding experts share attention layers
- **Latent Action Pretraining**: pretrained on optical-flow-derived latent actions

---

## Model Details

### Architecture

| Component | Base Model | Parameters |
|-----------|------------|------------|
| **VGM (Video Generation Model)** | WAN 2.2 | ~5.00B |
| **VLM (Vision-Language Model)** | Qwen3-VL-2B | ~2.13B |
| **Action Expert** | - | ~641.5M |
| **Understanding Expert** | - | ~253.5M |
| **Total** | - | **~8B** |

### Action Representation

- **Control frequency**: 30 Hz (default)
- **Action chunk size**: 48 steps (default)
- **Action dimension**: 14 (bimanual: 7 per arm)

---

## Hardware & Software Requirements

| Mode | VRAM | Recommended GPU |
|------|------|-----------------|
| Inference (with pre-encoded T5) | ~24 GB | RTX 5090 |
| Inference (without pre-encoded T5) | ~41 GB | A100 (40GB) / A100 (80GB) / H100 / B200 |

---

## Quickstart (Inference)

```python
# Run from the Motus repository root
import torch
import yaml
from pathlib import Path

from models.motus import Motus, MotusConfig

# Load config
with open("configs/robotwin.yaml", "r") as f:
    config = yaml.safe_load(f)

# Create model config
model_config = MotusConfig(
    wan_checkpoint_path=config['model']['wan']['checkpoint_path'],
    vae_path=config['model']['wan']['vae_path'],
    wan_config_path=config['model']['wan']['config_path'],
    video_precision=config['model']['wan']['precision'],
    vlm_checkpoint_path=config['model']['vlm']['checkpoint_path'],
    action_dim=config['common']['action_dim'],
    action_state_dim=config['common']['state_dim'],
    num_video_frames=config['common']['num_video_frames'],
    video_height=config['common']['video_height'],
    video_width=config['common']['video_width'],
    load_pretrained_backbones=False,  # backbones are loaded from the Motus checkpoint below
)

# Initialize and load checkpoint
device = "cuda:0"
model = Motus(model_config).to(device).eval()
model.load_checkpoint("./pretrained_models/Motus", strict=False)

# Run inference
# first_frame_tensor, state_tensor, t5_embeddings, and vlm_inputs are
# user-provided inputs (not defined in this snippet)
with torch.no_grad():
    predicted_frames, predicted_actions = model.inference_step(
        first_frame=first_frame_tensor,      # [1, C, H, W]
        state=state_tensor,                  # [1, state_dim]
        num_inference_steps=20,
        language_embeddings=t5_embeddings,
        vlm_inputs=[vlm_inputs],
    )

# predicted_actions: [1, action_chunk_size, action_dim]
action_chunk = predicted_actions.squeeze(0).cpu().numpy()
```
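With the defaults above, `action_chunk` is a `[48, 14]` array: a 48-step chunk of 14-dimensional bimanual actions (7 per arm) intended for 30 Hz control. The sketch below shows one way to step through such a chunk; `robot_env` and its `apply_action` method are hypothetical placeholders for your own controller interface, not part of the Motus API.

```python
import time

import numpy as np

CONTROL_HZ = 30    # default control frequency
CHUNK_SIZE = 48    # default action chunk size
ACTION_DIM = 14    # bimanual: 7 DoF per arm


def execute_chunk(robot_env, action_chunk: np.ndarray, replan_after: int = CHUNK_SIZE) -> None:
    """Execute a [CHUNK_SIZE, ACTION_DIM] chunk on a hypothetical robot_env at 30 Hz."""
    assert action_chunk.shape == (CHUNK_SIZE, ACTION_DIM)
    period = 1.0 / CONTROL_HZ
    for action in action_chunk[:replan_after]:
        left_arm, right_arm = action[:7], action[7:]  # split the 14-D action per arm
        robot_env.apply_action(left_arm, right_arm)   # placeholder controller call
        time.sleep(period)                            # naive fixed-rate loop


# e.g. execute half the chunk, then re-plan with a fresh inference_step call:
# execute_chunk(my_env, action_chunk, replan_after=24)
```

In practice you would typically re-plan before the whole chunk is consumed (receding-horizon execution) and use a proper real-time loop instead of `time.sleep`; the snippet only illustrates the chunk layout and control rate.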
---

## Citation

```bibtex
@misc{bi2025motusunifiedlatentaction,
      title={Motus: A Unified Latent Action World Model},
      author={Hongzhe Bi and Hengkai Tan and Shenghao Xie and Zeyuan Wang and Shuhe Huang and Haitian Liu and Ruowen Zhao and Yao Feng and Chendong Xiang and Yinze Rong and Hongyan Zhao and Hanyu Liu and Zhizhong Su and Lei Ma and Hang Su and Jun Zhu},
      year={2025},
      eprint={2512.13030},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.13030},
}
```