ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding
Paper: arXiv:2601.10323
Figure: ROMA processes streaming inputs as aligned multimodal units, using a 'Speak Head' to decide when to respond.
ROMA is a Real-time Omni-Multimodal Assistant designed for unified streaming audio-video understanding. Unlike traditional video LLMs, which only respond after receiving an explicit query, ROMA integrates both reactive (question answering) and proactive (event-driven alerts, real-time narration) capabilities within a single framework.
ROMA introduces a "Speak Head" mechanism that decouples response timing from content generation, allowing the model to autonomously decide when to speak based on the continuous audio-visual stream.
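To illustrate the idea of a timing head gating a streaming backbone, here is a minimal sketch. All names, shapes, and the thresholding rule are illustrative assumptions, not the released ROMA implementation.

```python
# Sketch of a speak-head gate over a streaming multimodal backbone.
# `backbone`, `SpeakHead`, and the (unit, state) interface are hypothetical.
import torch
import torch.nn as nn


class SpeakHead(nn.Module):
    """Lightweight classifier over the latest hidden state: speak now or keep listening."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 2)  # logits for [wait, speak]

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)


def streaming_step(backbone, speak_head, unit, state, speak_threshold=0.5):
    """Consume one time-aligned audio-video unit and decide whether to respond.

    `backbone` is any causal multimodal model returning (hidden_state, new_state);
    `unit` is the fused audio-video embedding for the current time step.
    """
    hidden, state = backbone(unit, state)                 # update streaming context
    speak_prob = speak_head(hidden).softmax(-1)[..., 1]   # P(speak | stream so far)
    should_speak = speak_prob.item() > speak_threshold
    return should_speak, state
```

In such a setup, when `should_speak` is true the same streaming state is handed to the decoder to produce the answer, alert, or narration, so response timing stays decoupled from content generation.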
If you find this project useful, please cite:
@article{tian2026roma,
  title={ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding},
  author={Tian, Xueyun and Li, Wei and Xu, Bingbing and Dong, Heng and Wang, Yuanzhuo and Shen, Huawei},
  journal={arXiv preprint arXiv:2601.10323},
  year={2026}
}