RemoteSAM: Towards Segment Anything for Earth Observation Paper • 2505.18022 • Published May 23, 2025 • 1
TV2TV: A Unified Framework for Interleaved Language and Video Generation Paper • 2512.05103 • Published Dec 4, 2025 • 18
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language Paper • 2512.10942 • Published Dec 11, 2025 • 45
WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning Paper • 2506.04363 • Published Jun 4, 2025 • 1
Planning with Reasoning using Vision Language World Model Paper • 2509.02722 • Published Sep 2, 2025 • 23
Few-shot Adaptation of Multi-modal Foundation Models: A Survey Paper • 2401.01736 • Published Jan 3, 2024
High-Dimension Human Value Representation in Large Language Models Paper • 2404.07900 • Published Apr 11, 2024 • 1
VirtualConductor: Music-driven Conducting Video Generation System Paper • 2108.04350 • Published Jul 28, 2021
Taming Diffusion Models for Music-driven Conducting Motion Generation Paper • 2306.10065 • Published Jun 15, 2023
ProtoCLIP: Prototypical Contrastive Language Image Pretraining Paper • 2206.10996 • Published Jun 22, 2022
Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model Paper • 2309.11000 • Published Sep 20, 2023 • 2
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing Paper • 2306.11029 • Published Jun 19, 2023 • 2