SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models Paper • 2603.16859 • Published 3 days ago • 102
Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey Paper • 2603.04445 • Published 24 days ago • 4
Thoth: Mid-Training Bridges LLMs to Time Series Understanding Paper • 2603.01042 • Published 19 days ago
AfriNLLB: Efficient Translation Models for African Languages Paper • 2602.09373 • Published Feb 10 • 2
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models Paper • 2602.02185 • Published Feb 2 • 117
Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models Paper • 2601.22060 • Published Jan 29 • 155
Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models Paper • 2601.19834 • Published Jan 27 • 25
Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound Paper • 2512.00883 • Published Nov 30, 2025
iFSQ: Improving FSQ for Image Generation with 1 Line of Code Paper • 2601.17124 • Published Jan 23 • 33
UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision Paper • 2601.03193 • Published Jan 6 • 49
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World Paper • 2506.24102 • Published Jun 30, 2025
One Flight Over the Gap: A Survey from Perspective to Panoramic Vision Paper • 2509.04444 • Published Sep 4, 2025
VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models Paper • 2508.12081 • Published Aug 16, 2025
DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training Paper • 2510.11712 • Published Oct 13, 2025 • 31
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs Paper • 2510.18876 • Published Oct 21, 2025 • 37