From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence Paper • 2511.18538 • Published 14 days ago • 239
MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity Paper • 2511.03146 • Published Nov 5 • 7
RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization Paper • 2511.04285 • Published Nov 6 • 7
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs Paper • 2511.07250 • Published 27 days ago • 17
DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains Paper • 2511.10984 • Published 23 days ago • 4
II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models Paper • 2406.05862 • Published Jun 9, 2024 • 4
RoboChallenge: Large-scale Real-robot Evaluation of Embodied Policies Paper • 2510.17950 • Published Oct 20 • 7
Scaling Latent Reasoning via Looped Language Models Paper • 2510.25741 • Published Oct 29 • 219
VideoScore2: Think before You Score in Generative Video Evaluation Paper • 2509.22799 • Published Sep 26 • 25
Towards Personalized Deep Research: Benchmarks and Evaluations Paper • 2509.25106 • Published Sep 29 • 29
Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution Paper • 2509.25301 • Published Sep 29 • 19