SO-Bench: A Structural Output Evaluation of Multimodal LLMs Paper • 2511.21750 • Published Nov 23, 2025 • 6
NarrativeTrack: Evaluating Video Language Models Beyond the Frame Paper • 2601.01095 • Published 15 days ago • 6
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory Paper • 2410.10813 • Published Oct 14, 2024 • 14
LEOPARD : A Vision Language Model For Text-Rich Multi-Image Tasks Paper • 2410.01744 • Published Oct 2, 2024 • 26
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? Paper • 2409.07703 • Published Sep 12, 2024 • 67
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22, 2024 • 133
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models Paper • 2401.13919 • Published Jan 25, 2024 • 32
LASER: LLM Agent with State-Space Exploration for Web Navigation Paper • 2309.08172 • Published Sep 15, 2023 • 13