KFFocus: Highlighting Keyframes for Enhanced Video Understanding Paper • 2508.08989 • Published Aug 12
B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens Paper • 2412.09919 • Published Dec 13, 2024
GUI Action Narrator: Where and When Did That Action Take Place? Paper • 2406.13719 • Published Jun 19, 2024
Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation Paper • 2503.03492 • Published Mar 5