Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
Abstract
AutoGaze is a lightweight module that reduces redundant video patches before processing by vision transformers or multi-modal large language models, enabling efficient processing of long, high-resolution videos while maintaining performance.
Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos: they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4×–100× and accelerates ViTs and MLLMs by up to 19×, enabling MLLMs to scale to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
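The paper does not release the module itself here, but the core idea (autoregressively selecting a minimal set of patches until the video can be reconstructed within a user-specified error threshold) can be illustrated with a minimal, hypothetical sketch. The function name, greedy largest-error criterion, and mean-baseline reconstruction below are assumptions for illustration; the actual AutoGaze module is learned with next-token prediction and reinforcement learning rather than this hand-coded heuristic.

```python
import numpy as np

def select_patches(frame, patch=4, err_threshold=0.05):
    """Hypothetical greedy stand-in for AutoGaze's learned selector:
    autoregressively pick patches until the reconstruction's mean
    squared error falls below err_threshold."""
    h, w = frame.shape
    # Start from a coarse reconstruction: every pixel set to the frame mean.
    recon = np.full_like(frame, frame.mean())
    coords = [(i, j) for i in range(0, h, patch) for j in range(0, w, patch)]
    chosen = []
    while np.mean((frame - recon) ** 2) > err_threshold and len(chosen) < len(coords):
        # Greedy proxy: keep the patch with the largest current error.
        errs = [np.mean((frame[i:i+patch, j:j+patch]
                         - recon[i:i+patch, j:j+patch]) ** 2)
                for (i, j) in coords]
        i, j = coords[int(np.argmax(errs))]
        recon[i:i+patch, j:j+patch] = frame[i:i+patch, j:j+patch]
        chosen.append((i, j))
    return chosen, recon

# On a mostly-uniform frame with one salient region, only that
# region's patch needs to be kept -- the redundancy is dropped.
frame = np.zeros((8, 8))
frame[0:4, 0:4] = 1.0
chosen, recon = select_patches(frame, patch=4, err_threshold=0.05)
```

The sketch conveys why token counts can shrink so sharply: patches in redundant regions are already explained by the coarse reconstruction, so only informative patches are emitted downstream.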
Community
AutoGaze is a lightweight module that removes redundant video patches before they are processed downstream by a ViT or an MLLM.
Empirically, AutoGaze achieves 4×–100× token reduction and up to 19× speedup in ViTs and MLLMs, enabling MLLMs to scale to 1K-frame 4K-resolution videos.
We also introduce HLVid: the first high-resolution, long-form video QA benchmark. On HLVid, an MLLM scaled with AutoGaze outperforms the baseline MLLM by 10.1%.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CoPE-VideoLM: Codec Primitives For Efficient Video Language Models (2026)
- Unified Spatio-Temporal Token Scoring for Efficient Video VLMs (2026)
- TrajTok: Learning Trajectory Tokens enables better Video Understanding (2026)
- ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding (2026)
- Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models (2026)
- Adaptive 1D Video Diffusion Autoencoder (2026)
- EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation (2026)
Get this paper in your agent:
hf papers read 2603.12254
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash