Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
Abstract
AutoGaze is a lightweight module that reduces redundant video patches before processing by vision transformers or multi-modal large language models, enabling efficient processing of long, high-resolution videos while maintaining performance.
Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos: they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4×–100× and accelerates ViTs and MLLMs by up to 19×, enabling MLLMs to scale to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
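The paper does not release the module itself here, but the core idea (autoregressively selecting a minimal set of patches until the video can be reconstructed within a user-specified error threshold) can be illustrated with a minimal, hypothetical sketch. The function name, greedy largest-error criterion, and mean-baseline reconstruction below are assumptions for illustration; the actual AutoGaze module is learned with next-token prediction and reinforcement learning rather than this hand-coded heuristic.

```python
import numpy as np

def select_patches(frame, patch=4, err_threshold=0.05):
    """Hypothetical greedy stand-in for AutoGaze's learned selector:
    autoregressively pick patches until the reconstruction's mean
    squared error falls below err_threshold."""
    h, w = frame.shape
    # Start from a coarse reconstruction: every pixel set to the frame mean.
    recon = np.full_like(frame, frame.mean())
    coords = [(i, j) for i in range(0, h, patch) for j in range(0, w, patch)]
    chosen = []
    while np.mean((frame - recon) ** 2) > err_threshold and len(chosen) < len(coords):
        # Greedy proxy: keep the patch with the largest current error.
        errs = [np.mean((frame[i:i+patch, j:j+patch]
                         - recon[i:i+patch, j:j+patch]) ** 2)
                for (i, j) in coords]
        i, j = coords[int(np.argmax(errs))]
        recon[i:i+patch, j:j+patch] = frame[i:i+patch, j:j+patch]
        chosen.append((i, j))
    return chosen, recon

# On a mostly-uniform frame with one salient region, only that
# region's patch needs to be kept -- the redundancy is dropped.
frame = np.zeros((8, 8))
frame[0:4, 0:4] = 1.0
chosen, recon = select_patches(frame, patch=4, err_threshold=0.05)
```

The sketch conveys why token counts can shrink so sharply: patches in redundant regions are already explained by the coarse reconstruction, so only informative patches are emitted downstream.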
Community
AutoGaze is a lightweight module that removes redundant video patches before they are processed downstream by a ViT or an MLLM.
Empirically, AutoGaze achieves 4×–100× token reduction and up to 19× speedup in ViTs and MLLMs, enabling MLLMs to scale to 1K-frame 4K-resolution videos.
We also introduce HLVid: the first high-resolution, long-form video QA benchmark. On HLVid, an MLLM scaled with AutoGaze outperforms the baseline MLLM by 10.1%.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CoPE-VideoLM: Codec Primitives For Efficient Video Language Models (2026)
- Unified Spatio-Temporal Token Scoring for Efficient Video VLMs (2026)
- TrajTok: Learning Trajectory Tokens enables better Video Understanding (2026)
- ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding (2026)
- Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models (2026)
- Adaptive 1D Video Diffusion Autoencoder (2026)
- EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation (2026)
Get this paper in your agent:
hf papers read 2603.12254
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash