Papers
arxiv:2603.12254

Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

Published on Mar 12
Submitted by Baifeng Shi on Mar 25
Abstract

AutoGaze is a lightweight module that reduces redundant video patches before processing by vision transformers or multi-modal large language models, enabling efficient processing of long, high-resolution videos while maintaining performance.

AI-generated summary

Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4-100× and accelerates ViTs and MLLMs by up to 19×, enabling MLLMs to scale to 1K-frame, 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
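The core idea -- autoregressively selecting a minimal set of patches that reconstructs the video within an error threshold -- can be illustrated with a toy greedy sketch. This is purely illustrative: the paper's actual module is learned with next-token prediction and reinforcement learning over multi-scale patches, while the function below is an invented, hand-coded stand-in that starts from a coarse per-patch-mean reconstruction and greedily restores the highest-error patches until a global MSE threshold is met.

```python
import numpy as np

def autoregressive_select(video, patch=4, err_threshold=1e-3):
    """Toy greedy patch selection (illustrative only, not the paper's method).

    video: (T, H, W) grayscale array in [0, 1].
    Returns the list of selected (t, y, x) patch coordinates and the
    reconstruction built from coarse means plus the selected patches.
    """
    T, H, W = video.shape
    recon = video.copy()
    errors = {}
    coords = []
    # Coarse baseline: replace every patch by its mean and record its
    # reconstruction error (variance within the patch).
    for t in range(T):
        for y in range(0, H, patch):
            for x in range(0, W, patch):
                block = video[t, y:y + patch, x:x + patch]
                recon[t, y:y + patch, x:x + patch] = block.mean()
                errors[(t, y, x)] = ((block - block.mean()) ** 2).mean()
                coords.append((t, y, x))
    # Greedily restore the highest-error patches at full resolution
    # until the global reconstruction MSE drops below the threshold.
    selected = []
    for c in sorted(coords, key=lambda c: errors[c], reverse=True):
        if ((video - recon) ** 2).mean() <= err_threshold:
            break
        t, y, x = c
        recon[t, y:y + patch, x:x + patch] = video[t, y:y + patch, x:x + patch]
        selected.append(c)
    return selected, recon
```

On redundant video (static backgrounds, repeated frames) most patches are well-approximated by their mean, so only a small fraction need to be kept -- the same intuition behind the 4-100× token reduction reported in the abstract.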

Community

Paper author Paper submitter

AutoGaze is a lightweight module that removes redundant video patches before they are processed downstream by a ViT or MLLM.

Empirically, AutoGaze achieves 4-100× token reduction and up to 19× speedup in ViTs and MLLMs, enabling MLLMs to scale to 1K-frame, 4K-resolution videos.

We also introduce HLVid: the first high-resolution, long-form video QA benchmark. On HLVid, an MLLM scaled with AutoGaze outperforms the baseline MLLM by 10.1%.



Get this paper in your agent:

hf papers read 2603.12254
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
