Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Abstract
A Spatial-Aware VLA Pretraining paradigm improves robots' 3D spatial understanding by aligning 2D visual inputs with 3D actions through a dual-encoder architecture that incorporates a 3D visual encoder.
Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, creating a significant gap between perception and action grounding. To bridge this gap, we propose a Spatial-Aware VLA Pretraining paradigm that performs explicit alignment between visual space and physical space during pretraining, enabling models to acquire 3D spatial understanding before robot policy learning. Starting from pretrained vision-language models, we leverage large-scale human demonstration videos to extract 3D visual and 3D action annotations, forming a new source of supervision that aligns 2D visual observations with 3D spatial reasoning. We instantiate this paradigm with VIPA-VLA, a dual-encoder architecture that incorporates a 3D visual encoder to augment semantic visual representations with 3D-aware features. When adapted to downstream robot tasks, VIPA-VLA achieves significantly improved grounding between 2D vision and 3D action, resulting in more robust and generalizable robotic policies.
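To make the dual-encoder idea concrete, the sketch below shows one plausible way to fuse tokens from a 2D semantic visual encoder with tokens from a 3D-aware encoder before they reach the VLA backbone. All module names, dimensions, and the gated-fusion scheme are illustrative assumptions under this reading of the abstract, not the paper's actual implementation.

```python
# Hypothetical sketch of the dual-encoder fusion described in the abstract:
# a 2D semantic visual encoder and a 3D-aware visual encoder each produce
# token features, which are blended before being passed to the VLA backbone.
# Dimensions and the gating scheme are illustrative assumptions.
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    def __init__(self, dim_2d=768, dim_3d=512, dim_out=1024):
        super().__init__()
        # Project each encoder's tokens into a shared feature space.
        self.proj_2d = nn.Linear(dim_2d, dim_out)
        self.proj_3d = nn.Linear(dim_3d, dim_out)
        # Learned gate deciding how much 3D-aware signal to mix in per token.
        self.gate = nn.Sequential(nn.Linear(2 * dim_out, dim_out), nn.Sigmoid())

    def forward(self, tokens_2d, tokens_3d):
        # tokens_2d: (B, N, dim_2d) from the semantic visual encoder
        # tokens_3d: (B, N, dim_3d) from the 3D visual encoder
        f2d = self.proj_2d(tokens_2d)
        f3d = self.proj_3d(tokens_3d)
        g = self.gate(torch.cat([f2d, f3d], dim=-1))
        # Gated residual fusion: augment semantic tokens with 3D-aware features.
        return f2d + g * f3d

# Usage with dummy tensors standing in for encoder outputs.
fusion = DualEncoderFusion()
fused = fusion(torch.randn(2, 196, 768), torch.randn(2, 196, 512))
print(fused.shape)  # torch.Size([2, 196, 1024])
```

The gated residual here is only one reasonable fusion choice; cross-attention or simple concatenation would also fit the "augment semantic representations with 3D-aware features" description.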
Community
We propose VIPA-VLA, which learns 2D-to-3D visual-physical grounding from human videos via Spatial-Aware VLA Pretraining, enabling robot policies with stronger spatial understanding and generalization.
Website: https://beingbeyond.github.io/VIPA-VLA
arXiv: https://arxiv.org/abs/2512.13080
The following similar papers were recommended by the Semantic Scholar API:
- LatBot: Distilling Universal Latent Actions for Vision-Language-Action Models (2025)
- GLaD: Geometric Latent Distillation for Vision-Language-Action Models (2025)
- From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors (2025)
- Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation (2025)
- PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention (2025)
- VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation (2025)
- METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model (2025)