SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Paper: arXiv 2506.01844
This model is a fine-tuned version of lerobot/smolvla_base on the lerobot/pusht dataset.
Since smolvla_base was trained with 3 cameras and 6-dimensional actions, but PushT uses 1 camera and 2-dimensional actions, special options are required for both training and evaluation.
| Option | Description |
|---|---|
| `--policy.empty_cameras=2` | Fill the missing camera2 and camera3 inputs with dummy images |
| `--rename_map='{"observation.image": "observation.images.camera1"}'` | Map PushT's single camera to the key SmolVLA expects |
| `--policy.output_features='{"action": {"type": "ACTION", "shape": [2]}}'` | Required for evaluation: override the action dimension from 6 to 2 |
Evaluation command:

```bash
uv run lerobot-eval \
  --policy.path=naonaon/smolvla_pusht \
  --env.type=pusht \
  --eval.n_episodes=50 \
  --eval.batch_size=50 \
  --policy.empty_cameras=2 \
  --policy.output_features='{"action": {"type": "ACTION", "shape": [2]}}' \
  --rename_map='{"observation.image": "observation.images.camera1"}'
```
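The `--policy.output_features` flag replaces the action feature spec that `smolvla_base` was trained with. A hedged Python sketch of what that override amounts to (the dict names are illustrative, not lerobot internals; only the JSON value comes from the flag above):

```python
import json

# Feature spec smolvla_base was trained with: 6-dimensional actions
base_features = {"action": {"type": "ACTION", "shape": [6]}}

# The CLI flag carries a JSON value that replaces the matching entry
override = json.loads('{"action": {"type": "ACTION", "shape": [2]}}')
features = {**base_features, **override}

print(features["action"]["shape"])  # → [2]
```

Without this override at evaluation time, the policy would still report 6-dimensional actions while the PushT environment only accepts 2.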
Training command:

```bash
TOKENIZERS_PARALLELISM=false uv run lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/pusht \
  --dataset.image_transforms.enable=false \
  --batch_size=24 \
  --steps=40000 \
  --output_dir=outputs/pusht_smolvla_finetune \
  --job_name=pusht_smolvla_finetune \
  --policy.empty_cameras=2 \
  --rename_map='{"observation.image": "observation.images.camera1"}'
```
| Step | Loss | Gradient Norm |
|---|---|---|
| 200 | 0.060 | 0.685 |
| 5,000 | 0.021 | 0.266 |
| 20,000 | 0.009 | 0.131 |
| 40,000 | 0.008 | 0.091 |
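The loss values in the table above can be summarized as a convergence trend; a quick sketch using only the reported numbers:

```python
# Loss values copied from the training table above
steps = [200, 5_000, 20_000, 40_000]
losses = [0.060, 0.021, 0.009, 0.008]

# Express each loss relative to the first logged value
for s, l in zip(steps, losses):
    print(f"step {s:>6}: loss {l:.3f} ({l / losses[0]:.0%} of the step-200 value)")
```

Most of the improvement happens in the first 20,000 steps; the last 20,000 steps only reduce the loss from 0.009 to 0.008.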
Note that the `--policy.output_features` override is required at evaluation time.

If you use this model, please cite the original SmolVLA paper:
```bibtex
@article{smolvla2024,
  title={SmolVLA: A Small Vision-Language-Action Model for Efficient Robot Learning},
  author={Hugging Face Team},
  year={2024}
}
```
Base model: lerobot/smolvla_base