SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Paper: arXiv 2506.01844
This model is a fine-tuned version of lerobot/smolvla_base on the lerobot/pusht dataset.
Since smolvla_base was trained with 3 cameras and 6-dimensional actions, but PushT uses 1 camera and 2-dimensional actions, special options are required for both training and evaluation.
| Option | Description |
|---|---|
| `--policy.empty_cameras=2` | Fill the missing camera2 and camera3 inputs with dummy images |
| `--rename_map='{"observation.image": "observation.images.camera1"}'` | Map PushT's single camera to the key SmolVLA expects |
| `--policy.output_features='{"action": {"type": "ACTION", "shape": [2]}}'` | Required for evaluation: override the action dimension from 6 to 2 |
Evaluation command:

```bash
uv run lerobot-eval \
  --policy.path=naonaon/smolvla_pusht \
  --env.type=pusht \
  --eval.n_episodes=50 \
  --eval.batch_size=50 \
  --policy.empty_cameras=2 \
  --policy.output_features='{"action": {"type": "ACTION", "shape": [2]}}' \
  --rename_map='{"observation.image": "observation.images.camera1"}'
```
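The `--policy.output_features` flag replaces the action feature spec that `smolvla_base` was trained with. A hedged Python sketch of what that override amounts to (the dict names are illustrative, not lerobot internals; only the JSON value comes from the flag above):

```python
import json

# Feature spec smolvla_base was trained with: 6-dimensional actions
base_features = {"action": {"type": "ACTION", "shape": [6]}}

# The CLI flag carries a JSON value that replaces the matching entry
override = json.loads('{"action": {"type": "ACTION", "shape": [2]}}')
features = {**base_features, **override}

print(features["action"]["shape"])  # → [2]
```

Without this override at evaluation time, the policy would still report 6-dimensional actions while the PushT environment only accepts 2.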
Training command:

```bash
TOKENIZERS_PARALLELISM=false uv run lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/pusht \
  --dataset.image_transforms.enable=false \
  --batch_size=24 \
  --steps=40000 \
  --output_dir=outputs/pusht_smolvla_finetune \
  --job_name=pusht_smolvla_finetune \
  --policy.empty_cameras=2 \
  --rename_map='{"observation.image": "observation.images.camera1"}'
```
| Step | Loss | Gradient Norm |
|---|---|---|
| 200 | 0.060 | 0.685 |
| 5,000 | 0.021 | 0.266 |
| 20,000 | 0.009 | 0.131 |
| 40,000 | 0.008 | 0.091 |
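The loss values in the table above can be summarized as a convergence trend; a quick sketch using only the reported numbers:

```python
# Loss values copied from the training table above
steps = [200, 5_000, 20_000, 40_000]
losses = [0.060, 0.021, 0.009, 0.008]

# Express each loss relative to the first logged value
for s, l in zip(steps, losses):
    print(f"step {s:>6}: loss {l:.3f} ({l / losses[0]:.0%} of the step-200 value)")
```

Most of the improvement happens in the first 20,000 steps; the last 20,000 steps only reduce the loss from 0.009 to 0.008.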
Note that the `--policy.output_features` override is required at evaluation time.

If you use this model, please cite the original SmolVLA paper:
```bibtex
@article{smolvla2024,
  title={SmolVLA: A Small Vision-Language-Action Model for Efficient Robot Learning},
  author={Hugging Face Team},
  year={2024}
}
```
Base model: lerobot/smolvla_base