SmolVLA fine-tuned on PushT

This model is a fine-tuned version of lerobot/smolvla_base on the lerobot/pusht dataset.

Model Description

Important Usage Notes

smolvla_base was trained with 3 cameras and 6-dimensional actions, whereas PushT provides 1 camera and 2-dimensional actions. The options below bridge that mismatch and are required for both training and evaluation, except where noted.

Required Options

Option                                                                   Description
--policy.empty_cameras=2                                                 Fill missing camera2 and camera3 with dummy images
--rename_map='{"observation.image": "observation.images.camera1"}'       Map PushT's camera to SmolVLA's expected format
--policy.output_features='{"action": {"type": "ACTION", "shape": [2]}}'  Required for evaluation: Override action dimension from 6 to 2
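Because two of these options carry JSON values, shell quoting is the most likely place for a typo. The following is a minimal sketch, not part of the original commands, that defines the values once and prints the resulting arguments so the quoting can be checked before launching anything. The variable names RENAME_MAP and OUTPUT_FEATURES are hypothetical and chosen only for illustration.

```shell
# Hypothetical variable names; the JSON values are copied from the table above.
# Single quotes keep the inner double quotes intact when the shell parses them.
RENAME_MAP='{"observation.image": "observation.images.camera1"}'
OUTPUT_FEATURES='{"action": {"type": "ACTION", "shape": [2]}}'

# Collect the three options as positional parameters (POSIX sh, no arrays needed).
set -- \
    --policy.empty_cameras=2 \
    "--rename_map=$RENAME_MAP" \
    "--policy.output_features=$OUTPUT_FEATURES"

# Print one argument per line: each JSON value should appear as a single argument.
printf '%s\n' "$@"
```

If the printed lines look right, the same "$@" could be appended to the evaluation command in place of the three literal options. For training, the --policy.output_features entry would be left out, as in the training command in this card.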

Usage

Evaluation

uv run lerobot-eval \
    --policy.path=naonaon/smolvla_pusht \
    --env.type=pusht \
    --eval.n_episodes=50 \
    --eval.batch_size=50 \
    --policy.empty_cameras=2 \
    --policy.output_features='{"action": {"type": "ACTION", "shape": [2]}}' \
    --rename_map='{"observation.image": "observation.images.camera1"}'

Training (to reproduce)

TOKENIZERS_PARALLELISM=false uv run lerobot-train \
    --policy.path=lerobot/smolvla_base \
    --dataset.repo_id=lerobot/pusht \
    --dataset.image_transforms.enable=false \
    --batch_size=24 \
    --steps=40000 \
    --output_dir=outputs/pusht_smolvla_finetune \
    --job_name=pusht_smolvla_finetune \
    --policy.empty_cameras=2 \
    --rename_map='{"observation.image": "observation.images.camera1"}'

Training Details

Hardware

  • GPU: NVIDIA RTX 5070
  • Training time: about 7 hours for 40K steps (0.64 sec/step)
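As a quick consistency check on the figures above, the total time can be recomputed from the step count and the per-step time. This is only an illustrative sketch; the variable names are hypothetical, and awk is used because POSIX sh has no floating-point arithmetic.

```shell
# Values taken from the hardware notes above.
STEPS=40000
SEC_PER_STEP=0.64

# Total seconds divided by 3600 gives hours, rounded to one decimal place.
HOURS=$(awk -v n="$STEPS" -v s="$SEC_PER_STEP" 'BEGIN { printf "%.1f", n * s / 3600 }')
echo "Estimated training time: ${HOURS} hours"
```

The same calculation can be reused to estimate wall-clock time for a different --steps value, assuming the per-step time stays roughly constant on the same GPU and batch size.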

Training Curve

Step    Loss   Gradient Norm
200     0.060  0.685
5,000   0.021  0.266
20,000  0.009  0.131
40,000  0.008  0.091
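To put the curve in relative terms, the drop in loss between the first and last logged steps can be computed directly from the table. The sketch below is illustrative only and uses hypothetical variable names; it says nothing about task success rate, which the table does not report.

```shell
# Loss values copied from the table above (steps 200 and 40,000).
LOSS_START=0.060
LOSS_END=0.008

# Relative reduction in percent, rounded to one decimal place.
LOSS_DROP=$(awk -v a="$LOSS_START" -v b="$LOSS_END" 'BEGIN { printf "%.1f", (a - b) / a * 100 }')
echo "Loss reduction: ${LOSS_DROP}%"
```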

Limitations

  • This model was fine-tuned on a 2D simulation environment (PushT), which differs from the real-robot scenarios SmolVLA was originally designed for
  • The action dimension mismatch requires the --policy.output_features override at evaluation time
  • Performance on PushT may be limited compared to policies designed specifically for this task (e.g., Diffusion Policy)

References

Citation

If you use this model, please cite the original SmolVLA paper:

@article{smolvla2024,
  title={SmolVLA: A Small Vision-Language-Action Model for Efficient Robot Learning},
  author={Hugging Face Team},
  year={2024}
}