Introduction

You can find the history behind this work in this blog post: https://www.gradients.zone/blog/a-super-small-vision-language-model/

Datasets

  • "localized_narratives" part from the_cauldron (200k items)
  • private dataset (30k items)
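
As a quick way to inspect the public portion of the training data, the sketch below loads the localized_narratives subset with the Hugging Face datasets library. The dataset id "HuggingFaceM4/the_cauldron", the config name, and the column layout are assumptions based on the_cauldron's Hub page; the private 30k-item dataset is not downloadable.

from datasets import load_dataset

# Load the public "localized_narratives" subset of the_cauldron.
# Dataset id and config name are assumed from the Hub; adjust if they differ.
narratives = load_dataset("HuggingFaceM4/the_cauldron", "localized_narratives", split="train")

print(narratives)            # number of rows and column names
print(narratives[0].keys())  # inspect one example (images plus conversation turns)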

nanoVLM is a minimal and lightweight Vision-Language Model (VLM) designed for efficient training and experimentation. Built in pure PyTorch, the entire model architecture and training logic fit within ~750 lines of code. It combines a ViT-based image encoder (SigLIP-B/16-224-85M) with a lightweight causal language model (SmolLM2-135M), resulting in a compact 222M-parameter model.

For more information, check out the base model at https://huggingface.co/lusxvr/nanoVLM-222M.

Usage:

Clone the nanoVLM repository (https://github.com/huggingface/nanoVLM), follow its install instructions, and run the following code:

from models.vision_language_model import VisionLanguageModel

model = VisionLanguageModel.from_pretrained("sbrzz/nanoVLM")
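
The snippet above only loads the checkpoint. Below is a minimal generation sketch adapted from the generate.py script in the nanoVLM repository; the helper names (get_tokenizer, get_image_processor), the config attributes, the prompt template, and the model.generate signature are assumptions that may differ between repository versions, so check generate.py for the exact current API.

import torch
from PIL import Image

from models.vision_language_model import VisionLanguageModel
from data.processors import get_tokenizer, get_image_processor  # assumed helpers from the nanoVLM repo

device = "cuda" if torch.cuda.is_available() else "cpu"

model = VisionLanguageModel.from_pretrained("sbrzz/nanoVLM").to(device).eval()

# Tokenizer and image processor are built from the model config (assumed attribute names).
tokenizer = get_tokenizer(model.cfg.lm_tokenizer)
image_processor = get_image_processor(model.cfg.vit_img_size)

prompt = "Question: What is in this image? Answer:"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)

image = Image.open("example.jpg").convert("RGB")
pixel_values = image_processor(image).unsqueeze(0).to(device)  # add a batch dimension

# Assumed generate signature: (input_ids, image_tensor, max_new_tokens=...)
generated = model.generate(input_ids, pixel_values, max_new_tokens=50)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])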