DinoFlow: DINOv3 ViT-S/16 + correlation-augmented SDT optical-flow head

A compact optical-flow decoder on a frozen DINOv3 ViT-S/16 backbone, generalizing the AnyDepth SDT recipe (arXiv:2601.02760) to two-frame flow. Only the small decoder is trained; the DINOv3 encoder is frozen and run on the fly on both frames. Trained on the standard FlowNet/RAFT C+T corpus (FlyingChairs → FlyingThings3D) from blanchon/dinoflow-dataset.

Code: https://github.com/julien-blanchon/dinodepth (src/dinov3_dense).

Architecture

The depth SDT trunk, reused verbatim, with a flow front-end:

  1. The frozen DINOv3 backbone runs on both frames (siamese); a shared softmax WeightedFusion collapses each frame's 4 tapped layers into a single feature grid at stride 16 (H/16 × W/16).
  2. A local correlation cost volume (radius 4 on the stride-16 grid → ±64 px, 9 × 9 = 81 neighbors) plus the feature difference between the two grids form the motion signal.
  3. The AnyDepth trunk (SpatialDetailEnhancer → two learned DySample ×4 stages) upsamples back to full resolution, and a final conv emits 2 channels (u, v) instead of single-channel disparity.
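
The motion signal in step 2 can be pictured with a small NumPy sketch. It is not the repository's implementation: the function name local_correlation, the sqrt(C) scaling, the sign of the feature difference, and the channel-wise concatenation are all assumptions made for illustration.

```python
import numpy as np

def local_correlation(f1, f2, radius=4):
    """Correlate each f1 feature with a (2r+1) x (2r+1) window of f2.

    f1, f2: float arrays of shape [C, H, W] (one stride-16 grid per frame).
    Returns an array of shape [(2r+1)**2, H, W].
    """
    c, h, w = f1.shape
    # Zero-pad f2 spatially so every shifted window stays in bounds.
    f2p = np.pad(f2, ((0, 0), (radius, radius), (radius, radius)))
    planes = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # shifted[:, y, x] == f2[:, y + dy, x + dx] (zero outside the grid)
            shifted = f2p[:, radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            # Dot product over channels; the sqrt(C) scaling is an assumption.
            planes.append((f1 * shifted).sum(axis=0) / np.sqrt(c))
    return np.stack(planes, axis=0)

rng = np.random.default_rng(0)
f1 = rng.standard_normal((8, 6, 7)).astype(np.float32)
f2 = rng.standard_normal((8, 6, 7)).astype(np.float32)

cost = local_correlation(f1, f2, radius=4)
# Assumed combination: stack the cost volume with the feature difference.
motion = np.concatenate([cost, f2 - f1], axis=0)

reach_px = 4 * 16                 # radius 4 on a stride-16 grid covers 64 px
print(cost.shape, motion.shape)   # (81, 6, 7) (89, 6, 7)
```

The 81 channels follow from the 9 × 9 search window, and the 64 px reach from multiplying the radius by the backbone stride.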

Single forward pass (no RAFT-style iterative refinement). Decoder: 6.88 M parameters.
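
The softmax layer fusion from step 1 can be sketched as a convex combination of the tapped layers. This is a minimal illustration that assumes one learnable scalar logit per layer; the real WeightedFusion module may differ (for example by adding projections), and the names below are hypothetical.

```python
import numpy as np

def weighted_fusion(layers, logits):
    """Fuse tapped layers with softmax weights.

    layers: [L, C, H, W] stack of tapped backbone features.
    logits: [L] per-layer scalars (hypothetical learnable parameters).
    Returns a single [C, H, W] feature grid.
    """
    w = np.exp(logits - logits.max())   # numerically stable softmax
    w = w / w.sum()
    return np.tensordot(w, layers, axes=1)

rng = np.random.default_rng(0)
# 4 tapped layers, ViT-S width 384, 32 x 32 grid for a 512 x 512 input.
layers = rng.standard_normal((4, 384, 32, 32))
logits = np.zeros(4)                    # equal logits give a plain average

fused = weighted_fusion(layers, logits)
print(fused.shape)                      # (384, 32, 32)
```

"Shared" here means the same logits would be applied to the features of both frames, so the two grids stay directly comparable.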

Usage

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from dinov3_dense.head import FlowModel, FlowModelConfig

model = FlowModel.from_pretrained(FlowModelConfig(backbone="vits16"))
model.head.load_state_dict(load_file(hf_hub_download("blanchon/dinoflow-model", "flow-vits16.safetensors")))
model.eval()

# image1, image2: float [B, 3, H, W] in [0, 1], H/W multiples of 16
flow = model(image1, image2)   # [B, 2, H, W] -> (u, v) pixels, frame1 -> frame2
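
Because H and W must be multiples of 16, frames such as Sintel's 1024 × 436 need padding or resizing first. The sketch below shows one common option, replicate padding on the bottom and right edges followed by cropping the predicted flow. The helpers pad_to_multiple and crop_flow are hypothetical and not part of dinov3_dense; the sketch uses NumPy arrays, and the same arithmetic carries over to torch tensors.

```python
import numpy as np

def pad_to_multiple(image, multiple=16):
    """Replicate-pad the bottom/right edges of a [B, 3, H, W] array.

    Returns the padded array and the original (H, W) for later cropping.
    """
    h, w = image.shape[-2:]
    pad_h = (-h) % multiple   # rows needed to reach the next multiple
    pad_w = (-w) % multiple   # columns needed to reach the next multiple
    padded = np.pad(image, ((0, 0), (0, 0), (0, pad_h), (0, pad_w)), mode="edge")
    return padded, (h, w)

def crop_flow(flow, size):
    """Crop a [B, 2, H', W'] flow field back to the original (H, W)."""
    h, w = size
    return flow[..., :h, :w]

frame = np.zeros((1, 3, 436, 1024), dtype=np.float32)  # Sintel-sized frame
padded, size = pad_to_multiple(frame)
print(padded.shape)                                     # (1, 3, 448, 1024)

fake_flow = np.zeros((1, 2) + padded.shape[-2:], dtype=np.float32)
print(crop_flow(fake_flow, size).shape)                 # (1, 2, 436, 1024)
```

Padding only on the bottom and right keeps pixel coordinates unchanged, so the cropped flow needs no offset correction.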

Zero-shot benchmark

Full-split evaluation on Sintel (train, 1041 pairs/pass) and KITTI-2015 (train, 200 pairs) with the standard EPE / Fl-all protocol (no alignment), via anyflow-benchmark. EPE in px, lower is better.

Method (C+T)            Sintel-clean EPE   Sintel-final EPE   KITTI-15 EPE   KITTI-15 Fl-all
RAFT                    1.43               2.71               5.04           17.4%
FlowFormer              1.01               2.40               4.09           14.7%
SEA-RAFT                1.19               4.11               3.62           12.9%
DinoFlow ViT-S (ours)   3.97               5.06               19.79          61.6%

Pixel accuracy (fraction within threshold): Sintel-clean px3 0.81 / px5 0.87; Sintel-final px3 0.77.
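
For reference, the metrics above can be computed as in the following sketch. It uses the standard KITTI outlier rule (error above 3 px and above 5% of the ground-truth magnitude) and assumes strict "less than" thresholds for px3/px5; the function name flow_metrics is hypothetical and this is not the anyflow-benchmark code.

```python
import numpy as np

def flow_metrics(pred, gt, valid=None):
    """EPE, Fl-all and pixel accuracy for one flow field.

    pred, gt: [2, H, W] flow in pixels; valid: optional bool mask [H, W].
    """
    epe = np.linalg.norm(pred - gt, axis=0)   # per-pixel end-point error
    mag = np.linalg.norm(gt, axis=0)          # ground-truth flow magnitude
    if valid is None:
        valid = np.ones(epe.shape, dtype=bool)
    epe_v, mag_v = epe[valid], mag[valid]
    # KITTI outlier: error > 3 px AND error > 5% of the true magnitude.
    outlier = (epe_v > 3.0) & (epe_v > 0.05 * mag_v)
    return {
        "epe": float(epe_v.mean()),
        "fl_all": float(outlier.mean()),
        "px3": float((epe_v < 3.0).mean()),
        "px5": float((epe_v < 5.0).mean()),
    }

gt = np.zeros((2, 1, 4))
gt[0, 0, 3] = 100.0                 # one fast-moving pixel
pred = gt.copy()
pred[0, 0, :] += [0.0, 2.0, 4.0, 4.0]

metrics = flow_metrics(pred, gt)
print(metrics)  # {'epe': 2.5, 'fl_all': 0.25, 'px3': 0.5, 'px5': 1.0}
```

The last pixel has a 4 px error but is not counted as an outlier, since 4 px is below 5% of its 100 px ground-truth motion.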

Honest positioning. This is a deliberately lightweight, single-pass probe (a frozen backbone with a tiny decoder and no iterative refinement), so it lands roughly at FlowNet level, well behind the recurrent-refinement SOTA above. The weak spot is KITTI: its large automotive displacements exceed the ±64 px local-correlation range and its ground truth is sparse LiDAR, a known failure mode for a lightweight correlation head trained only on synthetic C+T data. Sintel (moderate motion, dense ground truth) is far stronger.

Training

  • Frozen DINOv3 ViT-S/16, 4 tapped layers [2, 5, 8, 11], ImageNet-normalized input.
  • 24 epochs on combined C+T at 512 × 512, global batch 48, AdamW lr 4e-4 (poly decay, 2-epoch warmup), masked-L1 end-point loss with a 400 px flow cap, RAFT-style augmentation, bf16 autocast.
  • 4×GH200, ~3 h. See the GitHub repo for the exact config and anyflow-train command.
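
The masked-L1 loss with a flow cap can be sketched as follows. The details are assumptions in the spirit of RAFT's training loss: pixels are dropped when they are invalid or when the ground-truth magnitude reaches the cap, and the L1 error is summed over the two flow components. The exact formulation lives in the repository config; the name masked_l1_flow_loss is hypothetical.

```python
import numpy as np

def masked_l1_flow_loss(pred, gt, valid, max_flow=400.0):
    """Mean L1 flow error over valid pixels below the flow cap.

    pred, gt: [B, 2, H, W] flow in pixels; valid: bool [B, H, W].
    """
    mag = np.linalg.norm(gt, axis=1)        # [B, H, W] ground-truth magnitude
    mask = valid & (mag < max_flow)         # drop invalid and over-cap pixels
    l1 = np.abs(pred - gt).sum(axis=1)      # assumed: sum over (u, v)
    denom = max(int(mask.sum()), 1)         # avoid division by zero
    return float((l1 * mask).sum() / denom)

gt = np.zeros((1, 2, 1, 3))
gt[0, 0, 0, :] = [1.0, 500.0, 2.0]          # middle pixel exceeds the cap
pred = np.zeros_like(gt)
valid = np.array([[[True, True, False]]])   # last pixel has no ground truth

loss = masked_l1_flow_loss(pred, gt, valid)
print(loss)  # 1.0
```

Only the first pixel contributes here: the second is excluded by the 400 px cap and the third by the validity mask.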