DinoFlow — DINOv3 ViT-S/16 + correlation-augmented SDT optical-flow head

A compact optical-flow decoder on a frozen DINOv3 ViT-S/16 backbone, generalizing the AnyDepth SDT recipe (arXiv:2601.02760) to two-frame flow. Only the small decoder is trained; the DINOv3 encoder is frozen and run on the fly on both frames. Trained on the standard FlowNet/RAFT C+T corpus (FlyingChairs → FlyingThings3D) from blanchon/dinoflow-dataset.

Code: https://github.com/julien-blanchon/dinodepth (src/dinov3_dense).

Architecture

The depth SDT trunk, reused verbatim, with a flow front-end:

The frozen DINOv3 backbone runs on both frames (siamese); a shared softmax WeightedFusion collapses each frame's 4 tapped layers into a single feature grid at stride 16 (H/16 × W/16 cells).
A local correlation cost volume (radius 4 cells at stride 16, i.e. ±64 px; (2·4+1)² = 81 neighbors) plus the feature difference between the two grids form the motion signal.
The AnyDepth trunk — SpatialDetailEnhancer → two learned DySample ×4 stages — upsamples back to full resolution, and a final conv emits 2 channels (u, v) instead of single-channel disparity.
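
As a rough illustration of the first step in the list above, a softmax-weighted fusion of the tapped layers might look like the sketch below. The class name WeightedFusionSketch, the use of one scalar logit per layer, and the 384-channel width (the ViT-S embedding size) are assumptions made for illustration; the actual WeightedFusion module is in the repository and may differ.

```python
import torch
import torch.nn as nn


class WeightedFusionSketch(nn.Module):
    """Softmax-weighted sum of L same-shaped feature maps (hypothetical stand-in)."""

    def __init__(self, num_layers: int = 4):
        super().__init__()
        # One learnable logit per tapped layer; zeros give a uniform average at init.
        self.logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, feats):
        weights = torch.softmax(self.logits, dim=0)   # [L], sums to 1
        stacked = torch.stack(feats, dim=0)           # [L, B, C, H, W]
        return (weights.view(-1, 1, 1, 1, 1) * stacked).sum(dim=0)


fusion = WeightedFusionSketch(num_layers=4)
# Stand-ins for the 4 tapped layers of one 512x512 frame (32x32 cells at stride 16).
taps = [torch.randn(2, 384, 32, 32) for _ in range(4)]
fused = fusion(taps)  # [2, 384, 32, 32]
print(fused.shape)
```

Because the module is shared, the same instance would be applied to the tapped features of both frames.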

Single forward pass (no RAFT-style iterative refinement). Decoder: 6.88 M parameters.
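
The local correlation step in this single-pass design can be sketched as follows. This is a minimal version under stated assumptions, not the repository's implementation: the function name local_correlation, the 1/sqrt(C) scaling, zero padding at the borders, and the sign of the feature difference (frame 2 minus frame 1) are all illustrative choices.

```python
import torch
import torch.nn.functional as F


def local_correlation(feat1, feat2, radius=4):
    """Correlate each feat1 cell with its (2r+1)^2 neighbours in feat2.

    feat1, feat2: [B, C, H, W] feature grids.
    Returns [B, (2r+1)^2, H, W], one channel per candidate displacement.
    """
    b, c, h, w = feat1.shape
    k = 2 * radius + 1
    # Zero-pad so every cell has a full window, then gather the k*k neighbours.
    patches = F.unfold(feat2, kernel_size=k, padding=radius)  # [B, C*k*k, H*W]
    patches = patches.reshape(b, c, k * k, h, w)
    # Dot product over channels; the sqrt(C) scaling is an assumed normalisation.
    corr = torch.einsum("bchw,bckhw->bkhw", feat1, patches)
    return corr / c ** 0.5


f1 = torch.randn(1, 384, 32, 32)  # stand-in for the fused grid of frame 1
f2 = torch.randn(1, 384, 32, 32)  # stand-in for the fused grid of frame 2
corr = local_correlation(f1, f2, radius=4)   # 81 displacement channels
motion = torch.cat([corr, f2 - f1], dim=1)   # cost volume + feature difference
print(corr.shape, motion.shape)
```

With radius 4 on a stride-16 grid, the window spans 9 × 9 cells, so any true displacement beyond about 64 px falls outside the volume.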

Usage

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from dinov3_dense.head import FlowModel, FlowModelConfig

# Build the model, then load the trained decoder (head) weights from the Hub.
model = FlowModel.from_pretrained(FlowModelConfig(backbone="vits16"))
weights = hf_hub_download("blanchon/dinoflow-model", "flow-vits16.safetensors")
model.head.load_state_dict(load_file(weights))
model.eval()

# image1, image2: float [B, 3, H, W] in [0, 1], H/W multiples of 16
with torch.no_grad():  # inference only, no gradients needed
    flow = model(image1, image2)   # [B, 2, H, W] -> (u, v) pixels, frame1 -> frame2
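
Since H and W must be multiples of 16, frames such as Sintel's 436 × 1024 need padding before inference. One possible approach, sketched here with a hypothetical helper pad_to_multiple (not part of dinov3_dense), is to replicate-pad on the bottom and right and crop the flow back afterwards. Padding only those two sides leaves pixel coordinates unchanged, so the cropped flow needs no offset correction.

```python
import torch
import torch.nn.functional as F


def pad_to_multiple(img, multiple=16):
    """Replicate-pad [B, C, H, W] on the bottom/right up to the next multiple."""
    h, w = img.shape[-2:]
    pad_h = (-h) % multiple
    pad_w = (-w) % multiple
    # F.pad order for the last two dims: (left, right, top, bottom).
    padded = F.pad(img, (0, pad_w, 0, pad_h), mode="replicate")
    return padded, (h, w)


frame = torch.rand(1, 3, 436, 1024)         # a Sintel-sized frame in [0, 1]
padded, (h, w) = pad_to_multiple(frame)     # padded to 448 x 1024
# flow = model(padded1, padded2)            # would return [B, 2, 448, 1024]
fake_flow = torch.zeros(1, 2, *padded.shape[-2:])  # stand-in for the model output
flow_cropped = fake_flow[..., :h, :w]       # back to the original 436 x 1024
print(padded.shape, flow_cropped.shape)
```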

Zero-shot benchmark

Full-split evaluation on Sintel (train, 1041 pairs/pass) and KITTI-2015 (train, 200 pairs) with the standard EPE / Fl-all protocol (no alignment), via anyflow-benchmark. EPE in px, lower is better.

Method (C+T)	Sintel-clean EPE	Sintel-final EPE	KITTI-15 EPE	KITTI-15 Fl-all
RAFT	1.43	2.71	5.04	17.4%
FlowFormer	1.01	2.40	4.09	14.7%
SEA-RAFT	1.19	4.11	3.62	12.9%
DinoFlow ViT-S (ours)	3.97	5.06	19.79	61.6%

Pixel accuracy (fraction of pixels whose end-point error is within the threshold; px3 = 3 px, px5 = 5 px): Sintel-clean px3 0.81 / px5 0.87; Sintel-final px3 0.77.
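
For reference, the metrics reported above can be computed roughly as in the sketch below. The function name flow_metrics is hypothetical and this is not the anyflow-benchmark code. Fl-all follows the usual KITTI-2015 outlier rule (error above 3 px and above 5% of the ground-truth magnitude); the strict "less than" comparison for px3/px5 is an assumption.

```python
import torch


def flow_metrics(pred, gt, valid):
    """pred, gt: [B, 2, H, W] flow in pixels; valid: [B, H, W] bool mask."""
    epe = torch.linalg.vector_norm(pred - gt, dim=1)[valid]  # per-pixel error
    mag = torch.linalg.vector_norm(gt, dim=1)[valid]         # GT magnitude
    outlier = (epe > 3.0) & (epe > 0.05 * mag)               # KITTI Fl rule
    return {
        "epe": epe.mean().item(),
        "fl_all": outlier.float().mean().item(),
        "px3": (epe < 3.0).float().mean().item(),
        "px5": (epe < 5.0).float().mean().item(),
    }


# Toy example: zero ground-truth flow, errors of 0, 1, 4 and 10 px in u only.
gt = torch.zeros(1, 2, 2, 2)
pred = torch.zeros(1, 2, 2, 2)
pred[0, 0] = torch.tensor([[0.0, 1.0], [4.0, 10.0]])
valid = torch.ones(1, 2, 2, dtype=torch.bool)
metrics = flow_metrics(pred, gt, valid)
print(metrics)
```

For KITTI, the valid mask would mark only the pixels that carry sparse ground truth; for Sintel it is dense.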

Honest positioning. This is a deliberately lightweight, single-pass probe (a frozen backbone with a tiny decoder and no iterative refinement), so it lands roughly at FlowNet level, well behind the recurrent-refinement SOTA above. The weak spot is KITTI: its large automotive displacements exceed the ±64 px local-correlation range, and its ground truth is sparse LiDAR. This is the known failure mode of a lightweight correlation head trained only on synthetic C+T data. Results on Sintel (moderate motion, dense GT) are far stronger.

Training

Frozen DINOv3 ViT-S/16, 4 tapped layers [2, 5, 8, 11], ImageNet-normalized input.
24 epochs on combined C+T at 512², global batch 48, AdamW lr 4e-4 (poly decay, 2-epoch warmup), masked-L1 end-point loss with a 400 px flow cap, RAFT-style augmentation, bf16 autocast.
4×GH200, ~3 h. See the GitHub repo for the exact config and anyflow-train command.
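
To make the training recipe above more concrete, here is a minimal sketch of a masked L1 flow loss with a 400 px cap and a poly learning-rate schedule with warmup. The exact config is in the GitHub repo; everything below is an assumption for illustration, including the function names, summing the L1 error over the two flow channels, linear warmup, and a poly power of 1.0.

```python
import torch


def masked_l1_flow_loss(pred, gt, valid, max_flow=400.0):
    """L1 flow error averaged over valid pixels whose GT magnitude is below the cap.

    pred, gt: [B, 2, H, W]; valid: [B, H, W] bool.
    """
    mag = torch.linalg.vector_norm(gt, dim=1)
    mask = valid & (mag < max_flow)               # drop invalid and extreme flows
    per_pixel = (pred - gt).abs().sum(dim=1)      # |du| + |dv|
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1)


def poly_lr(step, total_steps, warmup_steps, base_lr=4e-4, power=1.0):
    """Linear warmup to base_lr, then polynomial decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (1.0 - progress) ** power


pred = torch.randn(2, 2, 64, 64)
gt = torch.randn(2, 2, 64, 64)
valid = torch.ones(2, 64, 64, dtype=torch.bool)
loss = masked_l1_flow_loss(pred, gt, valid)
# Hypothetical step counts: 24 epochs of 1000 steps, with 2 warmup epochs.
lrs = [poly_lr(s, total_steps=24_000, warmup_steps=2_000) for s in (0, 1_999, 13_000)]
print(loss.item(), lrs)
```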

Paper

AnyDepth: Depth Estimation Made Easy. arXiv:2601.02760, published Jan 6.