DinoFlow: DINOv3 ViT-S/16 + correlation-augmented SDT optical-flow head
A compact optical-flow decoder on a frozen DINOv3 ViT-S/16 backbone, generalizing the
AnyDepth SDT recipe (arXiv:2601.02760) to two-frame flow. Only the small decoder is trained; the
DINOv3 encoder is frozen and run on the fly on both frames. Trained on the standard FlowNet/RAFT
C+T corpus (FlyingChairs → FlyingThings3D) from
blanchon/dinoflow-dataset.
Code: https://github.com/julien-blanchon/dinodepth (src/dinov3_dense).
Architecture
The depth SDT trunk, reused verbatim, with a flow front-end:
- The frozen DINOv3 backbone runs on both frames (siamese); a shared softmax `WeightedFusion` collapses each frame's 4 tapped layers into a feature grid at stride 16 (H/16 × W/16).
- A local correlation cost volume (radius 4 grid cells → ±64 px at stride 16, 9 × 9 = 81 neighbors) plus the feature difference between the two grids form the motion signal.
- The AnyDepth trunk (`SpatialDetailEnhancer` → two learned `DySample` ×4 stages) upsamples back to full resolution, and a final conv emits 2 channels (u, v) instead of single-channel disparity.
Single forward pass (no RAFT-style iterative refinement). Decoder: 6.88 M parameters.
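The flow front-end described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's implementation: the function names, the scalar per-layer fusion logits, the 1/C scaling of the correlation, zero padding at the borders, the sign of the feature difference, and the channel-wise concatenation are all assumptions made for the example.

```python
import numpy as np

def weighted_fusion(layer_feats, logits):
    """Collapse tapped layers [L, C, H, W] into one grid [C, H, W].

    Assumes one learnable scalar logit per layer, softmax-normalized.
    """
    w = np.exp(logits - logits.max())
    w /= w.sum()                                   # softmax over the layer axis
    return np.tensordot(w, layer_feats, axes=(0, 0))

def local_correlation(f1, f2, radius=4):
    """Local cost volume [(2r+1)^2, H, W] between grids f1, f2 of shape [C, H, W].

    Plane (dy, dx) holds the channel-averaged dot product of f1[:, y, x]
    and f2[:, y + dy, x + dx]; out-of-bounds neighbors count as zeros.
    """
    c, h, w = f1.shape
    r = radius
    f2p = np.pad(f2, ((0, 0), (r, r), (r, r)))     # zero-pad the spatial dims
    planes = []
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            shifted = f2p[:, r + dy : r + dy + h, r + dx : r + dx + w]
            planes.append((f1 * shifted).sum(axis=0) / c)
    return np.stack(planes)

def motion_signal(f1, f2, radius=4):
    """Concatenate the cost volume with the feature difference (assumed layout)."""
    return np.concatenate([local_correlation(f1, f2, radius), f2 - f1], axis=0)

# ViT-S features have 384 channels; a 512x512 input gives a 32x32 grid at stride 16.
rng = np.random.default_rng(0)
logits = np.zeros(4)                               # zero logits -> plain average
taps1 = rng.standard_normal((4, 384, 32, 32))
taps2 = rng.standard_normal((4, 384, 32, 32))
g1 = weighted_fusion(taps1, logits)
g2 = weighted_fusion(taps2, logits)
signal = motion_signal(g1, g2)                     # [81 + 384, 32, 32]
```

With radius 4 the search window is 9 × 9 = 81 displacements, and each grid cell spans 16 px, which is where the ±64 px reach comes from.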
Usage
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from dinov3_dense.head import FlowModel, FlowModelConfig
model = FlowModel.from_pretrained(FlowModelConfig(backbone="vits16"))
model.head.load_state_dict(load_file(hf_hub_download("blanchon/dinoflow-model", "flow-vits16.safetensors")))
model.eval()
# image1, image2: float [B, 3, H, W] in [0, 1], H/W multiples of 16
flow = model(image1, image2) # [B, 2, H, W] -> (u, v) pixels, frame1 -> frame2
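Because the model expects H and W to be multiples of 16, inputs such as Sintel frames (436 × 1024) need padding first. The helper below is a hypothetical sketch of the arithmetic only; `pad_amounts` is not part of `dinov3_dense`, and the repository may ship its own padding utility.

```python
def pad_amounts(height, width, multiple=16):
    """Return (pad_bottom, pad_right) that round both sides up to `multiple`."""
    pad_h = (-height) % multiple
    pad_w = (-width) % multiple
    return pad_h, pad_w

# Sintel frames are 436 x 1024: 12 extra rows, no extra columns.
pad_h, pad_w = pad_amounts(436, 1024)
print(pad_h, pad_w)  # 12 0
```

In PyTorch, one option is `torch.nn.functional.pad(image, (0, pad_w, 0, pad_h))` on both frames, followed by cropping the predicted flow back to `[..., :H, :W]`.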
Zero-shot benchmark
Full-split evaluation on Sintel (train, 1041 pairs/pass) and KITTI-2015 (train, 200 pairs) with the
standard EPE / Fl-all protocol (no alignment), via anyflow-benchmark. EPE in px, lower is better.
| Method (C+T) | Sintel-clean EPE | Sintel-final EPE | KITTI-15 EPE | KITTI-15 Fl-all |
|---|---|---|---|---|
| RAFT | 1.43 | 2.71 | 5.04 | 17.4% |
| FlowFormer | 1.01 | 2.40 | 4.09 | 14.7% |
| SEA-RAFT | 1.19 | 4.11 | 3.62 | 12.9% |
| DinoFlow ViT-S (ours) | 3.97 | 5.06 | 19.79 | 61.6% |
Pixel accuracy (fraction within threshold): Sintel-clean px3 0.81 / px5 0.87; Sintel-final px3 0.77.
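For reference, the metrics in this section can be computed per image roughly as follows. This is a sketch under stated assumptions, not the `anyflow-benchmark` code: the function name is hypothetical, px3/px5 are taken as strict `<` thresholds on the end-point error, and Fl-all uses the usual KITTI outlier rule (error above 3 px and above 5% of the ground-truth magnitude). How the benchmark aggregates across images may differ.

```python
import numpy as np

def flow_metrics(pred, gt, valid=None):
    """Per-image flow metrics.

    pred, gt: [2, H, W] flow in pixels.
    valid:    optional boolean [H, W] mask (e.g. sparse KITTI ground truth).
    """
    epe = np.linalg.norm(pred - gt, axis=0)        # end-point error per pixel
    mag = np.linalg.norm(gt, axis=0)               # ground-truth magnitude
    if valid is None:
        valid = np.ones(epe.shape, dtype=bool)
    epe, mag = epe[valid], mag[valid]
    outlier = (epe > 3.0) & (epe > 0.05 * mag)     # KITTI Fl outlier rule
    return {
        "epe": float(epe.mean()),
        "fl_all": float(outlier.mean()),
        "px3": float((epe < 3.0).mean()),
        "px5": float((epe < 5.0).mean()),
    }

# Toy case: a 4x4 image where one pixel is off by 4 px in u.
gt = np.zeros((2, 4, 4))
pred = gt.copy()
pred[0, 0, 0] = 4.0
metrics = flow_metrics(pred, gt)
```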
Honest positioning. This is a deliberately lightweight, single-pass probe (a frozen backbone with a tiny decoder and no iterative refinement), so it lands roughly at FlowNet level, well behind the recurrent-refinement SOTA above. The weak spot is KITTI: its large automotive displacements exceed the ±64 px local-correlation range and its ground truth is sparse LiDAR, a known failure mode for a lightweight correlation head trained only on synthetic C+T. Sintel (moderate motion, dense GT) is far stronger.
Training
- Frozen DINOv3 ViT-S/16, 4 tapped layers `[2, 5, 8, 11]`, ImageNet-normalized input.
- 24 epochs on combined C+T at 512², global batch 48, AdamW lr 4e-4 (poly decay, 2-epoch warmup), masked-L1 end-point loss with a 400 px flow cap, RAFT-style augmentation, bf16 autocast.
- 4×GH200, ~3 h. See the GitHub repo for the exact config and `anyflow-train` command.
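The loss and learning-rate schedule listed above can be sketched as follows. Several details are assumptions for illustration, since the exact config lives in the repository: the function names, L1 taken over the (u, v) components, linear warmup, a decay power of 1.0, decay to zero, and the steps-per-epoch figure used in the example.

```python
import numpy as np

def masked_l1_flow_loss(pred, gt, valid, max_flow=400.0):
    """Mean L1 flow error over valid pixels with ground-truth magnitude < max_flow.

    pred, gt: [2, H, W] flow in pixels; valid: boolean [H, W].
    """
    mag = np.linalg.norm(gt, axis=0)
    mask = valid & (mag < max_flow)                # drop invalid and extreme flows
    if not mask.any():
        return 0.0
    per_pixel = np.abs(pred - gt).sum(axis=0)      # L1 over the (u, v) components
    return float(per_pixel[mask].mean())

def lr_at(step, total_steps, warmup_steps, base_lr=4e-4, power=1.0):
    """Linear warmup to base_lr, then polynomial decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (1.0 - progress) ** power

# Hypothetical 1000 steps per epoch: 24 epochs in total, 2 of them warmup.
steps_per_epoch = 1000
total, warmup = 24 * steps_per_epoch, 2 * steps_per_epoch
schedule = [lr_at(s, total, warmup) for s in range(total)]
```

The 400 px cap mirrors the RAFT convention of excluding pixels with extreme ground-truth motion from the loss, so a handful of very large synthetic displacements do not dominate the gradient.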