<div align="center">
<br>
<h1>DOSOD<br>
A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space
</h1>
<br>
<a href="https://github.com/YonghaoHe">Yonghao He</a><sup><span>1,*,π </span></sup>,
<a href="https://people.ucas.edu.cn/~suhu">Hu Su</a><sup><span>2,*,π§</span></sup>,
<a href="https://github.com/HarveyYesan">Haiyong Yu</a><sup><span>1,*</span></sup>,
<a href="https://cong-yang.github.io/">Cong Yang</a><sup><span>3</span></sup>,
<a href="">Wei Sui</a><sup><span>1</span></sup>,
<a href="">Cong Wang</a><sup><span>1</span></sup>,
<a href="www.amnrlab.org">Song Liu</a><sup><span>4,π§</span></sup>
<br>
\* Equal contribution, π Project lead, π§ Corresponding author
<sup>1</sup> D-Robotics, <br>
<sup>2</sup> State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences,<br>
<sup>3</sup> BeeLab, School of Future Science and Engineering, Soochow University, <br>
<sup>4</sup> School of Information Science and Technology, ShanghaiTech University
[arXiv](https://arxiv.org/abs/2412.14680)
[License](LICENSE)
</div>
## 1. Introduction
### 1.1 Brief Introduction of DOSOD
YOLO-World recently established a new state of the art in open-vocabulary object detection,
and real-time open-vocabulary detection has since attracted significant attention and been applied in a wide range of scenarios.
In our paper, we propose Decoupled Open-Set Object Detection (**DOSOD**),
a practical and highly efficient solution for real-time OSOD tasks in robotic systems.
Specifically, DOSOD builds on the YOLO-World pipeline, which integrates a vision-language model (VLM) with a detector.
A Multilayer Perceptron (MLP) adaptor converts text embeddings extracted by the VLM into a joint space,
within which the detector learns region representations of class-agnostic proposals.
Cross-modality features are aligned directly in the joint space,
avoiding complex feature interactions and thereby improving computational efficiency.
At test time, DOSOD functions like a traditional closed-set detector,
effectively bridging the gap between closed-set and open-set detection.
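
The decoupled design can be summarized in a short PyTorch sketch. This is a minimal illustration under assumed names and dimensions (`MLPAdaptor`, `joint_dim`, the temperature `tau`), not the actual DOSOD implementation:

```python
# Minimal sketch of decoupled feature alignment in the joint space (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPAdaptor(nn.Module):
    """Maps text embeddings extracted by the VLM into the joint space."""
    def __init__(self, text_dim=512, joint_dim=512, hidden_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, joint_dim),
        )

    def forward(self, text_embeds):            # (num_classes, text_dim)
        return self.mlp(text_embeds)           # (num_classes, joint_dim)

def classify_proposals(region_feats, text_embeds, adaptor, tau=0.01):
    """Score class-agnostic proposals against text embeddings directly in the joint space.

    region_feats: (num_proposals, joint_dim) region representations from the detector.
    text_embeds:  (num_classes, text_dim) embeddings extracted offline by the VLM.
    """
    t = F.normalize(adaptor(text_embeds), dim=-1)   # (num_classes, joint_dim)
    r = F.normalize(region_feats, dim=-1)           # (num_proposals, joint_dim)
    return (r @ t.t()) / tau                        # (num_proposals, num_classes) logits
```

Because the text branch is just an MLP over precomputed embeddings, the class weights for a fixed vocabulary can be computed once offline, which is what lets DOSOD behave like a closed-set detector at test time.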
## 2. Model Overview
Following YOLO-World, we pre-trained DOSOD-S/M/L from scratch on public datasets and conducted zero-shot evaluation on `LVIS minival` and `COCO val2017`.
All pre-trained models are released.
### 2.1 Zero-shot Evaluation on LVIS minival
<div><font size=2>
| model | Pre-train Data | Size | AP<sup>mini</sup> | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> | weights |
|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------|:-----|:-----------------:|:--------------:|:--------------:|:--------------:|:----------------------------------------------------------------------------------------------------------------------------------:|
| <div style="text-align: center;">[YOLO-Worldv1-S]()<br>(repo)</div> | O365+GoldG | 640 | 24.3 | 16.6 | 22.1 | 27.7 | [HF Checkpoints π€](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| <div style="text-align: center;">[YOLO-Worldv1-M]()<br>(repo)</div> | O365+GoldG | 640 | 28.6 | 19.7 | 26.6 | 31.9 | [HF Checkpoints π€](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| <div style="text-align: center;">[YOLO-Worldv1-L]()<br>(repo)</div> | O365+GoldG | 640 | 32.5 | 22.3 | 30.6 | 36.1 | [HF Checkpoints π€](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| <div style="text-align: center;">[YOLO-Worldv1-S]()<br>(paper)</div> | O365+GoldG | 640 | 26.2 | 19.1 | 23.6 | 29.8 | [HF Checkpoints π€](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| <div style="text-align: center;">[YOLO-Worldv1-M]()<br>(paper)</div> | O365+GoldG | 640 | 31.0 | 23.8 | 29.2 | 33.9 | [HF Checkpoints π€](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| <div style="text-align: center;">[YOLO-Worldv1-L]()<br>(paper)</div> | O365+GoldG | 640 | 35.0 | 27.1 | 32.8 | 38.3 | [HF Checkpoints π€](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| [YOLO-Worldv2-S]() | O365+GoldG | 640 | 22.7 | 16.3 | 20.8 | 25.5 | [HF Checkpoints π€](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| [YOLO-Worldv2-M]() | O365+GoldG | 640 | 30.0 | 25.0 | 27.2 | 33.4 | [HF Checkpoints π€](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| [YOLO-Worldv2-L]() | O365+GoldG | 640 | 33.0 | 22.6 | 32.0 | 35.8 | [HF Checkpoints π€](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| [DOSOD-S]() | O365+GoldG | 640 | 26.7 | 19.9 | 25.1 | 29.3 | [HF Checkpoints π€](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_s.pth) |
| [DOSOD-M]() | O365+GoldG | 640 | 31.3 | 25.7 | 29.6 | 33.7 | [HF Checkpoints π€](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_m.pth) |
| [DOSOD-L]() | O365+GoldG | 640 | 34.4 | 29.1 | 32.6 | 36.6 | [HF Checkpoints π€](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_l.pth) |
> NOTE: The YOLO-Worldv1 results from the repo and the [paper](https://arxiv.org/abs/2401.17270) differ; both are listed above.
</font>
</div>
### 2.2 Zero-shot Evaluation on COCO val2017
<div><font size=2>
| model | Pre-train Data | Size | AP | AP<sub>50</sub> | AP<sub>75</sub> |
|:--------------------------------------------------------------------------------------------------------------------:|:---------------|:-----|:----:|:---------------:|:---------------:|
| <div style="text-align: center;">[YOLO-Worldv1-S]()<br>(paper)</div> | O365+GoldG | 640 | 37.6 | 52.3 | 40.7 |
| <div style="text-align: center;">[YOLO-Worldv1-M]()<br>(paper)</div> | O365+GoldG | 640 | 42.8 | 58.3 | 46.4 |
| <div style="text-align: center;">[YOLO-Worldv1-L]()<br>(paper)</div> | O365+GoldG | 640 | 44.4 | 59.8 | 48.3 |
| [YOLO-Worldv2-S]() | O365+GoldG | 640 | 37.5 | 52.0 | 40.7 |
| [YOLO-Worldv2-M]() | O365+GoldG | 640 | 42.8 | 58.2 | 46.7 |
| [YOLO-Worldv2-L]() | O365+GoldG | 640 | 45.4 | 61.0 | 49.4 |
| [DOSOD-S]() | O365+GoldG | 640 | 36.1 | 51.0 | 39.1 |
| [DOSOD-M]() | O365+GoldG | 640 | 41.7 | 57.1 | 45.2 |
| [DOSOD-L]() | O365+GoldG | 640 | 44.6 | 60.5 | 48.4 |
</font>
</div>
### 2.3 Latency on RTX 4090
We use the `trtexec` tool from [TensorRT 8.6.1.6](https://developer.nvidia.com/tensorrt) to measure latency in FP16 mode.
All models are re-parameterized with the 80 COCO categories (this step is sketched after the table below).
Log details can be found by clicking the FPS values.
| model | Params | FPS |
|:--------------:|:------:|:---------------------------------------:|
| YOLO-Worldv1-S | 13.32M | 1007 |
| YOLO-Worldv1-M | 28.93M | 702 |
| YOLO-Worldv1-L | 47.38M | 494 |
| YOLO-Worldv2-S | 12.66M | 1221 |
| YOLO-Worldv2-M | 28.20M | 771 |
| YOLO-Worldv2-L | 46.62M | 553 |
| DOSOD-S | 11.48M | 1582 |
| DOSOD-M | 26.31M | 922 |
| DOSOD-L | 44.19M | 632 |
> NOTE: FPS = 1000 / mean GPU compute time (ms) reported by `trtexec`.
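
The re-parameterization step mentioned above (folding a fixed vocabulary such as the 80 COCO categories into a plain classification head) can be sketched roughly as follows. The function and variable names are hypothetical; the actual procedure lives in the training repository linked in Section 3:

```python
# Illustrative sketch of re-parameterizing DOSOD into a closed-set head (hypothetical names).
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def reparameterize_head(adaptor, text_embeds, joint_dim=512):
    """Fold fixed class text embeddings (e.g., the 80 COCO categories) into a plain
    linear classification head, so deployment no longer needs the text branch."""
    weights = F.normalize(adaptor(text_embeds), dim=-1)         # (num_classes, joint_dim)
    head = nn.Linear(joint_dim, weights.shape[0], bias=False)   # fixed closed-set classifier
    head.weight.copy_(weights)                                  # head(x) == x @ weights.T
    return head
```

After this step the exported model no longer needs text inputs at inference time, which is why it can be benchmarked with `trtexec` like an ordinary closed-set detector.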
### 2.4 Latency on RDK X5
We evaluate the real-time performance of the YOLO-Worldv2 models and our DOSOD models on the [D-Robotics RDK X5](https://d-robotics.cc/rdkx5) development kit.
The models are re-parameterized with the 1203 categories defined in LVIS and run on the RDK X5 with either 1 thread or 8 threads, using INT8 or INT16 quantization.
| model | FPS (1 thread) | FPS (8 threads) |
|:-------------------------------:|:--------------:|:---------------:|
| YOLO-Worldv2-S<br/>(INT16/INT8) | 5.962/11.044 | 6.386/12.590 |
| YOLO-Worldv2-M<br/>(INT16/INT8) | 4.136/7.290 | 4.340/7.930 |
| YOLO-Worldv2-L<br/>(INT16/INT8) | 2.958/5.377 | 3.060/5.720 |
| DOSOD-S<br/>(INT16/INT8) | 12.527/31.020 | 14.657/47.328 |
| DOSOD-M<br/>(INT16/INT8) | 8.531/20.238 | 9.471/26.36 |
| DOSOD-L<br/>(INT16/INT8) | 5.663/12.799 | 6.069/14.939 |
## 3. Usage
- Float model training and re-parameterization: https://github.com/D-Robotics-AI-Lab/DOSOD
- Runtime usage on RDK: https://github.com/D-Robotics/hobot_dosod