<div align="center">
<br>
<h1>DOSOD<br>
A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space
</h1>
<br>
<a href="https://github.com/YonghaoHe">Yonghao He</a><sup><span>1,*,🌟 </span></sup>, 
<a href="https://people.ucas.edu.cn/~suhu">Hu Su</a><sup><span>2,*,πŸ“§</span></sup>,
<a href="https://github.com/HarveyYesan">Haiyong Yu</a><sup><span>1,*</span></sup>,
<a href="https://cong-yang.github.io/">Cong Yang</a><sup><span>3</span></sup>,
<a href="">Wei Sui</a><sup><span>1</span></sup>,
<a href="">Cong Wang</a><sup><span>1</span></sup>,
<a href="www.amnrlab.org">Song Liu</a><sup><span>4,πŸ“§</span></sup>
<br>

\* Equal contribution, 🌟 Project lead, 📧 Corresponding author

<sup>1</sup> D-Robotics, <br>
<sup>2</sup> State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences,<br>
<sup>3</sup> BeeLab, School of Future Science and Engineering, Soochow University, <br>
<sup>4</sup> School of Information Science and Technology, ShanghaiTech University

[![arxiv paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2412.14680)
[![license](https://img.shields.io/badge/License-GPLv3.0-blue)](LICENSE)
</div>

## 1. Introduction

### 1.1 Brief Introduction of DOSOD

YOLO-World established a new state of the art in open-vocabulary object detection,
and real-time open-vocabulary detection has since attracted significant attention and found use in a wide range of scenarios.
In our paper, we propose Decoupled Open-Set Object Detection (**DOSOD**) as a
practical and highly efficient solution for real-time open-set object detection (OSOD) in robotic systems.
Like YOLO-World, DOSOD combines a vision-language model (VLM) with a detector.
A Multilayer Perceptron (MLP) adaptor maps the text embeddings extracted by the VLM into a joint space,
within which the detector learns region representations of class-agnostic proposals.
Cross-modality features are aligned directly in the joint space,
avoiding complex feature interactions and thereby improving computational efficiency.
At test time, DOSOD behaves like a traditional closed-set detector,
effectively bridging the gap between closed-set and open-set detection.
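
The decoupled design is easy to sketch. The snippet below is a minimal PyTorch illustration, assuming CLIP-style 512-d text embeddings, a 256-d joint space, and a 3-layer adaptor (matching the `mlp3x` naming of the released checkpoints); it is not the repository's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPAdaptor(nn.Module):
    """Project VLM text embeddings into the joint space (illustrative)."""
    def __init__(self, text_dim=512, joint_dim=256, depth=3):
        super().__init__()
        layers, dim = [], text_dim
        for _ in range(depth - 1):
            layers += [nn.Linear(dim, joint_dim), nn.ReLU(inplace=True)]
            dim = joint_dim
        layers.append(nn.Linear(dim, joint_dim))
        self.mlp = nn.Sequential(*layers)

    def forward(self, text_emb):   # (K, text_dim)
        return self.mlp(text_emb)  # (K, joint_dim)

def joint_space_logits(region_feats, text_emb, adaptor, scale=100.0):
    """Cosine-similarity logits between N region features and K class texts."""
    t = F.normalize(adaptor(text_emb), dim=-1)  # (K, joint_dim)
    r = F.normalize(region_feats, dim=-1)       # (N, joint_dim)
    return scale * r @ t.t()                    # (N, K)

# Toy shapes: 8 class-agnostic proposals scored against 80 category texts.
logits = joint_space_logits(torch.randn(8, 256), torch.randn(80, 512),
                            MLPAdaptor())
print(logits.shape)  # torch.Size([8, 80])
```

Because the two modalities only meet through this similarity, no cross-attention between text and image features is needed inside the detector.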

## 2. Model Overview

Following YOLO-World, we pre-trained DOSOD-S/M/L from scratch on public datasets and conducted zero-shot evaluations on `LVIS minival` and `COCO val2017`.
All pre-trained models are released.
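
For example, a released checkpoint can be downloaded with the `huggingface_hub` client; the repo id and filename below come from the checkpoint links in the tables (loading the weights into a model requires the training repository linked in Section 3):

```python
from huggingface_hub import hf_hub_download

# Download the DOSOD-S checkpoint from the Hugging Face Hub.
ckpt_path = hf_hub_download(repo_id="D-Robotics/DOSOD",
                            filename="dosod_mlp3x_s.pth")
print(ckpt_path)  # local cache path of the downloaded .pth file
```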

### 2.1 Zero-shot Evaluation on LVIS minival

<div><font size=2>

|                                                                                     model                                                                                      | Pre-train Data | Size | AP<sup>mini</sup> | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> |                                                              weights                                                               |
|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------|:-----|:-----------------:|:--------------:|:--------------:|:--------------:|:----------------------------------------------------------------------------------------------------------------------------------:|
| <div style="text-align: center;">[YOLO-Worldv1-S]()<br>(repo)</div> | O365+GoldG     | 640  |       24.3        |      16.6      |      22.1      |      27.7      | [HF Checkpoints πŸ€—](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| <div style="text-align: center;">[YOLO-Worldv1-M]()<br>(repo)</div> | O365+GoldG     | 640  |       28.6        |      19.7      |      26.6      |      31.9      | [HF Checkpoints πŸ€—](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) | 
| <div style="text-align: center;">[YOLO-Worldv1-L]()<br>(repo)</div> | O365+GoldG     | 640  |       32.5        |      22.3      |      30.6      |      36.1      | [HF Checkpoints πŸ€—](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) | 
|                                                      <div style="text-align: center;">[YOLO-Worldv1-S]()<br>(paper)</div>                                                      | O365+GoldG     | 640  |       26.2        |      19.1      |      23.6      |      29.8      | [HF Checkpoints πŸ€—](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
|                                                      <div style="text-align: center;">[YOLO-Worldv1-M]()<br>(paper)</div>                                                      | O365+GoldG     | 640  |       31.0        |      23.8      |      29.2      |      33.9      | [HF Checkpoints πŸ€—](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) | 
|                                                      <div style="text-align: center;">[YOLO-Worldv1-L]()<br>(paper)</div>                                                      | O365+GoldG     | 640  |       35.0        |      27.1      |      32.8      |      38.3      | [HF Checkpoints πŸ€—](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
|                              [YOLO-Worldv2-S]()                              | O365+GoldG     | 640  |       22.7        |      16.3      |      20.8      |      25.5      | [HF Checkpoints πŸ€—](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
|                              [YOLO-Worldv2-M]()                              | O365+GoldG     | 640  |       30.0        |      25.0      |      27.2      |      33.4      | [HF Checkpoints πŸ€—](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) | 
|                              [YOLO-Worldv2-L]()                              | O365+GoldG     | 640  |       33.0        |      22.6      |      32.0      |      35.8      | [HF Checkpoints πŸ€—](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) | 
|                                     [DOSOD-S]()                                     | O365+GoldG     | 640  |       26.7        |      19.9      |      25.1      |      29.3      |                      [HF Checkpoints πŸ€—](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_s.pth)                      |
|                                     [DOSOD-M]()                                     | O365+GoldG     | 640  |       31.3        |      25.7      |      29.6      |      33.7      |                              [HF Checkpoints πŸ€—](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_m.pth)                                                                                                       | 
|                                     [DOSOD-L]()                                     | O365+GoldG     | 640  |       34.4        |      29.1      |      32.6      |      36.6      |                                       [HF Checkpoints πŸ€—](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_l.pth)                                                                                              | 

> NOTE: The YOLO-Worldv1 results from the repository and the [paper](https://arxiv.org/abs/2401.17270) differ, so both are listed.

</font>
</div>

### 2.2 Zero-shot Evaluation on COCO val2017

<div><font size=2>

|                                                        model                                                         | Pre-train Data | Size |  AP  | AP<sub>50</sub> | AP<sub>75</sub> | 
|:--------------------------------------------------------------------------------------------------------------------:|:---------------|:-----|:----:|:---------------:|:---------------:|
| [YOLO-Worldv1-S]()<br>(paper) | O365+GoldG | 640 | 37.6 | 52.3 | 40.7 |
| [YOLO-Worldv1-M]()<br>(paper) | O365+GoldG | 640 | 42.8 | 58.3 | 46.4 |
| [YOLO-Worldv1-L]()<br>(paper) | O365+GoldG | 640 | 44.4 | 59.8 | 48.3 |
| [YOLO-Worldv2-S]() | O365+GoldG     | 640  | 37.5 |      52.0       |      40.7       |
| [YOLO-Worldv2-M]() | O365+GoldG     | 640  | 42.8 |      58.2       |      46.7       | 
| [YOLO-Worldv2-L]() | O365+GoldG     | 640  | 45.4 |      61.0       |      49.4       | 
|        [DOSOD-S]()        | O365+GoldG     | 640  | 36.1 |      51.0       |      39.1       |
|        [DOSOD-M]()        | O365+GoldG     | 640  | 41.7 |      57.1       |      45.2       | 
|        [DOSOD-L]()        | O365+GoldG     | 640  | 44.6 |      60.5       |      48.4       | 

</font>
</div>

### 2.3 Latency on RTX 4090

We use the `trtexec` tool from [TensorRT 8.6.1.6](https://developer.nvidia.com/tensorrt) to measure latency in FP16 mode.
All models are re-parameterized with the 80 COCO categories.
Log details can be found by clicking the FPS values.

|     model      | Params |                   FPS                   |
|:--------------:|:------:|:---------------------------------------:|
| YOLO-Worldv1-S | 13.32M | 1007 |
| YOLO-Worldv1-M | 28.93M | 702  |
| YOLO-Worldv1-L | 47.38M | 494  |
| YOLO-Worldv2-S | 12.66M | 1221 |
| YOLO-Worldv2-M | 28.20M | 771 |
| YOLO-Worldv2-L | 46.62M | 553  |
|    DOSOD-S     | 11.48M |    1582    |
|    DOSOD-M     | 26.31M |     922     |
|    DOSOD-L     | 44.19M |     632     |

> NOTE: FPS = 1000 / GPU Compute Time (mean, in ms) as reported by `trtexec`.
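
Re-parameterization folds the adapted text embeddings of a fixed vocabulary into a static classification head, so the exported model drops the text branch at inference, consistent with DOSOD behaving as a closed-set detector at test time. The sketch below illustrates the idea with an assumed 1×1-conv head and 256-d joint space; it is not the repository's export code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def reparameterize_head(adapted_text_emb):
    """Bake (K, D) joint-space text embeddings into a fixed 1x1 conv head,
    so the detector runs as a closed-set model for the K chosen classes."""
    k, d = adapted_text_emb.shape
    head = nn.Conv2d(d, k, kernel_size=1, bias=False)
    with torch.no_grad():
        head.weight.copy_(F.normalize(adapted_text_emb, dim=-1).view(k, d, 1, 1))
    return head

# 80 COCO categories in a 256-d joint space; input is a dense feature map.
head = reparameterize_head(torch.randn(80, 256))
logits = head(torch.randn(1, 256, 20, 20))  # (1, 80, 20, 20) per-location scores
```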

### 2.4 Latency on RDK X5

We evaluate the real-time performance of the YOLO-Worldv2 and DOSOD models on the [D-Robotics RDK X5](https://d-robotics.cc/rdkx5) development kit.
The models are re-parameterized with the 1203 categories defined in LVIS and run with either 1 thread or 8 threads under INT8 or INT16 quantization.

|              model              | FPS (1 thread) | FPS (8 threads) |
|:-------------------------------:|:--------------:|:---------------:|
| YOLO-Worldv2-S<br/>(INT16/INT8) |  5.962/11.044  |  6.386/12.590   |
| YOLO-Worldv2-M<br/>(INT16/INT8) |  4.136/7.290   |   4.340/7.930   |
| YOLO-Worldv2-L<br/>(INT16/INT8) |  2.958/5.377   |   3.060/5.720   |
|    DOSOD-S<br/>(INT16/INT8)     | 12.527/31.020  |  14.657/47.328  |
|    DOSOD-M<br/>(INT16/INT8)     |  8.531/20.238  |   9.471/26.36   |
|    DOSOD-L<br/>(INT16/INT8)     |  5.663/12.799  |  6.069/14.939   |
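
The 1-thread vs. 8-thread methodology can be mimicked with a generic harness such as the sketch below, where `infer` is a stand-in for one quantized forward pass; the actual RDK X5 runtime API lives in the `hobot_dosod` repository linked in Section 3:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def infer(frame):
    """Stand-in for a single quantized forward pass on the device."""
    time.sleep(0.01)

def measure_fps(num_threads, num_frames=200):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(infer, range(num_frames)))
    return num_frames / (time.perf_counter() - start)

for n in (1, 8):
    print(f"{n} thread(s): {measure_fps(n):.1f} FPS")
```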

## 3. Usage

Floating-point model training and re-parameterization: https://github.com/D-Robotics-AI-Lab/DOSOD

Runtime usage on RDK: https://github.com/D-Robotics/hobot_dosod