<div align="center">
<br>
<h1>DOSOD<br>
A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space
</h1>
<br>
<a href="https://github.com/YonghaoHe">Yonghao He</a><sup><span>1,*,📌</span></sup>,
<a href="https://people.ucas.edu.cn/~suhu">Hu Su</a><sup><span>2,*,📧</span></sup>,
<a href="https://github.com/HarveyYesan">Haiyong Yu</a><sup><span>1,*</span></sup>,
<a href="https://cong-yang.github.io/">Cong Yang</a><sup><span>3</span></sup>,
<a href="">Wei Sui</a><sup><span>1</span></sup>,
<a href="">Cong Wang</a><sup><span>1</span></sup>,
<a href="https://www.amnrlab.org">Song Liu</a><sup><span>4,📧</span></sup>
<br>

\* Equal contribution, 📌 Project lead, 📧 Corresponding author

<sup>1</sup> D-Robotics, <br>
<sup>2</sup> State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences,<br>
<sup>3</sup> BeeLab, School of Future Science and Engineering, Soochow University, <br>
<sup>4</sup> School of Information Science and Technology, ShanghaiTech University

[arXiv Paper](https://arxiv.org/abs/2412.14680)
[License](LICENSE)
</div>

## 1. Introduction

### 1.1 Brief Introduction of DOSOD

Since YOLO-World established a new state of the art in open-vocabulary object detection,
open-vocabulary detectors have been applied in a wide range of scenarios,
and real-time open-vocabulary detection has attracted significant attention.
In our paper, we propose Decoupled Open-Set Object Detection (**DOSOD**) as a
practical and highly efficient solution for real-time open-set object detection (OSOD) in robotic systems.
Specifically, DOSOD builds on the YOLO-World pipeline, which integrates a vision-language model (VLM) with a detector.
A Multilayer Perceptron (MLP) adaptor converts text embeddings extracted by the VLM into a joint space,
within which the detector learns the region representations of class-agnostic proposals.
Cross-modality features are aligned directly in the joint space,
avoiding complex feature interactions and thereby improving computational efficiency.
During testing, DOSOD functions like a traditional closed-set detector,
effectively bridging the gap between closed-set and open-set detection.
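
The decoupled design can be summarized in a short PyTorch-style sketch, shown below. This is a minimal illustration only: module names such as `MLPAdaptor`, the embedding dimensions, and the three-layer depth are assumptions for the example, not the released implementation.

```python
# Minimal sketch of DOSOD's decoupled joint-space alignment (illustrative only).
# MLPAdaptor, text_dim, joint_dim and num_layers are assumed names/values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLPAdaptor(nn.Module):
    """Maps VLM text embeddings into the joint space."""

    def __init__(self, text_dim: int = 512, joint_dim: int = 256, num_layers: int = 3):
        super().__init__()
        dims = [text_dim] + [joint_dim] * num_layers
        layers = []
        for i in range(num_layers):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < num_layers - 1:
                layers.append(nn.ReLU(inplace=True))
        self.mlp = nn.Sequential(*layers)

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        return self.mlp(text_embeds)


def classify_regions(region_feats: torch.Tensor, class_embeds: torch.Tensor) -> torch.Tensor:
    """Align the two modalities directly in the joint space: cosine similarity, no feature interaction."""
    region_feats = F.normalize(region_feats, dim=-1)   # (num_proposals, joint_dim)
    class_embeds = F.normalize(class_embeds, dim=-1)   # (num_classes, joint_dim)
    return region_feats @ class_embeds.t()             # (num_proposals, num_classes)


adaptor = MLPAdaptor()
text_embeds = torch.randn(80, 512)      # stand-in for CLIP-style text embeddings of 80 class names
region_feats = torch.randn(1000, 256)   # stand-in for class-agnostic proposal features
scores = classify_regions(region_feats, adaptor(text_embeds))
print(scores.shape)  # torch.Size([1000, 80])
```

Because the adapted text embeddings are fixed once the vocabulary is chosen, they can be baked into an ordinary classification layer at test time (the re-parameterization used in the latency measurements below), which is why DOSOD behaves like a closed-set detector during inference.
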
## 2. Model Overview

Following YOLO-World, we also pre-trained DOSOD-S/M/L from scratch on public datasets and conducted zero-shot evaluation on `LVIS minival` and `COCO val2017`.
All pre-trained models are released.
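
All released weights can also be fetched programmatically from Hugging Face. The snippet below is a minimal sketch using `huggingface_hub`; the repo id and filename are taken from the checkpoint links in the tables that follow.

```python
# Optional: download a released DOSOD checkpoint from Hugging Face.
# Repo id and filename come from the checkpoint links in the tables below.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="D-Robotics/DOSOD", filename="dosod_mlp3x_l.pth")
print(ckpt_path)  # local path to dosod_mlp3x_l.pth
```
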
### 2.1 Zero-shot Evaluation on LVIS minival

<div><font size=2>

| model | Pre-train Data | Size | AP<sup>mini</sup> | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> | weights |
|:-----------------------------:|:---------------|:-----|:-----------------:|:--------------:|:--------------:|:--------------:|:-------:|
| [YOLO-Worldv1-S]()<br>(repo) | O365+GoldG | 640 | 24.3 | 16.6 | 22.1 | 27.7 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| [YOLO-Worldv1-M]()<br>(repo) | O365+GoldG | 640 | 28.6 | 19.7 | 26.6 | 31.9 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| [YOLO-Worldv1-L]()<br>(repo) | O365+GoldG | 640 | 32.5 | 22.3 | 30.6 | 36.1 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| [YOLO-Worldv1-S]()<br>(paper) | O365+GoldG | 640 | 26.2 | 19.1 | 23.6 | 29.8 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| [YOLO-Worldv1-M]()<br>(paper) | O365+GoldG | 640 | 31.0 | 23.8 | 29.2 | 33.9 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| [YOLO-Worldv1-L]()<br>(paper) | O365+GoldG | 640 | 35.0 | 27.1 | 32.8 | 38.3 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| [YOLO-Worldv2-S]() | O365+GoldG | 640 | 22.7 | 16.3 | 20.8 | 25.5 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| [YOLO-Worldv2-M]() | O365+GoldG | 640 | 30.0 | 25.0 | 27.2 | 33.4 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| [YOLO-Worldv2-L]() | O365+GoldG | 640 | 33.0 | 22.6 | 32.0 | 35.8 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| [DOSOD-S]() | O365+GoldG | 640 | 26.7 | 19.9 | 25.1 | 29.3 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_s.pth) |
| [DOSOD-M]() | O365+GoldG | 640 | 31.3 | 25.7 | 29.6 | 33.7 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_m.pth) |
| [DOSOD-L]() | O365+GoldG | 640 | 34.4 | 29.1 | 32.6 | 36.6 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_l.pth) |

> NOTE: The YOLO-Worldv1 results reported in the official repo differ from those reported in the [paper](https://arxiv.org/abs/2401.17270), so both are listed.

</font>
</div>

### 2.2 Zero-shot Inference on COCO dataset

<div><font size=2>

| model | Pre-train Data | Size | AP | AP<sub>50</sub> | AP<sub>75</sub> |
|:-----------------------------:|:---------------|:-----|:----:|:---------------:|:---------------:|
| [YOLO-Worldv1-S]()<br>(paper) | O365+GoldG | 640 | 37.6 | 52.3 | 40.7 |
| [YOLO-Worldv1-M]()<br>(paper) | O365+GoldG | 640 | 42.8 | 58.3 | 46.4 |
| [YOLO-Worldv1-L]()<br>(paper) | O365+GoldG | 640 | 44.4 | 59.8 | 48.3 |
| [YOLO-Worldv2-S]() | O365+GoldG | 640 | 37.5 | 52.0 | 40.7 |
| [YOLO-Worldv2-M]() | O365+GoldG | 640 | 42.8 | 58.2 | 46.7 |
| [YOLO-Worldv2-L]() | O365+GoldG | 640 | 45.4 | 61.0 | 49.4 |
| [DOSOD-S]() | O365+GoldG | 640 | 36.1 | 51.0 | 39.1 |
| [DOSOD-M]() | O365+GoldG | 640 | 41.7 | 57.1 | 45.2 |
| [DOSOD-L]() | O365+GoldG | 640 | 44.6 | 60.5 | 48.4 |

</font>
</div>

### 2.3 Latency On RTX 4090

We use the `trtexec` tool from [TensorRT 8.6.1.6](https://developer.nvidia.com/tensorrt) to measure latency in FP16 mode.
All models are re-parameterized with the 80 categories from COCO.
Log info can be found by clicking the FPS values.

| model | Params | FPS |
|:--------------:|:------:|:----:|
| YOLO-Worldv1-S | 13.32M | 1007 |
| YOLO-Worldv1-M | 28.93M | 702 |
| YOLO-Worldv1-L | 47.38M | 494 |
| YOLO-Worldv2-S | 12.66M | 1221 |
| YOLO-Worldv2-M | 28.20M | 771 |
| YOLO-Worldv2-L | 46.62M | 553 |
| DOSOD-S | 11.48M | 1582 |
| DOSOD-M | 26.31M | 922 |
| DOSOD-L | 44.19M | 632 |

> NOTE: FPS = 1000 / mean GPU Compute Time (in ms) reported by `trtexec`; for example, DOSOD-S's mean compute time of about 0.632 ms gives 1000 / 0.632 ≈ 1582 FPS.
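
For reference, the sketch below shows one way to reproduce this conversion from a `trtexec` run. The ONNX filename and the exact wording of the parsed summary line are assumptions about a local setup, not part of this repository.

```python
# Sketch: run trtexec in FP16 mode and turn the reported mean GPU compute time into FPS.
# The ONNX filename and the summary-line format being parsed are assumptions.
import re
import subprocess

log = subprocess.run(
    ["trtexec", "--onnx=dosod_s.onnx", "--fp16"],
    capture_output=True, text=True, check=True,
).stdout

# trtexec prints a summary line similar to:
#   GPU Compute Time: min = ..., max = ..., mean = 0.632 ms, ...
m = re.search(r"GPU Compute Time:.*mean = ([\d.]+) ms", log)
if m:
    mean_ms = float(m.group(1))
    print(f"FPS = 1000 / {mean_ms} ms = {1000.0 / mean_ms:.0f}")
```
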
### 2.4 Latency On RDK X5

We evaluate the real-time performance of the YOLO-Worldv2 models and our DOSOD models on the [D-Robotics RDK X5](https://d-robotics.cc/rdkx5) development kit.
The models are re-parameterized with the 1203 categories defined in LVIS and run on the RDK X5 with either 1 thread or 8 threads, in INT16 or INT8 quantization mode.

| model | FPS (1 thread) | FPS (8 threads) |
|:-------------------------------:|:--------------:|:---------------:|
| YOLO-Worldv2-S<br/>(INT16/INT8) | 5.962/11.044 | 6.386/12.590 |
| YOLO-Worldv2-M<br/>(INT16/INT8) | 4.136/7.290 | 4.340/7.930 |
| YOLO-Worldv2-L<br/>(INT16/INT8) | 2.958/5.377 | 3.060/5.720 |
| DOSOD-S<br/>(INT16/INT8) | 12.527/31.020 | 14.657/47.328 |
| DOSOD-M<br/>(INT16/INT8) | 8.531/20.238 | 9.471/26.36 |
| DOSOD-L<br/>(INT16/INT8) | 5.663/12.799 | 6.069/14.939 |