<div align="center">
<br>
<h1>DOSOD<br>
A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space
</h1>
<br>
<a href="https://github.com/YonghaoHe">Yonghao He</a><sup><span>1,*,📌</span></sup>,
<a href="https://people.ucas.edu.cn/~suhu">Hu Su</a><sup><span>2,*,📧</span></sup>,
<a href="https://github.com/HarveyYesan">Haiyong Yu</a><sup><span>1,*</span></sup>,
<a href="https://cong-yang.github.io/">Cong Yang</a><sup><span>3</span></sup>,
<a href="">Wei Sui</a><sup><span>1</span></sup>,
<a href="">Cong Wang</a><sup><span>1</span></sup>,
<a href="https://www.amnrlab.org">Song Liu</a><sup><span>4,📧</span></sup>
<br>

\* Equal contribution, 📌 Project lead, 📧 Corresponding author

<sup>1</sup> D-Robotics, <br>
<sup>2</sup> State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences,<br>
<sup>3</sup> BeeLab, School of Future Science and Engineering, Soochow University, <br>
<sup>4</sup> School of Information Science and Technology, ShanghaiTech University

[arXiv Paper](https://arxiv.org/abs/2412.14680)
[License](LICENSE)
</div>

## 1. Introduction

### 1.1 Brief Introduction of DOSOD

Since YOLO-World established a new state of the art in open-vocabulary object detection,
open-vocabulary detectors have been applied in a wide range of scenarios,
and real-time open-vocabulary detection has attracted significant attention.
In our paper, we propose Decoupled Open-Set Object Detection (**DOSOD**) as a
practical and highly efficient solution for real-time open-set object detection (OSOD) in robotic systems.
Specifically, DOSOD builds on the YOLO-World pipeline, which integrates a vision-language model (VLM) with a detector.
A Multilayer Perceptron (MLP) adaptor converts text embeddings extracted by the VLM into a joint space,
within which the detector learns the region representations of class-agnostic proposals.
Cross-modality features are aligned directly in the joint space,
avoiding complex feature interactions and thereby improving computational efficiency.
During testing, DOSOD functions like a traditional closed-set detector,
effectively bridging the gap between closed-set and open-set detection.
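
The decoupled design can be summarized in a short PyTorch-style sketch, shown below. This is a minimal illustration only: module names such as `MLPAdaptor`, the embedding dimensions, and the three-layer depth are assumptions for the example, not the released implementation.

```python
# Minimal sketch of DOSOD's decoupled joint-space alignment (illustrative only).
# MLPAdaptor, text_dim, joint_dim and num_layers are assumed names/values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLPAdaptor(nn.Module):
    """Maps VLM text embeddings into the joint space."""

    def __init__(self, text_dim: int = 512, joint_dim: int = 256, num_layers: int = 3):
        super().__init__()
        dims = [text_dim] + [joint_dim] * num_layers
        layers = []
        for i in range(num_layers):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < num_layers - 1:
                layers.append(nn.ReLU(inplace=True))
        self.mlp = nn.Sequential(*layers)

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        return self.mlp(text_embeds)


def classify_regions(region_feats: torch.Tensor, class_embeds: torch.Tensor) -> torch.Tensor:
    """Align the two modalities directly in the joint space: cosine similarity, no feature interaction."""
    region_feats = F.normalize(region_feats, dim=-1)   # (num_proposals, joint_dim)
    class_embeds = F.normalize(class_embeds, dim=-1)   # (num_classes, joint_dim)
    return region_feats @ class_embeds.t()             # (num_proposals, num_classes)


adaptor = MLPAdaptor()
text_embeds = torch.randn(80, 512)      # stand-in for CLIP-style text embeddings of 80 class names
region_feats = torch.randn(1000, 256)   # stand-in for class-agnostic proposal features
scores = classify_regions(region_feats, adaptor(text_embeds))
print(scores.shape)  # torch.Size([1000, 80])
```

Because the adapted text embeddings are fixed once the vocabulary is chosen, they can be baked into an ordinary classification layer at test time (the re-parameterization used in the latency measurements below), which is why DOSOD behaves like a closed-set detector during inference.
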
## 2. Model Overview

Following YOLO-World, we also pre-trained DOSOD-S/M/L from scratch on public datasets and conducted zero-shot evaluation on `LVIS minival` and `COCO val2017`.
All pre-trained models are released.
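
All released weights can also be fetched programmatically from Hugging Face. The snippet below is a minimal sketch using `huggingface_hub`; the repo id and filename are taken from the checkpoint links in the tables that follow.

```python
# Optional: download a released DOSOD checkpoint from Hugging Face.
# Repo id and filename come from the checkpoint links in the tables below.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="D-Robotics/DOSOD", filename="dosod_mlp3x_l.pth")
print(ckpt_path)  # local path to dosod_mlp3x_l.pth
```
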
### 2.1 Zero-shot Evaluation on LVIS minival

<div><font size=2>

| model | Pre-train Data | Size | AP<sup>mini</sup> | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> | weights |
|:-----------------------------:|:---------------|:-----|:-----------------:|:--------------:|:--------------:|:--------------:|:-------:|
| [YOLO-Worldv1-S]()<br>(repo) | O365+GoldG | 640 | 24.3 | 16.6 | 22.1 | 27.7 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| [YOLO-Worldv1-M]()<br>(repo) | O365+GoldG | 640 | 28.6 | 19.7 | 26.6 | 31.9 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| [YOLO-Worldv1-L]()<br>(repo) | O365+GoldG | 640 | 32.5 | 22.3 | 30.6 | 36.1 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| [YOLO-Worldv1-S]()<br>(paper) | O365+GoldG | 640 | 26.2 | 19.1 | 23.6 | 29.8 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| [YOLO-Worldv1-M]()<br>(paper) | O365+GoldG | 640 | 31.0 | 23.8 | 29.2 | 33.9 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| [YOLO-Worldv1-L]()<br>(paper) | O365+GoldG | 640 | 35.0 | 27.1 | 32.8 | 38.3 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| [YOLO-Worldv2-S]() | O365+GoldG | 640 | 22.7 | 16.3 | 20.8 | 25.5 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_s_obj365v1_goldg_pretrain-55b943ea.pth) |
| [YOLO-Worldv2-M]() | O365+GoldG | 640 | 30.0 | 25.0 | 27.2 | 33.4 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_m_obj365v1_goldg_pretrain-c6237d5b.pth) |
| [YOLO-Worldv2-L]() | O365+GoldG | 640 | 33.0 | 22.6 | 32.0 | 35.8 | [HF Checkpoints 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/yolo_world_v2_l_obj365v1_goldg_pretrain-a82b1fe3.pth) |
| [DOSOD-S]() | O365+GoldG | 640 | 26.7 | 19.9 | 25.1 | 29.3 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_s.pth) |
| [DOSOD-M]() | O365+GoldG | 640 | 31.3 | 25.7 | 29.6 | 33.7 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_m.pth) |
| [DOSOD-L]() | O365+GoldG | 640 | 34.4 | 29.1 | 32.6 | 36.6 | [HF Checkpoints 🤗](https://huggingface.co/D-Robotics/DOSOD/blob/main/dosod_mlp3x_l.pth) |

> NOTE: The YOLO-Worldv1 results reported in the official repo differ from those reported in the [paper](https://arxiv.org/abs/2401.17270), so both are listed.

</font>
</div>

### 2.2 Zero-shot Inference on COCO dataset

<div><font size=2>

| model | Pre-train Data | Size | AP | AP<sub>50</sub> | AP<sub>75</sub> |
|:-----------------------------:|:---------------|:-----|:----:|:---------------:|:---------------:|
| [YOLO-Worldv1-S]()<br>(paper) | O365+GoldG | 640 | 37.6 | 52.3 | 40.7 |
| [YOLO-Worldv1-M]()<br>(paper) | O365+GoldG | 640 | 42.8 | 58.3 | 46.4 |
| [YOLO-Worldv1-L]()<br>(paper) | O365+GoldG | 640 | 44.4 | 59.8 | 48.3 |
| [YOLO-Worldv2-S]() | O365+GoldG | 640 | 37.5 | 52.0 | 40.7 |
| [YOLO-Worldv2-M]() | O365+GoldG | 640 | 42.8 | 58.2 | 46.7 |
| [YOLO-Worldv2-L]() | O365+GoldG | 640 | 45.4 | 61.0 | 49.4 |
| [DOSOD-S]() | O365+GoldG | 640 | 36.1 | 51.0 | 39.1 |
| [DOSOD-M]() | O365+GoldG | 640 | 41.7 | 57.1 | 45.2 |
| [DOSOD-L]() | O365+GoldG | 640 | 44.6 | 60.5 | 48.4 |

</font>
</div>

### 2.3 Latency On RTX 4090

We use the `trtexec` tool from [TensorRT 8.6.1.6](https://developer.nvidia.com/tensorrt) to measure latency in FP16 mode.
All models are re-parameterized with the 80 categories from COCO.
Log info can be found by clicking the FPS values.

| model | Params | FPS |
|:--------------:|:------:|:----:|
| YOLO-Worldv1-S | 13.32M | 1007 |
| YOLO-Worldv1-M | 28.93M | 702 |
| YOLO-Worldv1-L | 47.38M | 494 |
| YOLO-Worldv2-S | 12.66M | 1221 |
| YOLO-Worldv2-M | 28.20M | 771 |
| YOLO-Worldv2-L | 46.62M | 553 |
| DOSOD-S | 11.48M | 1582 |
| DOSOD-M | 26.31M | 922 |
| DOSOD-L | 44.19M | 632 |

> NOTE: FPS = 1000 / mean GPU Compute Time (in ms) reported by `trtexec`; for example, DOSOD-S's mean compute time of about 0.632 ms gives 1000 / 0.632 ≈ 1582 FPS.
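
For reference, the sketch below shows one way to reproduce this conversion from a `trtexec` run. The ONNX filename and the exact wording of the parsed summary line are assumptions about a local setup, not part of this repository.

```python
# Sketch: run trtexec in FP16 mode and turn the reported mean GPU compute time into FPS.
# The ONNX filename and the summary-line format being parsed are assumptions.
import re
import subprocess

log = subprocess.run(
    ["trtexec", "--onnx=dosod_s.onnx", "--fp16"],
    capture_output=True, text=True, check=True,
).stdout

# trtexec prints a summary line similar to:
#   GPU Compute Time: min = ..., max = ..., mean = 0.632 ms, ...
m = re.search(r"GPU Compute Time:.*mean = ([\d.]+) ms", log)
if m:
    mean_ms = float(m.group(1))
    print(f"FPS = 1000 / {mean_ms} ms = {1000.0 / mean_ms:.0f}")
```
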
### 2.4 Latency On RDK X5

We evaluate the real-time performance of the YOLO-Worldv2 models and our DOSOD models on the [D-Robotics RDK X5](https://d-robotics.cc/rdkx5) development kit.
The models are re-parameterized with the 1203 categories defined in LVIS and run on the RDK X5 with either 1 thread or 8 threads, in INT16 or INT8 quantization mode.

| model | FPS (1 thread) | FPS (8 threads) |
|:-------------------------------:|:--------------:|:---------------:|
| YOLO-Worldv2-S<br/>(INT16/INT8) | 5.962/11.044 | 6.386/12.590 |
| YOLO-Worldv2-M<br/>(INT16/INT8) | 4.136/7.290 | 4.340/7.930 |
| YOLO-Worldv2-L<br/>(INT16/INT8) | 2.958/5.377 | 3.060/5.720 |
| DOSOD-S<br/>(INT16/INT8) | 12.527/31.020 | 14.657/47.328 |
| DOSOD-M<br/>(INT16/INT8) | 8.531/20.238 | 9.471/26.36 |
| DOSOD-L<br/>(INT16/INT8) | 5.663/12.799 | 6.069/14.939 |