---
license: cc-by-nc-sa-4.0
datasets:
- encord-team/E-MM1-100M
- encord-team/E-MM1-1M
language:
- en
---
# Model Card for `ebind-points-vision`
![ebind](https://cdn-uploads.huggingface.co/production/uploads/62a0da842e30aaf94ebaaa12/EohI585GsKe5cFlvWyWAX.png)
<div style="display: flex; justify-content: space-between;">
<div style="flex: 1; padding: 10px;">
<!-- <a href="todohttps://arxiv.org/abs/YYMM.NNNNN" target="_blank" rel="noreferrer" style="text-decoration:none; ">
<img src="https://img.shields.io/badge/arXiv-YYMM.NNNNN-b31b1b.svg?logo=arxiv" alt="arXiv Paper" style="vertical-align:middle;">
</a> -->
<a href="https://colab.research.google.com/github/encord-team/ebind/blob/main/misc/demo.ipynb" target="_blank" rel="noreferrer" style="text-decoration:none; ">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" style="vertical-align:middle;">
</a>
<a href="https://huggingface.co/encord-team/ebind-full" target="_blank" rel="noreferrer" style="text-decoration:none; ">
<img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face Models" style="vertical-align:middle;">
</a>
<a href="https://huggingface.co/datasets/encord-team/E-MM1-100M" target="_blank" rel="noreferrer" style="text-decoration:none; ">
<img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Datasets-blue" alt="Hugging Face Datasets" style="vertical-align:middle;">
</a>
<a href="https://e-mm1.github.io" target="_blank" rel="noreferrer" style="text-decoration:none; ">
<img src="https://img.shields.io/badge/Project%20Page-blue?logo=github" alt="Blog" style="vertical-align:middle;">
</a>
<div style="flex:1"></div>
<a href="https://encord.com/blog/how-we-built-multimodal-dataset-emm1/" target="_blank" rel="noreferrer" style="text-decoration:none; ">
<img src="https://img.shields.io/badge/%F0%9F%93%96-Blog-blue" alt="Blog" style="vertical-align:middle;">
</a>
<a href="https://twitter.com/encord_team" target="_blank" rel="noreferrer" style="text-decoration:none; ">
<img alt="Twitter Follow" src="https://img.shields.io/twitter/follow/encord_team?label=%40encord_team&amp;style=social" style="vertical-align: middle">
</a>
<img alt="PRs Welcome" src="https://img.shields.io/badge/PRs-Welcome-blue" style="vertical-align: middle;">
</div>
</div>
# EBind: Multi-Modal Embeddings
## Model Details
### Model Description
EBind is a multi-modal embedding model that supports image, video, audio, text, and 3D point cloud inputs. All modalities are projected into a shared embedding space, enabling cross-modal similarity computations.
The model builds on top of three other models: [Perception Encoder](https://huggingface.co/facebook/PE-Core-L14-336), [ImageBind](https://huggingface.co/nielsr/imagebind-huge), and [Uni3D](https://github.com/baaivision/Uni3D).
As shown in the figure at the top, data is first embedded individually by these three models.
Audio and 3D point cloud embeddings are then projected with an MLP into the embedding space of the Perception Encoder.
The model produces unit-norm embeddings that are directly usable for similarity comparisons via dot products (cosine similarity).
This version loads the 3D point cloud and vision encoders.
If you would like the version that loads all encoders, please refer to [ebind-full](https://huggingface.co/encord-team/ebind-full); if you would like the version that loads the audio and vision encoders, please refer to the [audio-vision](https://huggingface.co/encord-team/ebind-audio-vision) model.
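Because the embeddings are unit-norm, the dot product of two embeddings is their cosine similarity. The following minimal sketch (using random vectors in place of real model outputs) illustrates the comparison:

```python
import torch
import torch.nn.functional as F

# Stand-ins for model outputs: batches of 1024-dimensional embeddings,
# normalized to unit length like the embeddings EBind produces.
image_emb = F.normalize(torch.randn(2, 1024), dim=-1)
text_emb = F.normalize(torch.randn(3, 1024), dim=-1)

# For unit-norm vectors, the dot product equals the cosine similarity.
similarity = image_emb @ text_emb.T  # shape (2, 3), values in [-1, 1]
print(similarity)
```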
- **Developed by:** The Encord ML Team ([[email protected]](mailto:[email protected]))
- **Model type:** Multimodal embedding model.
- **License:** The model is published under the [CC-BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.txt) license.
### Model Sources
- **Repository:** [Github](https://github.com/encord-team/ebind)
- **Project Page:** [e-mm1.github.io](https://e-mm1.github.io)
- **Paper:** [arXiv:2511.14229](https://arxiv.org/abs/2511.14229)
- **Demo:** [Explore the embedding space](https://data.encord.com)
## Uses
### Direct Use
The model is intended to be used with direct file inputs of the modalities above: image, video, 3D point cloud, and text. It produces a 1024-dimensional embedding per input, suited for similarity computations.
### Downstream Use
The model could be used to build multimodal LLMs, generative models, and systems that perceive their surroundings via both visual and point cloud embeddings.
## Bias, Risks, and Limitations
The model was built on data specified in the paper.
As such, it will be biased towards data that "lives on the internet."
For specific use-cases, a subsequent fine-tuning stage may be necessary.
## How to Get Started with the Model
**Option 1**
If you want to work within the repository, use [`uv`](https://docs.astral.sh/uv/) to install the necessary dependencies.
```bash
git clone https://github.com/encord-team/ebind
cd ebind
uv sync
```
**Option 2**
You can also install it as an external dependency for another project:
```bash
# Option 2.a
python -m pip install git+https://github.com/encord-team/ebind
# Option 2.b; or install a local, editable version
git clone https://github.com/encord-team/ebind
cd /path/to/your/project
python -m pip install -e /path/to/ebind
```
> [!WARNING]
> If you are running a project with `torch==2.8.0`, you should install `torchcodec==0.7.0` (as opposed to the `0.8.0` release,
> which is installed automatically with `uv`). `torchcodec==0.8.*` matches `torch==2.9.0`.
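You can confirm which versions you ended up with using a quick, standard-library-only check:

```python
from importlib.metadata import version

# Print the installed versions to verify the torch / torchcodec pairing above.
print("torch:", version("torch"))
print("torchcodec:", version("torchcodec"))
```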
> [!NOTE]
> The 3D point cloud backbone has a few custom CUDA kernels that you might want to [compile](#compile-pointnet2-cuda-ops-optional).
> To do that, use Option 1 or Option 2.b above to get a local copy of the repository and compile the kernels.
### Loading the Model
```python
import torch
from ebind import EBindModel, EBindProcessor
model = EBindModel.from_pretrained("encord-team/ebind-points-vision")
processor = EBindProcessor.from_pretrained("encord-team/ebind-points-vision")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()
processor = processor.to(device)
```
### Processing Multi-Modal Inputs
```python
inputs = {
"image": ["examples/dog.png", "examples/cat.png"],
"video": ["examples/dog.mp4", "examples/cat.mp4"],
"text": ["A dog is howling in the street", "A cat is sleeping on the couch"],
"points": ["examples/dog_point_cloud.npy", "examples/cat_point_cloud.npy"],
}
with torch.inference_mode():
    batch = processor(inputs, return_tensors="pt")  # set text_file_paths=True if passing text file paths instead of strings
    outputs = model(**batch)
```
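`outputs` is a dictionary keyed by modality. Each value should be a batch of unit-norm, 1024-dimensional embeddings, which you can sanity-check like this (illustrative):

```python
for modality, embeddings in outputs.items():
    # Expect shape (batch_size, 1024) and rows with (approximately) unit norm.
    print(modality, tuple(embeddings.shape), embeddings.norm(dim=-1).tolist())
```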
### Computing Cross-Modal Similarities
```python
keys = list(outputs.keys())
for i, modality in enumerate(keys):
    for modality2 in keys[i + 1:]:
        result = outputs[modality] @ outputs[modality2].T
        print(f"{modality} x {modality2} similarity:")
        print(result.cpu().detach().numpy())
        print("=" * 26)
```
Expected Output:
```
image x video similarity:
[[0.48 0.42]
[0.41 0.6 ]]
==========================
image x text similarity:
[[0.16 0.07]
[0.08 0.14]]
==========================
image x points similarity:
[[0.2 0.19]
[0.18 0.19]]
==========================
video x text similarity:
[[0.26 0.05]
[0.11 0.14]]
==========================
video x points similarity:
[[0.24 0.15]
[0.17 0.26]]
==========================
text x points similarity:
[[0.19 0.14]
[0.05 0.18]]
==========================
```
**Note:** The image/video similarity is significantly higher because they share the same vision encoder.
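For retrieval-style use, take the row-wise argmax of any of the similarity matrices above, e.g. to find the best-matching text for each image (a sketch building on `outputs` from the previous steps):

```python
# Rank texts for each image by cosine similarity and pick the best match.
image_to_text = outputs["image"] @ outputs["text"].T  # (num_images, num_texts)
best_text_idx = image_to_text.argmax(dim=-1)
for i, j in enumerate(best_text_idx.tolist()):
    print(f"image {i} best matches text {j} (score {image_to_text[i, j].item():.2f})")
```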
### Compile PointNet2 CUDA ops (optional)
If you have CUDA available, consider building the [PointNet2](https://github.com/erikwijmans/Pointnet2_PyTorch/tree/master/pointnet2_ops_lib/pointnet2_ops/_ext-src) custom ops used for embedding point clouds to get faster inference:
```bash
cd src/ebind/models/uni3d/pointnet2_ops && \
uv run python -c "import torch,sys; sys.exit(0 if torch.cuda.is_available() else 1)" && \
MAX_JOBS=$(nproc) uv run python setup.py build_ext --inplace
```
> We have modified the code slightly in `src/ebind/models/uni3d/pointnet2_ops/pointnet2_utils.py` to
> include a fallback torch implementation so that the model can run on hardware without a GPU.
## Evaluation
We have evaluated the model on multiple benchmarks.
We highlight that EBind performs nearly on par with models 4 and 17 times its size.
For detailed benchmark results, please see the [ebind-full model card](https://huggingface.co/encord-team/ebind-full).
## Citation
**BibTeX:**
```
@misc{broadbent2025ebindpracticalapproachspace,
title={{EBind}: a practical approach to space binding},
author={Jim Broadbent and Felix Cohen and Frederik Hvilshøj and Eric Landau and Eren Sasoglu},
year={2025},
eprint={2511.14229},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.14229},
}
```
## Try it now
Explore the multimodal E-MM1 dataset behind this model [here](https://data.encord.com/e-mm1/explorer)!
## Model Card Contact
Please reach out to [[email protected]](mailto:[email protected]) with any questions or feedback.