---
license: cc-by-nc-sa-4.0
datasets:
- encord-team/E-MM1-100M
- encord-team/E-MM1-1M
language:
- en
---

# Model Card for `ebind-points-vision`
# EBind: Multi-Modal Embeddings

## Model Details

### Model Description

EBind is a multi-modal embedding model that supports image, video, audio, text, and 3D point cloud inputs. All modalities are projected into a shared embedding space, enabling cross-modal similarity computations.

The model builds on top of three other models: [Perception Encoder](https://huggingface.co/facebook/PE-Core-L14-336), [ImageBind](https://huggingface.co/nielsr/imagebind-huge), and [Uni3D](https://github.com/baaivision/Uni3D). As indicated by the figure at the top, data is first embedded individually by these three models. Audio and 3D point cloud embeddings are subsequently projected with an MLP into the embedding space of the Perception Encoder. The model produces unit-norm embeddings that are directly usable for similarity comparisons via dot products (cosine similarity).

This version loads the 3D point cloud and vision encoders. If you would like the version that loads all encoders, please refer to [ebind-full](https://huggingface.co/encord-team/ebind-full); if you would like the version that loads the audio and vision encoders, please refer to the [audio-vision](https://huggingface.co/encord-team/ebind-audio-vision) model.

- **Developed by:** The Encord ML Team ([ml@encord.com](mailto:ml@encord.com))
- **Model type:** Multimodal embedding model.
- **License:** The model is published under the [CC-BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.txt) license.

### Model Sources

- **Repository:** [GitHub](https://github.com/encord-team/ebind)
- **Project Page:** [e-mm1.github.io](https://e-mm1.github.io)
- **Paper:** [arXiv:2511.14229](https://arxiv.org/abs/2511.14229)
- **Demo:** [Explore the embedding space](https://data.encord.com)

## Uses

### Direct Use

The model is intended to be used with direct file inputs of the supported modalities: image, video, 3D point cloud, and text. It produces a 1024-dimensional embedding per input, suited for similarity computations.

### Downstream Use

The model could be used to build multimodal LLMs, generative models, and systems that perceive their surroundings via both visual and point cloud embeddings.

## Bias, Risks, and Limitations

The model was trained on the data specified in the paper. As such, it will be biased towards data that "lives on the internet." For specific use cases, a subsequent fine-tuning stage may be necessary.

## How to Get Started with the Model

**Option 1**

If you want to work within the repository, use [`uv`](https://docs.astral.sh/uv/) to install the necessary dependencies.

```bash
git clone https://github.com/encord-team/ebind
cd ebind
uv sync
```

**Option 2**

You can also install it as an external dependency for another project:

```bash
# Option 2.a
python -m pip install git+https://github.com/encord-team/ebind

# Option 2.b; or install a local, editable version
git clone https://github.com/encord-team/ebind
cd /path/to/your/project
python -m pip install -e /path/to/ebind
```

> [!WARNING]
> If you are running a project with `torch==2.8.0`, you should install `torchcodec==0.7.0` (as opposed to `0.8.0`,
> which is automatically installed with `uv`). `torchcodec==0.8.*` matches `torch==2.9.0`.

> [!NOTE]
> The 3D point cloud backbone has a few custom CUDA kernels that you might want to [compile](#compile-pointnet2-cuda-ops-optional).
> To do that, use Option 1 or Option 2.b above to get a local copy of the repository and compile the kernels.
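Whichever option you choose, a quick sanity check can confirm that your environment matches the constraints above before loading the model. The snippet below is a minimal sketch (not part of the `ebind` API); it only inspects installed package versions and CUDA availability:

```python
# Minimal environment sanity check (a sketch, not part of the ebind API).
# Confirms the torch/torchcodec pairing from the warning above and whether
# CUDA is available for the optional PointNet2 kernel build.
from importlib.metadata import PackageNotFoundError, version

import torch

for pkg in ("torch", "torchcodec"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")

# Expected pairings per the warning above:
# torch 2.8.* with torchcodec 0.7.*, torch 2.9.* with torchcodec 0.8.*
print("CUDA available:", torch.cuda.is_available())
```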
### Loading the Model

```python
import torch

from ebind import EBindModel, EBindProcessor

model = EBindModel.from_pretrained("encord-team/ebind-points-vision")
processor = EBindProcessor.from_pretrained("encord-team/ebind-points-vision")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()
processor = processor.to(device)
```

### Processing Multi-Modal Inputs

```python
inputs = {
    "image": ["examples/dog.png", "examples/cat.png"],
    "video": ["examples/dog.mp4", "examples/cat.mp4"],
    "text": ["A dog is howling in the street", "A cat is sleeping on the couch"],
    "points": ["examples/dog_point_cloud.npy", "examples/cat_point_cloud.npy"],
}

with torch.inference_mode():
    batch = processor(inputs, return_tensors="pt")  # set text_file_paths=True if passing text file paths instead of strings
    outputs = model.forward(**batch)
```

### Computing Cross-Modal Similarities

```python
keys = list(outputs.keys())
for i, modality in enumerate(keys):
    for modality2 in keys[i + 1:]:
        result = outputs[modality] @ outputs[modality2].T
        print(f"{modality} x {modality2} similarity:")
        print(result.cpu().detach().numpy())
        print("=" * 26)
```

Expected Output:

```
image x video similarity:
[[0.48 0.42]
 [0.41 0.6 ]]
==========================
image x text similarity:
[[0.16 0.07]
 [0.08 0.14]]
==========================
image x points similarity:
[[0.2  0.19]
 [0.18 0.19]]
==========================
video x text similarity:
[[0.26 0.05]
 [0.11 0.14]]
==========================
video x points similarity:
[[0.24 0.15]
 [0.17 0.26]]
==========================
text x points similarity:
[[0.19 0.14]
 [0.05 0.18]]
==========================
```

**Note:** The image/video similarity is significantly higher because they share the same vision encoder. For a small end-to-end retrieval example that builds on these similarities, see the sketch at the end of this card.

### Compile PointNet2 CUDA ops (optional)

If you have CUDA available, consider building the [PointNet2](https://github.com/erikwijmans/Pointnet2_PyTorch/tree/master/pointnet2_ops_lib/pointnet2_ops/_ext-src) custom ops used for embedding point clouds to get faster inference:

```bash
cd src/ebind/models/uni3d/pointnet2_ops && \
  uv run python -c "import torch,sys; sys.exit(0 if torch.cuda.is_available() else 1)" && \
  MAX_JOBS=$(nproc) uv run python setup.py build_ext --inplace
```

> We have modified the code slightly in `src/ebind/models/uni3d/pointnet2_ops/pointnet2_utils.py` to
> provide a fallback torch implementation so that the model remains executable on hardware without a GPU.

## Evaluation

We have evaluated the model on multiple benchmarks; EBind performs nearly as well as models 4 and 17 times larger. For detailed benchmark results, please see the [ebind-full model card](https://huggingface.co/encord-team/ebind-full).

## Citation

**BibTeX:**

```
@misc{broadbent2025ebindpracticalapproachspace,
  title={{EBind}: a practical approach to space binding},
  author={Jim Broadbent and Felix Cohen and Frederik Hvilshøj and Eric Landau and Eren Sasoglu},
  year={2025},
  eprint={2511.14229},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2511.14229},
}
```

## Try it now

Explore the multimodal E-MM1 dataset behind this model [here](https://data.encord.com/e-mm1/explorer)!

## Model Card Contact

Please reach out to [ml@encord.com](mailto:ml@encord.com) with any questions or feedback.
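## Example: Text-to-Point-Cloud Retrieval

As referenced in the cross-modal similarities section above, the snippet below sketches a simple retrieval loop: for each text query, pick the point cloud with the highest cosine similarity. It is an illustrative sketch, not a fixed part of the `ebind` API, and reuses the `inputs` and `outputs` variables from the quick-start snippets.

```python
# A minimal retrieval sketch (not part of the ebind API) that reuses the
# `inputs` and `outputs` variables from the quick-start snippets above.
# Because the embeddings are unit-norm, the dot product is the cosine similarity.
similarity = outputs["text"] @ outputs["points"].T  # (num_texts, num_point_clouds)
best_match = similarity.argmax(dim=-1)              # closest point cloud per text query

for query, idx in zip(inputs["text"], best_match.tolist()):
    print(f"{query!r} -> {inputs['points'][idx]}")
```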