Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Abstract
The Qwen3-VL-Embedding and Qwen3-VL-Reranker models form an end-to-end multimodal search pipeline, leveraging multi-stage training and cross-attention mechanisms to achieve high-precision retrieval across diverse modalities.
In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking-model distillation, to generate semantically rich, high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs of up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2026). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
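The abstract describes large-scale contrastive pre-training as the first stage of the embedding model's training. The report excerpt contains no code, so the following is only a minimal sketch of a standard symmetric InfoNCE objective over paired query/document embeddings, the usual starting point for such training; the batch construction, temperature value, and later distillation stages are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    query_emb, doc_emb: (batch_size, dim) tensors where row i of each
    tensor forms a positive pair; all other rows act as in-batch negatives.
    The temperature value is illustrative, not taken from the report.
    """
    # Cosine similarity = dot product of L2-normalized vectors.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    # Average the query->doc and doc->query directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```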
Community
Introducing Qwen3-VL-Embedding and Qwen3-VL-Reranker: advancing the state of the art in multimodal retrieval and cross-modal understanding!
Highlights:
• Built upon the robust Qwen3-VL foundation model
• Processes text, images, screenshots, videos, and mixed-modality inputs
• Supports 30+ languages
• Achieves state-of-the-art performance on multimodal retrieval benchmarks
• Open source and available on Hugging Face, GitHub, and ModelScope
• API deployment on Alibaba Cloud coming soon!
Two-stage retrieval architecture (a minimal pipeline sketch follows this list):
• Embedding Model: generates semantically rich vector representations in a unified embedding space
• Reranker Model: computes fine-grained relevance scores for enhanced retrieval accuracy
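A minimal sketch of how the two stages typically compose at inference time: dense retrieval with the embedding model to build a shortlist, then fine-grained rescoring with the reranker. The `embed` and `rerank_score` callables below are hypothetical placeholders for the two models; their actual loading and calling conventions live in the GitHub repository linked below, so everything here beyond the retrieve-then-rerank control flow is an assumption.

```python
import numpy as np
from typing import Callable, Sequence

def retrieve_then_rerank(query: str,
                         corpus: Sequence[str],
                         embed: Callable[[Sequence[str]], np.ndarray],
                         rerank_score: Callable[[str, str], float],
                         top_k: int = 100,
                         final_k: int = 10) -> list[tuple[str, float]]:
    """Stage 1: dense retrieval with the embedding model (cosine similarity).
    Stage 2: rescore the shortlist with the cross-encoder reranker.
    `embed` and `rerank_score` are placeholders, not the official API."""
    # Stage 1: embed query and corpus, rank candidates by cosine similarity.
    q = embed([query])[0]
    docs = embed(corpus)
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = docs @ q
    shortlist = np.argsort(-sims)[:top_k]

    # Stage 2: fine-grained relevance scores from the reranker.
    scored = [(corpus[i], rerank_score(query, corpus[i])) for i in shortlist]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:final_k]
```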
Key application scenarios:
Image-text retrieval, video search, multimodal RAG, visual question answering, multimodal content clustering, multilingual visual search, and more!
Developer-friendly capabilities (see the sketch after this list):
• Configurable embedding dimensions
• Task-specific instruction customization
• Embedding quantization support for efficient and cost-effective downstream deployment
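A minimal sketch of how configurable (Matryoshka-style) dimensions and embedding quantization are typically applied on the consumer side: keep the leading components of a full-size embedding, re-normalize, and optionally quantize to int8 for cheaper storage. The dimension sizes and the symmetric quantization scheme are illustrative assumptions, not values or formats taken from the release.

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka-style embedding
    and re-normalize so cosine similarity remains meaningful."""
    head = emb[..., :dim]
    return head / np.linalg.norm(head, axis=-1, keepdims=True)

def quantize_int8(emb: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric int8 quantization: store int8 codes plus one float scale.
    A common, illustrative scheme, not necessarily the official one."""
    scale = np.abs(emb).max() / 127.0
    codes = np.clip(np.round(emb / scale), -127, 127).astype(np.int8)
    return codes, scale

# Example: a hypothetical 4096-d embedding reduced to 1024 dims, then quantized.
full = np.random.randn(4096).astype(np.float32)
small = truncate_embedding(full, 1024)
codes, scale = quantize_int8(small)
approx = codes.astype(np.float32) * scale   # dequantize for similarity search
```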
Hugging Face:
https://huggingface.co/collections/Qwen/qwen3-vl-embedding
https://huggingface.co/collections/Qwen/qwen3-vl-reranker
GitHub: https://github.com/QwenLM/Qwen3-VL-Embedding
Blog: https://qwen.ai/blog?id=qwen3-vl-embedding
Tech Report: https://www.arxiv.org/abs/2601.04720