When we built our RAG system, we did what every good engineering team does: we followed the best practices. Context-aware chunking. Hybrid search. Careful chunk size optimization. Models from the MTEB leaderboard.
We benchmarked 156 queries across three languages. Nearly every "best practice" was wrong for our use case.
The results surprised us: naive chunking outperformed context-aware chunking (70.5% vs 63.8%). Chunk size barely mattered. Dense-only retrieval beat hybrid (69.2% vs 63.5%). The winning embedding model? Not even on the MTEB leaderboard.
This article shares our complete journey: benchmarking methodology, surprising performance findings, and hard-won lessons from deploying RAG in production. From choosing between Mistral OCR and open-source alternatives, to discovering why AWS OpenSearch cost us $70/day for mediocre results, to optimizing our infrastructure—we're sharing what actually worked (and what didn't).
The only way to know what works for you is to measure. Here's how we did it.
Key Findings:
Remember, these results are specific to our use case (scientific documents in English, French and Japanese).
Retrieval-Augmented Generation (RAG) combines large language models (LLMs) with external knowledge retrieval. Instead of relying solely on the LLM's training data, RAG fetches relevant documents and provides them as context for generation.
Core components:
Our benchmark uses real documents from our production database:
We created test questions covering different difficulty levels:
Our test dataset consists of 156 unique test queries per embedding model:
Here's an example from our test set:
MetaQuestion(
    source_document="Jaiman et al. - 2023 - Mechanics of Flow-Induced Vibration Physical Mode.pdf",
    related_questions=[
        # Affirmatives
        Question(
            text="Cylinder vibrations induced by fluid flow",
            language=Language.ENGLISH,
            form=Form.affirmative,
        ),
        Question(
            text="Vibrations des cylindres induites par l’écoulement des fluides",
            language=Language.FRENCH,
            form=Form.affirmative,
        ),
        Question(
            text="流体の流れによって誘発される円筒の振動",
            language=Language.JAPANESE,
            form=Form.affirmative,
        ),
        # Interrogatives
        Question(
            text="What can you tell me about cylinder vibrations induced by fluid flow?",
            language=Language.ENGLISH,
            form=Form.interrogative,
        ),
        Question(
            text="Que peux-tu me dire sur les vibrations des cylindres induites par l'écoulement des fluides ?",
            language=Language.FRENCH,
            form=Form.interrogative,
        ),
        Question(
            text="流体の流れによって誘発される円筒の振動について教えてください。",
            language=Language.JAPANESE,
            form=Form.interrogative,
        ),
    ],
)
Each query has a known ground truth - we know which document should be retrieved.
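To make the example above easier to reuse, here is a minimal sketch of how the `MetaQuestion`, `Question`, `Language`, and `Form` types could be defined. The class and member names come from the example; the enum values, the dataclass layout, and the `flatten` helper are our assumptions for illustration, not the production definitions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Language(Enum):
    # Enum values are illustrative; only the member names appear in the example.
    ENGLISH = "en"
    FRENCH = "fr"
    JAPANESE = "ja"


class Form(Enum):
    affirmative = "affirmative"
    interrogative = "interrogative"


@dataclass(frozen=True)
class Question:
    text: str
    language: Language
    form: Form


@dataclass
class MetaQuestion:
    # Ground truth: the document that every related question should retrieve.
    source_document: str
    related_questions: list[Question] = field(default_factory=list)


def flatten(meta_questions: list[MetaQuestion]) -> list[tuple[Question, str]]:
    """Expand each MetaQuestion into (query, expected document) pairs."""
    return [
        (question, meta.source_document)
        for meta in meta_questions
        for question in meta.related_questions
    ]
```

Flattening the nested structure this way gives one row per query, which is a convenient shape for scoring retrieval results later.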
Important Note on Our Retrieval Goal:
Goal: Our primary objective was to retrieve the right document, not necessarily the exact paragraph or section. Once we have the correct document, our LLM can extract the specific information needed. This is a critical distinction that influenced our benchmark design and conclusions.
We measured retrieval performance using:
Since we only care about document-level retrieval, not chunk-level precision, this has important implications for chunk size optimization (see findings below).
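As an illustration of document-level scoring, the sketch below counts a query as a hit when any of the top-k retrieved chunks comes from the expected document. The function name, the input shape, and the default cutoff `k=5` are assumptions for this example; the cutoff is not a value taken from our benchmark.

```python
def hit_rate_at_k(results: list[tuple[list[str], str]], k: int = 5) -> float:
    """Fraction of queries whose expected document appears among the top-k chunks.

    Each item in `results` is (ranked_chunk_sources, expected_document), where
    ranked_chunk_sources[i] is the source document of the i-th retrieved chunk.
    """
    if not results:
        return 0.0
    hits = sum(1 for ranked, expected in results if expected in ranked[:k])
    return hits / len(results)
```

Because the check is on the source document rather than on a specific chunk, a query succeeds no matter which part of the right document was retrieved.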
We tested two approaches:
Naive chunking: Simple character-based splitting with overlap using LangChain's RecursiveCharacterTextSplitter. Chunks are created by splitting at natural boundaries (paragraphs, sentences) without understanding document structure.
Context-aware chunking: Parses markdown structure (headings, sections) into a hierarchical tree. Each chunk includes its parent section headings as context. For example, a chunk from "Section 2.3 → Subsection 2.3.1" includes both heading levels, preserving document structure.
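The difference between the two approaches can be sketched in a few lines of plain Python. This is a simplified illustration, not our implementation: `naive_chunks` is a fixed-size sliding window (LangChain's RecursiveCharacterTextSplitter additionally prefers natural boundaries), and `with_heading_context` is a hypothetical helper showing how parent headings are prepended to a chunk.

```python
def naive_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def with_heading_context(headings: list[str], body: str) -> str:
    """Prefix a chunk body with its parent headings, outermost first."""
    return "\n".join(headings) + "\n" + body
```

For example, `with_heading_context(["# Section 2.3", "## Subsection 2.3.1"], "Body text.")` yields a chunk that starts with both heading lines, which is the extra context a context-aware chunk carries.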
We benchmarked different combinations:
This gave us multiple configurations to systematically compare across all dimensions.
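A configuration grid like this can be generated with a Cartesian product. The sketch below is illustrative only: the `BenchmarkConfig` type and `build_grid` helper are hypothetical, and the example values (two chunking strategies, three chunk sizes, two retrieval modes) are assumptions drawn from the dimensions discussed in this article, not our exact grid.

```python
from itertools import product
from typing import NamedTuple


class BenchmarkConfig(NamedTuple):
    chunking: str
    chunk_size: int
    retrieval_mode: str


def build_grid(
    chunkings: list[str],
    chunk_sizes: list[int],
    retrieval_modes: list[str],
) -> list[BenchmarkConfig]:
    """Return every combination of the given benchmark dimensions."""
    return [
        BenchmarkConfig(*combo)
        for combo in product(chunkings, chunk_sizes, retrieval_modes)
    ]


# Example values only; adapt them to the dimensions you want to compare.
GRID = build_grid(
    ["naive", "context_aware"],
    [2000, 4000, 6000],
    ["dense", "hybrid"],
)
```

Enumerating the grid up front makes it easy to check that every combination was actually run before comparing results.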
We compared three embedding models across all configurations. The chunking strategy was the naive one, with a chunk size of 2,000 characters and an overlap of 400 characters.
Model Selection Criteria:
We specifically wanted API-accessible embedding models to avoid managing model inference infrastructure. This led us to test:
AWS Titan V2 (amazon.titan-embed-text-v2:0): Accessed via AWS Bedrock
Qwen 8B (Alibaba-NLP/gte-Qwen2-7B-instruct): Accessed via Hugging Face Inference API (provider: Nebius AI)
Mistral (mistral-embed): Accessed via Mistral AI API
All three provide simple REST API access, which was essential for our production requirements.
Results:
The MTEB leaderboard surprise:
If you're choosing an embedding model, you'd naturally check the MTEB (Massive Text Embedding Benchmark) leaderboard on Hugging Face. It's the standard benchmark for comparing embedding models across various tasks.
Here's the surprising part: AWS Titan V2 isn't even on the MTEB leaderboard. Yet it outperformed both Qwen 8B and Mistral (which ARE on the leaderboard) for our use case.
Why Titan wins for us:
The trap of traditional benchmarks:
Most embedding benchmarks test on English-only affirmative queries. Here's what happens when we limit our analysis to just those conditions:
Critical Insight: Under "traditional" benchmark conditions (English affirmative questions only), Mistral performs almost on par with Titan (76.9% vs 80.8% hit rate). This looks promising! However, when we tested across all languages and query forms, Mistral's performance dropped significantly (39.1% overall vs 69.2% for Titan).
The key difference: consistency. Titan maintains strong performance across English, French, Japanese, and both query forms. Mistral excels in narrow conditions but lacks robustness across diverse real-world queries.
This is precisely why we chose Titan over Mistral despite Mistral's competitive performance under ideal conditions. Production RAG systems need models that work consistently across varied query patterns, not just those that excel in controlled benchmarks.
Lesson learned: When evaluating embedding models, test them under diverse conditions that match your production use case. A model that dominates English-only benchmarks may struggle with multilingual content or different query formulations. Look for consistency, not just peak performance.
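One way to check consistency rather than peak performance is to break the hit rate down by language and query form and look at the weakest slice. The sketch below assumes each benchmark record is a `(language, form, hit)` tuple; the function names and the record shape are hypothetical.

```python
from collections import defaultdict


def hit_rate_by_slice(
    records: list[tuple[str, str, bool]],
) -> dict[tuple[str, str], float]:
    """Map each (language, form) slice to its hit rate."""
    totals: dict[tuple[str, str], int] = defaultdict(int)
    hits: dict[tuple[str, str], int] = defaultdict(int)
    for language, form, hit in records:
        key = (language, form)
        totals[key] += 1
        hits[key] += int(hit)
    return {key: hits[key] / totals[key] for key in totals}


def worst_slice(rates: dict[tuple[str, str], float]) -> tuple[tuple[str, str], float]:
    """Return the slice with the lowest hit rate."""
    return min(rates.items(), key=lambda item: item[1])
```

A model whose worst slice is far below its best slice is the kind that looks strong on an English-only benchmark but weak on mixed production traffic.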
We tested various chunk sizes: 2K, 4K, 6K characters with Titan V2 and 2K, 10K, 40K characters with Qwen 8B.
Results: Chunk size had minimal impact on performance. The variation between different chunk sizes was negligible—all configurations performed within a few percentage points of each other.
Why chunk size doesn't matter much for us:
Remember, our goal is document-level retrieval, not finding the exact paragraph. As long as any chunk from the correct document ranks highly, we succeed. Larger chunks still contain the relevant content, just with more surrounding context.
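To illustrate why chunk boundaries matter less at document level, here is a minimal sketch that collapses a ranked list of chunk hits into a ranked list of documents by keeping each document's best chunk score. The function name and the `(source_document, score)` input shape are assumptions for the example.

```python
def rank_documents(chunk_hits: list[tuple[str, float]]) -> list[str]:
    """Rank documents by the best score among their retrieved chunks.

    `chunk_hits` holds (source_document, similarity_score) pairs,
    where a higher score means a better match.
    """
    best: dict[str, float] = {}
    for document, score in chunk_hits:
        if document not in best or score > best[document]:
            best[document] = score
    return sorted(best, key=lambda doc: best[doc], reverse=True)
```

Only the strongest chunk per document counts here, so splitting a document into more or fewer chunks changes the result only if it changes which document owns the top-scoring chunk.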
Practical recommendation: Don't over-optimize this
Unlike typical RAG advice that emphasizes careful chunk size tuning, our data shows chunk size is simply not a critical parameter. Here's what we found:
What we use:
Bottom line: We switched from 2K to larger chunks for cost efficiency, but honestly, any reasonable size will work. Don't spend days optimizing this—focus on embedding model selection and retrieval mode instead.
We tested naive chunking (simple character-based splitting) vs context-aware chunking (respecting document structure like sections and paragraphs).
Results: Naive chunking outperformed context-aware chunking for Titan embeddings (71.8% vs 67.9% hit rate at best, 70.5% vs 63.8% average across chunk sizes with dense-only).
Interpretation:
Recommendation: Start simple with naive chunking. Only use context-aware if you have specific structure-dependent requirements.
Understanding Retrieval Modes:
Vector databases like Qdrant support different retrieval approaches:
Dense search: Uses semantic embeddings (like Titan or Qwen) to find documents based on meaning. Converts queries and documents into high-dimensional vectors and finds similar vectors. Great for conceptual matches but can miss exact keyword requirements.
Sparse search: Uses keyword-based matching (similar to BM25 or TF-IDF). Qdrant implements this via FastEmbed's sparse embeddings (we used prithvida/Splade_PP_en_v1). Excellent for exact term matches but misses semantic similarity.
Hybrid search: Combines both approaches—semantic understanding from dense embeddings + keyword precision from sparse embeddings. The results are merged using score fusion. Conventional wisdom says this should always be better than either approach alone.
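As an illustration of how two result lists can be merged, the sketch below implements reciprocal rank fusion (RRF), one common fusion method. We are not claiming this is the exact fusion our setup applied; the function name and the constant `k=60` (a conventional default) are assumptions for the example.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; each item scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)
```

With a dense ranking `["a", "b", "c"]` and a sparse ranking `["c", "a", "d"]`, items found by both lists (`a` and `c`) rise above items found by only one. This also shows how a weak sparse ranking can pull down a good dense result.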
Our Findings:
Conventional wisdom says hybrid search should outperform dense-only. This was actually why we chose Qdrant. Our data shows the opposite for our use case.
Findings: Dense-only achieved 69.2% hit rate vs 63.5% for hybrid (using Titan embeddings with 2K character naive chunking).
Why this might happen:
Important context: We chose Qdrant specifically for its hybrid search capabilities. While dense-only performed better in our benchmarks, hybrid remains valuable for:
Recommendation: Benchmark both modes on your specific corpus. Don't assume hybrid is always better.
Our documents span English, French, and Japanese. We needed robust cross-lingual retrieval.
Results:
English significantly outperformed French and Japanese. This suggests our RAG system works best with English content, though French and Japanese still achieved reasonable retrieval rates. The multilingual support of Titan embeddings was validated, even if performance varied by language.
Model Comparison Across Languages:
The graph below compares all three embedding models (Titan, Qwen, and Mistral) across the three languages we tested:
Titan consistently outperformed other models across all languages, with particularly strong performance in English. The gap between models was most pronounced in French and Japanese, where Titan's multilingual capabilities showed clear advantages.
Model Comparison Across Query Forms:
We also analyzed performance across different query forms (interrogative vs affirmative):
Interestingly, Titan and Mistral perform better with affirmative statements, as theory would predict (since most information in text is presented in the affirmative form). Qwen, however, performs better with interrogative queries, a result we cannot explain.
Decision: We use Mistral OCR for all document processing.
Evaluation: We manually tested 3 complex PDFs (equations, tables, diagrams, scanned pages) with:
Findings: Only Mistral OCR correctly parsed complex mathematical notation and tables in scanned documents. The output quality difference was dramatic. It even worked with chemical equations.
Why Markdown Conversion is Critical:
Converting all documents (PDFs, Word files, etc.) to markdown isn't just about performance—it's absolutely essential for building a debuggable RAG system:
Bottom line: Without markdown as an intermediate format, you can't effectively debug or iterate on your RAG system. It's not optional—it's foundational.
Trade-off: Mistral OCR is expensive ($1 per 1,000 pages). For scientific documents where accuracy is critical, it's worth it. If you have simpler PDFs, try open-source alternatives first.
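For budgeting, the per-page price translates directly into a corpus-level estimate. The helper below is a hypothetical sketch that uses the rate quoted above as its default; check current pricing before relying on it.

```python
def ocr_cost_usd(pages: int, usd_per_1000_pages: float = 1.0) -> float:
    """Estimate OCR cost for a page count at a flat per-1,000-pages rate."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages / 1000 * usd_per_1000_pages
```

For instance, a 250,000-page corpus at this rate comes to `ocr_cost_usd(250_000)`, which is 250.0 USD.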
Decision: We use Qdrant (managed service).
Context: We evaluated Milvus, Qdrant, AWS OpenSearch, Pinecone, and PostgreSQL with pgvector. Being AWS-native, we initially tried OpenSearch.
Critical finding: Don't use AWS OpenSearch for vector search. The cheapest option is $70/day (~$2,100/month) for a single-node cluster. This is severely overpriced for what it offers.
Why Qdrant:
Why not Qdrant:
Why not the other alternatives:
Building a production RAG system requires balancing performance, cost, and complexity. Here's our recommended starting point:
Recommended Configuration:
Key Takeaway: Don't blindly follow "best practices" from blog posts. Benchmark on your specific document types and query patterns. Our findings contradicted common wisdom (dense-only beating hybrid, naive beating context-aware), but they were reproducible and significant for our use case. What worked for us may not work for you. The only way to know is to measure.
Our benchmark code and methodology are detailed in this article for reproducibility. However, the scientific documents we used are proprietary and closed-source (nuclear engineering research and regulatory materials). The source code can't be shared either, because it is part of our mono-repo.
While we can't share the raw data, we're happy to answer questions about our methodology, testing approach, or specific findings. Feel free to reach out if you're implementing something similar!
Questions or feedback? We'd love to hear about your RAG experiences, especially if you found different results!
Here are key code snippets from our implementation that you might find useful. They are deliberately basic:
Naive Chunking - Simple character-based splitting with overlap using LangChain:
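# src/pyjimmy/rag/chunk.py:32-34
from pathlib import Path

from langchain_text_splitters import RecursiveCharacterTextSplitter


def split_markdown(s3_markdown_path: Path, max_chunk_size: int, strategy: ChunkingStrategy) -> list[str]:
    # Excerpt: markdown_text is assumed to hold the contents of s3_markdown_path;
    # the loading code and the context-aware branch are not part of this snippet.
    if strategy == ChunkingStrategy.naive:
        return RecursiveCharacterTextSplitter(
            chunk_size=max_chunk_size,
            chunk_overlap=400,  # characters shared between consecutive chunks
            add_start_index=True,
        ).split_text(markdown_text)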
Context-Aware Chunking - Preserves document structure by parsing markdown hierarchically:
# src/pyjimmy/rag/chunk.py:82-99
# Excerpt: _add_child, current_context, _split_text_into_chunks and logger are defined elsewhere.
from __future__ import annotations  # lets Section refer to itself in type hints


class Section:
    def __init__(self, body: str = "", title: str | None = None, level: int = 0):
        self.body: str = body
        self.title: str | None = title
        self.level: int = level
        self.children: list[Section] = []

    @classmethod
    def from_markdown(cls, markdown_text: str) -> Section:
        """Parse markdown into hierarchical sections based on heading levels."""
        lines = markdown_text.split("\n")
        root = cls()
        stack = [root]
        for line in lines:
            if line.startswith("#"):
                # Heading level = number of leading "#" characters
                level = len(line) - len(line.lstrip("#"))
                title = line.lstrip("#").strip()
                new_section = cls(title=title, level=level)
                # Pop back to the nearest ancestor with a shallower heading level
                while stack and stack[-1].level >= level:
                    stack.pop()
                stack[-1]._add_child(new_section)
                stack.append(new_section)
            else:
                if stack:
                    stack[-1].body += line + "\n"
        return root

    def to_chunks(self, max_chunk_size: int, context: str = "") -> list[str]:
        # Accumulate ancestor headings so every chunk carries its section path
        context += self.current_context
        chunks = []
        if self.body.replace("\n", "").strip():
            if len(context) >= max_chunk_size:
                logger.warning(
                    f"Context too large ({len(context)} chars) for max_chunk_size {max_chunk_size}. Skipping context."
                )
                context = ""
            chunks += [
                context + splitted_body
                for splitted_body in _split_text_into_chunks(self.body, max_chunk_size - len(context))
            ]
        # Recurse into subsections, passing the accumulated heading context down
        for child in self.children:
            chunks += child.to_chunks(max_chunk_size, context)
        return chunks
Converting PDFs to Markdown - Handling complex scientific documents:
# src/pyjimmy/rag/pdf.py:22-45
def convert_pdf_to_markdown(pdf_path: Path, output_path: Path) -> tuple[Path, Path]:
    logger.info(f"Start converting PDF {pdf_path} to markdown...")
    # Split large PDFs to stay under 50MB API limit
    pdfs_under_50mb = _split_pdf_to_under_50mb_improved(pdf_path, output_path)
    final_markdown = ""
    final_pages = []
    for pdf in pdfs_under_50mb:
        ocr_response = _convert_pdf_under_50MB_to_markdown(pdf)
        markdown = _get_combined_markdown(ocr_response=ocr_response)
        pages = ocr_response.model_dump()["pages"]
        final_markdown += markdown
        final_pages.extend(pages)
    markdown_path = output_path / f"{pdf_path.stem}.md"
    markdown_path.write_text(final_markdown)
    # Excerpt: yaml_path is defined in code omitted here (presumably where final_pages is saved)
    return markdown_path, yaml_path
Mistral OCR API Call:
# src/pyjimmy/rag/pdf.py:175-192
def _convert_pdf_under_50MB_to_markdown(pdf_path: Path) -> OCRResponse:
    # Upload the PDF, then hand the OCR endpoint a signed URL pointing at it
    uploaded_pdf = MISTRAL_CLIENT.files.upload(
        file=File(
            file_name=pdf_path.name,
            content=pdf_path.read_bytes(),
            content_type="application/pdf",
        ),
        purpose="ocr",
    )
    signed_url = MISTRAL_CLIENT.files.get_signed_url(file_id=uploaded_pdf.id)
    return MISTRAL_CLIENT.ocr.process(
        model="mistral-ocr-latest",
        document={"type": "document_url", "document_url": signed_url.url},
        include_image_base64=True,
    )
Configuring Different Embedding Models:
# src/pyjimmy/rag/embedding.py:33-58
def get_dense_embedding_model(model: DenseEmbeddingModel) -> Embeddings:
    if model == DenseEmbeddingModel.titan:
        # AWS Titan V2: 8,192 max tokens, 50K max characters
        return BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")
    if model == DenseEmbeddingModel.qwen:
        # Qwen 8B: 32K token context window
        # Excerpt: token (the Hugging Face API token) is defined in code not shown here
        return HuggingFaceEndpointEmbeddings(
            client=InferenceClient(provider="nebius", api_key=token),
            model="Qwen/Qwen3-Embedding-8B",
        )
    if model == DenseEmbeddingModel.mistral:
        return MistralAIEmbeddings(
            mistral_api_key=os.environ["PYJIMMY_MISTRAL_API_KEY"],
            model="mistral-embed",
        )
    # Fail loudly instead of silently returning None for an unknown model
    raise ValueError(f"Unsupported embedding model: {model}")


# Sparse embeddings for hybrid search
SPARSE_EMBEDDING_MODEL = FastEmbedSparse(model_name="prithvida/Splade_PP_en_v1")
Creating Collections with Hybrid Search:
# src/pyjimmy/rag/qdrant.py:27-40
def create_vector_store(collection_name: str, model: DenseEmbeddingModel):
    logger.info(f"Creating Qdrant collection {collection_name}...")
    # Passing an empty text list creates the collection without inserting anything
    QdrantVectorStore.from_texts(
        texts=[],
        url=QDRANT_URL,
        api_key=QDRANT_API_KEY,
        collection_name=collection_name,
        embedding=get_dense_embedding_model(model),
        sparse_embedding=SPARSE_EMBEDDING_MODEL,  # For hybrid search
        retrieval_mode=RetrievalMode.HYBRID,
    )
Adding Documents with Retry Logic:
# src/pyjimmy/rag/qdrant.py:64-72, 78-104
# No stop condition is set, so tenacity retries indefinitely, waiting 60 seconds between attempts
@retry(wait=wait_fixed(60))
def _add_texts_to_vector_store(
    vector_store: QdrantVectorStore,
    texts: list[str],
    s3_folder: Path,
) -> list[str]:
    """Retry on failure due to embedding API rate limits."""
    return vector_store.add_texts(
        texts,
        metadatas=[{"s3_folder": s3_folder} for _ in texts],
    )


def add_markdown_to_qdrant(
    s3_markdown_path: Path,
    max_chunk_size: int,
    chunking_strategy: ChunkingStrategy,
    model: DenseEmbeddingModel,
):
    # Excerpt: collection_name and s3_folder are defined in code not shown here
    chunks = split_markdown(s3_markdown_path, max_chunk_size, strategy=chunking_strategy)
    vector_store = get_vector_store(collection_name, model=model)
    vector_ids = []
    max_chunks_one_request = 10
    # Batch processing to avoid timeouts
    for start_index in range(0, len(chunks), max_chunks_one_request):
        vector_ids.extend(
            _add_texts_to_vector_store(
                vector_store,
                chunks[start_index : start_index + max_chunks_one_request],
                s3_folder=s3_folder,
            )
        )
Thanks for sharing. I am also building applications that deal with technical text, though mostly English. One thing that helped my information retrieval pipeline was to add a reranker (e.g., https://huggingface.co/BAAI/bge-reranker-v2-m3 ) after the dense/hybrid search. If needed, these are lightweight enough that they can be fine-tuned to very specific domains of text (see https://huggingface.co/blog/train-reranker), though that hasn’t been needed yet for what I am building.