OwlColBERT🦉 Code Retrieval Model

This is a code-specialized ColBERT-style late-interaction retrieval model built with PyLate.

The model is based on the NightOwl / OwlColBERT model family, whose backbone was trained from random initialization with a custom code-aware tokenizer on a code-heavy pretraining corpus.
The final retrieval model was built through a multi-stage pipeline: code-specialized pretraining, dense retrieval training, hard negative mining with Qwen3-Embedding-0.6B, PyLate late-interaction training, and a final Cornstack refinement stage.

It is designed for semantic code search, especially natural-language-to-code retrieval.
Given a natural language query such as "function that validates an email address", the model retrieves relevant code snippets, functions, or chunks by scoring query-code pairs with the MaxSim late-interaction operator.

Highlights

  • Architecture: ModernBERT-based ColBERT model
  • Backbone: NightOwl / OwlColBERT code-specialized model family
  • Tokenizer: custom code-aware tokenizer with whitespace-related special tokens
  • Pretraining: trained from random initialization on code-heavy and technical corpora
  • Continued pretraining: code-specialized line-level masking
  • Retrieval type: multi-vector / late interaction
  • Interaction: MaxSim
  • Embedding dimension: 256
  • Document length: up to 2048 tokens
  • Query length: up to 512 tokens
  • Training loss: CachedContrastive
  • Hard negative mining: Qwen3-Embedding-0.6B
  • Final refinement: Cornstack, 3000 steps
  • Evaluation: CodeSearchNetRetrieval
  • Average nDCG@10 on CodeSearchNetRetrieval: 0.9031

Model Details

Model Description

  • Model Type: PyLate ColBERT / late-interaction retrieval model
  • Base Model: Shuu12121/OwlColBERT-v0
  • Backbone family: NightOwl / ModernBERT
  • Document Max Length: 2048 tokens
  • Query Max Length: 512 tokens
  • Output Dimensionality: 256 dimensions
  • Similarity Function: MaxSim
  • Training Objective: CachedContrastive
  • Primary Domain: source code and code-related text
  • Main Task: natural-language-to-code retrieval

Architecture

ColBERT(
  (0): Transformer({'max_seq_length': 2047, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Dense({'in_features': 768, 'out_features': 256, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
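The Dense head above is a bias-free linear projection from the 768-d backbone hidden size down to 256-d token vectors, with an identity activation. A minimal NumPy sketch of the shape transformation (weights here are random, purely for illustration):

```python
import numpy as np

# Sketch of the Dense head above: token embeddings from the 768-d
# ModernBERT backbone are projected to 256 dimensions by a bias-free
# linear layer with an identity activation (weights here are random).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(256, 768))  # (out_features, in_features)

token_embeddings = rng.normal(size=(10, 768))  # 10 token vectors for one text
multi_vectors = token_embeddings @ W.T         # -> (10, 256), one vector per token
```

Each text thus keeps one 256-d vector per token rather than a single pooled embedding.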

Pretraining and Model Construction

This model is part of the NightOwl / OwlColBERT model family. The backbone was not simply initialized from an off-the-shelf pretrained language model. Instead, it was built with a code-specialized tokenizer and trained from random initialization.

Custom Code-aware Tokenizer

The backbone model uses a custom tokenizer designed for source code.

Whitespace-related tokens were added as special tokens to reduce undesirable token merging around indentation and formatting. This is useful for source code because whitespace, indentation, and formatting often contain structural information, especially in languages such as Python.

Because the tokenizer was changed, the model was trained from random initialization rather than directly reusing the weights of an existing pretrained checkpoint.
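The exact token inventory of the custom tokenizer is not published. As an illustration only, indentation-aware pre-tokenization along the following lines keeps whitespace from being merged into neighboring subwords; the token names here are hypothetical:

```python
# Hypothetical whitespace sentinels; the real token inventory is not public.
INDENT_TOKENS = {4: "<indent4>", 8: "<indent8>"}

def pretokenize_line(line: str) -> list:
    """Replace leading indentation with a dedicated special token so the
    subword tokenizer never merges whitespace into neighboring tokens."""
    stripped = line.lstrip(" ")
    width = len(line) - len(stripped)
    tokens = []
    if width in INDENT_TOKENS:
        tokens.append(INDENT_TOKENS[width])
    elif width:
        tokens.append(f"<indent{width}>")
    tokens.extend(stripped.split())
    return tokens

tokens = pretokenize_line("    return x + 1")
# -> ['<indent4>', 'return', 'x', '+', '1']
```

This matters for Python-like languages, where indentation carries block structure.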

Pretraining Corpus

The base model was pretrained on a mixture of code and technical text, including:

  • GitHub source code collected from public repositories
  • code-related documents
  • StarCoder2-extra-style code and technical data
  • technical documentation
  • arXiv-style scientific text
  • mathematical problem data
  • additional programming-related datasets

The goal of this stage was to build a code-aware backbone that can represent both natural language queries and source code.

Code-specialized Continued Pretraining

After the initial pretraining stage, the model was further adapted to source code using a code-oriented continued pretraining objective.

This stage used line-level masking, where code lines were masked while whitespace tokens were ignored during masking. The goal was to encourage the model to learn stronger representations of code-bearing tokens, identifiers, expressions, and line-level code semantics.
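A minimal sketch of what line-level masking with whitespace exclusion could look like; the token names and masking ratio are assumptions, not the actual training recipe:

```python
import random

MASK = "[MASK]"
WHITESPACE_TOKENS = {"<indent4>", "<newline>"}  # hypothetical whitespace specials

def mask_lines(lines, mask_ratio=0.5, seed=0):
    """Line-level masking sketch: whole lines are selected for masking,
    and within a selected line only non-whitespace tokens are replaced."""
    rng = random.Random(seed)
    out = []
    for line in lines:
        if rng.random() < mask_ratio:
            out.append([t if t in WHITESPACE_TOKENS else MASK for t in line])
        else:
            out.append(list(line))
    return out

code = [["def", "f", "(", ")", ":"], ["<indent4>", "return", "1"]]
masked = mask_lines(code, mask_ratio=1.0)
# whitespace tokens survive masking; code-bearing tokens become [MASK]
```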

Dense Retrieval Training

Before training the final late-interaction model, a dense retriever was trained using Sentence Transformers.

This dense model was trained on CodeSearchNet / Cornstack-style retrieval data. It maps each query or code document into a single 768-dimensional vector and scores pairs with cosine similarity.

This dense retriever served as an intermediate retrieval model before the final ColBERT-style late-interaction training stage.
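Single-vector scoring of this kind reduces to cosine similarity between pooled embeddings. A toy NumPy illustration with 2-d stand-ins for the 768-d vectors (the actual encoder is the Sentence Transformers model described above):

```python
import numpy as np

def cosine_scores(query_vec, doc_vecs):
    """Single-vector dense retrieval: one pooled vector per text,
    scored against the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q

q = np.array([1.0, 0.0])
docs = np.array([[1.0, 0.0],   # same direction as the query
                 [0.0, 1.0],   # orthogonal
                 [1.0, 1.0]])  # partially aligned
scores = cosine_scores(q, docs)
best = int(np.argmax(scores))  # -> 0
```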

Hard Negative Mining

Hard negatives were generated from the Owl code retrieval corpus. A strong external embedding model, Qwen3-Embedding-0.6B, was used to mine difficult negatives, which were then used to train the late-interaction model.
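The mining step can be sketched as ranking the corpus with the embedding model and keeping the most similar non-positive documents. A toy illustration, with precomputed 2-d vectors standing in for Qwen3-Embedding-0.6B embeddings:

```python
import numpy as np

def mine_hard_negatives(query_emb, corpus_embs, positive_idx, k=2):
    """Hard-negative mining sketch: rank the corpus by embedding similarity
    and keep the top-k non-positive documents as hard negatives."""
    sims = corpus_embs @ query_emb
    order = np.argsort(-sims)
    return [int(i) for i in order if i != positive_idx][:k]

q = np.array([1.0, 0.2])
corpus = np.array([[1.0, 0.2],   # 0: the positive itself
                   [0.9, 0.1],   # 1: very similar -> hard negative
                   [0.0, 1.0],   # 2: unrelated
                   [0.8, 0.3]])  # 3: similar -> hard negative
negs = mine_hard_negatives(q, corpus, positive_idx=0, k=2)  # -> [1, 3]
```

Negatives that are near-misses like these force the late-interaction model to learn finer distinctions than random negatives would.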

PyLate / ColBERT Late-interaction Training

The final model was trained as a ColBERT-style late-interaction retriever using PyLate.

Unlike single-vector dense retrievers, this model represents each query and document as a sequence of token-level vectors. The relevance score is computed with MaxSim, allowing fine-grained matching between natural language query tokens and code tokens.

This helps the model capture local matches such as:

  • API names
  • function names
  • identifiers
  • code expressions
  • syntax-adjacent semantic matches
  • natural language comments and descriptions
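The MaxSim operator itself is simple: for each query token vector, take the maximum similarity over all document token vectors, then sum over query tokens. A minimal NumPy sketch (not the PyLate implementation):

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    """Late-interaction MaxSim: for each query token vector, take the
    maximum similarity over all document token vectors, then sum."""
    sims = query_vecs @ doc_vecs.T  # (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())

# Toy example with unit 4-d token vectors.
q = np.eye(2, 4)                      # 2 query tokens
d = np.array([[1.0, 0, 0, 0],         # 3 document tokens
              [0, 0, 1.0, 0],
              [0, 1.0, 0, 0]])
score = maxsim(q, d)  # each query token finds an exact match -> 2.0
```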

Final Cornstack Refinement

Finally, the model was further trained for 3000 steps on Cornstack-style data.

This final stage was intended to:

  • increase hard negative diversity
  • improve retrieval robustness
  • improve JavaScript retrieval performance
  • further adapt the late-interaction model to high-quality code retrieval pairs

Intended Use

This model is intended for:

  • natural-language-to-code retrieval
  • semantic code search
  • function-level code search
  • chunk-level code retrieval
  • reranking candidates from a first-stage retriever
  • local repository search
  • code assistant retrieval backends

Example queries:

function that validates an email address
parse json configuration file
create HTTP server middleware
sort list of objects by key
read csv file and return dataframe
calculate cosine similarity between vectors

Out-of-Scope Use

This model is not intended for:

  • code generation
  • instruction following
  • general chat
  • formal program verification
  • security auditing
  • license compliance checking
  • proving semantic equivalence between programs

The model retrieves code that is semantically similar to a query, but it does not guarantee that the retrieved code is correct, secure, licensed appropriately, or production-ready.

Usage

Install PyLate:

pip install -U pylate

Indexing Documents

from pylate import indexes, models

model = models.ColBERT(
    model_name_or_path="Shuu12121/OwlColBERT-v1",
)

index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,
)

documents_ids = ["1", "2", "3"]

documents = [
    "def validate_email(email): ...",
    "def load_json_config(path): ...",
    "def create_server(host, port): ...",
]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,
    show_progress_bar=True,
)

index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

Loading an Existing Index

from pylate import indexes

index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
)

Retrieval

from pylate import indexes, models, retrieve

model = models.ColBERT(
    model_name_or_path="Shuu12121/OwlColBERT-v1",
)

index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
)

retriever = retrieve.ColBERT(index=index)

queries_embeddings = model.encode(
    ["function that validates an email address"],
    batch_size=32,
    is_query=True,
    show_progress_bar=True,
)

results = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,
)

print(results)

Reranking

from pylate import rank, models

model = models.ColBERT(
    model_name_or_path="Shuu12121/OwlColBERT-v1",
)

queries = [
    "function that validates an email address",
]

documents = [
    [
        "def validate_email(email): ...",
        "def parse_json(path): ...",
        "def train_model(dataset): ...",
    ]
]

documents_ids = [
    ["email_validator", "json_parser", "trainer"]
]

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)

print(reranked_documents)

Evaluation

The model was evaluated on CodeSearchNetRetrieval using an MTEB-style retrieval evaluation setup.

This evaluation covers six programming languages:

  • Python
  • JavaScript
  • Go
  • Ruby
  • Java
  • PHP

The main metric is nDCG@10.
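For reference, nDCG@10 with binary relevance can be computed as follows (a standard formulation, not tied to the exact MTEB implementation):

```python
import math

def ndcg_at_10(relevances):
    """nDCG@10 for binary relevance labels: DCG of the ranked list
    divided by the DCG of the ideal ordering, truncated at rank 10."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:10]))
    ideal = sorted(relevances, reverse=True)
    return dcg(relevances) / dcg(ideal)

# One relevant document ranked second: DCG = 1/log2(3), ideal DCG = 1
score = ndcg_at_10([0, 1, 0, 0])  # ~0.631
```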

Summary

| Metric | Average |
| --- | --- |
| Accuracy@1 | 0.8445 |
| nDCG@10 | 0.9031 |
| MRR@10 | 0.8869 |
| MAP@100 | 0.8873 |
| Recall@10 | 0.9518 |

Per-language Results

| Language | Accuracy@1 | Recall@10 | nDCG@10 | MRR@10 | MAP@100 |
| --- | --- | --- | --- | --- | --- |
| Python | 0.895 | 0.987 | 0.9476 | 0.9343 | 0.9349 |
| JavaScript | 0.747 | 0.881 | 0.8187 | 0.7982 | 0.8010 |
| Go | 0.923 | 0.992 | 0.9630 | 0.9531 | 0.9533 |
| Ruby | 0.792 | 0.926 | 0.8640 | 0.8436 | 0.8451 |
| Java | 0.888 | 0.968 | 0.9313 | 0.9192 | 0.9197 |
| PHP | 0.822 | 0.957 | 0.8937 | 0.8729 | 0.8736 |
| Average | 0.8445 | 0.9518 | 0.9031 | 0.8869 | 0.8873 |

Full nDCG Results

| Language | nDCG@1 | nDCG@3 | nDCG@5 | nDCG@10 | nDCG@20 | nDCG@100 | nDCG@1000 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Python | 0.8950 | 0.9424 | 0.9457 | 0.9476 | 0.9494 | 0.9500 | 0.9501 |
| JavaScript | 0.7470 | 0.8036 | 0.8131 | 0.8187 | 0.8268 | 0.8305 | 0.8351 |
| Go | 0.9230 | 0.9586 | 0.9624 | 0.9630 | 0.9633 | 0.9640 | 0.9642 |
| Ruby | 0.7920 | 0.8508 | 0.8591 | 0.8640 | 0.8673 | 0.8725 | 0.8743 |
| Java | 0.8880 | 0.9239 | 0.9280 | 0.9313 | 0.9323 | 0.9345 | 0.9352 |
| PHP | 0.8220 | 0.8785 | 0.8874 | 0.8937 | 0.8949 | 0.8972 | 0.8996 |
| Average | 0.8445 | 0.8930 | 0.8993 | 0.9031 | 0.9057 | 0.9081 | 0.9097 |

Observations

The model performs particularly well on Go, Python, and Java, achieving nDCG@10 scores of 0.9630, 0.9476, and 0.9313, respectively.

The weakest language is JavaScript, with nDCG@10 of 0.8187. This suggests that JavaScript may benefit from additional language-specific hard negatives, more diverse JavaScript training data, or further refinement on JavaScript-heavy retrieval datasets.

Overall, the model achieves 0.9031 average nDCG@10 on multilingual CodeSearchNetRetrieval, indicating strong performance for natural-language-to-code retrieval.

Training Details

Training Dataset

The model was trained on code retrieval pairs with multiple hard negatives per query.

The training data included:

  • query
  • document
  • negative_0 ... negative_99
  • negative_scores

Hard negatives were mined from the Owl code retrieval corpus using Qwen3-Embedding-0.6B.
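Put together, one training record might look like the following; only the field names come from the list above, and the contents are purely illustrative:

```python
# Hypothetical training record; only the field names come from the card.
example = {
    "query": "function that validates an email address",
    "document": "def validate_email(email): ...",
    **{f"negative_{i}": f"def unrelated_{i}(): ..." for i in range(100)},
    "negative_scores": [0.0] * 100,  # illustrative; one score per mined negative
}
```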

Loss

pylate.losses.cached_contrastive.CachedContrastive

Important Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Learning rate | 3e-6 |
| Max steps | 3000 |
| Batch size per device | 120 |
| Weight decay | 0.01 |
| Scheduler | cosine |
| Precision | bf16 |
| Gradient checkpointing | true |
| Optimizer | adamw_torch_fused |

Training Logs

The following table reports validation scores during the final 3000-step Cornstack refinement stage. These values were used for monitoring training progress and are not the final CodeSearchNetRetrieval test results reported above.

| Step | Python | Go | Java | JavaScript | PHP | Ruby |
| --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.9446 | 0.9211 | 0.8530 | 0.8031 | 0.8018 | 0.9055 |
| 500 | 0.9440 | 0.9256 | 0.8509 | 0.8122 | 0.8001 | 0.9026 |
| 1000 | 0.9454 | 0.9278 | 0.8495 | 0.8166 | 0.8002 | 0.9042 |
| 1500 | 0.9448 | 0.9275 | 0.8498 | 0.8172 | 0.8008 | 0.9080 |
| 2000 | 0.9447 | 0.9271 | 0.8504 | 0.8195 | 0.7994 | 0.9065 |
| 2500 | 0.9447 | 0.9272 | 0.8505 | 0.8175 | 0.7992 | 0.9072 |
| 3000 | 0.9454 | 0.9271 | 0.8495 | 0.8175 | 0.7998 | 0.9068 |

Recommendations

For best results:

  • chunk long files before indexing
  • keep function-level or class-level boundaries when possible
  • preserve code formatting where possible
  • include file paths or surrounding context if they are useful for retrieval
  • use a first-stage retriever when indexing very large repositories
  • use this model as a reranker when latency is important
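For Python sources, function-level chunking can be done with the standard library's `ast` module. This is a simple sketch; production chunkers typically also handle classes, methods, overlap, and surrounding context:

```python
import ast

def function_chunks(source: str):
    """Split a Python file into function-level chunks, preserving each
    function's original formatting (a simple sketch using `ast`)."""
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

src = "def a():\n    return 1\n\ndef b():\n    return 2\n"
chunks = function_chunks(src)  # two chunks, one per function
```

Each chunk can then be encoded with `model.encode(..., is_query=False)` and indexed as its own document.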

Limitations

  • The model is optimized for code retrieval and may not perform well as a general-purpose text embedding model.
  • Performance may vary across programming languages and repository styles.
  • Very short or ambiguous queries may produce unstable results.
  • Long files should be chunked before indexing.
  • The model retrieves semantically related code but does not verify correctness or security.
  • The evaluation is focused on CodeSearchNetRetrieval; performance on other code intelligence tasks may differ.
  • The model may inherit biases, artifacts, or licensing constraints from GitHub-collected code, technical documents, and retrieval datasets used during training.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 5.3.0
  • PyLate: 1.4.0
  • Transformers: 4.56.2
  • PyTorch: 2.8.0+cu128
  • Accelerate: 1.12.0
  • Datasets: 3.6.0
  • Tokenizers: 0.22.1

Citation

If you use this model, please cite the relevant libraries and methods.

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}

PyLate

@inproceedings{DBLP:conf/cikm/ChaffinS25,
  author       = {Antoine Chaffin and Rapha{\"{e}}l Sourty},
  title        = {PyLate: Flexible Training and Retrieval for Late Interaction Models},
  booktitle    = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
  year         = {2025},
  url          = {https://github.com/lightonai/pylate},
  doi          = {10.1145/3746252.3761608},
}

CachedContrastive

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}