OwlColBERT🦉 Code Retrieval Model

This is a code-specialized ColBERT-style late-interaction retrieval model built with PyLate.

The model is based on the NightOwl / OwlColBERT model family, whose backbone was trained from random initialization with a custom code-aware tokenizer on a code-heavy pretraining corpus.
The final retrieval model was built through a multi-stage pipeline: code-specialized pretraining, dense retrieval training, hard negative mining with Qwen3-Embedding-0.6B, PyLate late-interaction training, and a final Cornstack refinement stage.

It is designed for semantic code search, especially natural-language-to-code retrieval.
Given a natural language query such as "function that validates an email address", the model retrieves relevant code snippets, functions, or chunks by scoring query-code pairs with the MaxSim late-interaction operator.

Highlights

  • Architecture: ModernBERT-based ColBERT model
  • Backbone: NightOwl / OwlColBERT code-specialized model family
  • Tokenizer: custom code-aware tokenizer with whitespace-related special tokens
  • Pretraining: trained from random initialization on code-heavy and technical corpora
  • Continued pretraining: code-specialized line-level masking
  • Retrieval type: multi-vector / late interaction
  • Interaction: MaxSim
  • Embedding dimension: 256
  • Document length: up to 2048 tokens
  • Query length: up to 512 tokens
  • Training loss: CachedContrastive
  • Hard negative mining: Qwen3-Embedding-0.6B
  • Final refinement: Cornstack, 3000 steps
  • Evaluation: CodeSearchNetRetrieval
  • Average nDCG@10 on CodeSearchNetRetrieval: 0.9031

Model Details

Model Description

  • Model Type: PyLate ColBERT / late-interaction retrieval model
  • Base Model: Shuu12121/OwlColBERT-v0
  • Backbone family: NightOwl / ModernBERT
  • Document Max Length: 2048 tokens
  • Query Max Length: 512 tokens
  • Output Dimensionality: 256 dimensions
  • Similarity Function: MaxSim
  • Training Objective: CachedContrastive
  • Primary Domain: source code and code-related text
  • Main Task: natural-language-to-code retrieval

Architecture

ColBERT(
  (0): Transformer({'max_seq_length': 2047, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Dense({'in_features': 768, 'out_features': 256, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
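The Dense head above is a bias-free linear projection from the 768-d backbone hidden size down to 256-d token vectors, with an identity activation. A minimal NumPy sketch of the shape transformation (weights here are random, purely for illustration):

```python
import numpy as np

# Sketch of the Dense head above: token embeddings from the 768-d
# ModernBERT backbone are projected to 256 dimensions by a bias-free
# linear layer with an identity activation (weights here are random).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(256, 768))  # (out_features, in_features)

token_embeddings = rng.normal(size=(10, 768))  # 10 token vectors for one text
multi_vectors = token_embeddings @ W.T         # -> (10, 256), one vector per token
```

Each text thus keeps one 256-d vector per token rather than a single pooled embedding.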

Pretraining and Model Construction

This model is part of the NightOwl / OwlColBERT model family. The backbone was not simply initialized from an off-the-shelf pretrained language model. Instead, it was built with a code-specialized tokenizer and trained from random initialization.

Custom Code-aware Tokenizer

The backbone model uses a custom tokenizer designed for source code.

Whitespace-related tokens were added as special tokens to reduce undesirable token merging around indentation and formatting. This is useful for source code because whitespace, indentation, and formatting often contain structural information, especially in languages such as Python.

Because the tokenizer was changed, the model was trained from random initialization rather than directly reusing the weights of an existing pretrained checkpoint.
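The exact token inventory of the custom tokenizer is not published. As an illustration only, indentation-aware pre-tokenization along the following lines keeps whitespace from being merged into neighboring subwords; the token names here are hypothetical:

```python
# Hypothetical whitespace sentinels; the real token inventory is not public.
INDENT_TOKENS = {4: "<indent4>", 8: "<indent8>"}

def pretokenize_line(line: str) -> list:
    """Replace leading indentation with a dedicated special token so the
    subword tokenizer never merges whitespace into neighboring tokens."""
    stripped = line.lstrip(" ")
    width = len(line) - len(stripped)
    tokens = []
    if width in INDENT_TOKENS:
        tokens.append(INDENT_TOKENS[width])
    elif width:
        tokens.append(f"<indent{width}>")
    tokens.extend(stripped.split())
    return tokens

tokens = pretokenize_line("    return x + 1")
# -> ['<indent4>', 'return', 'x', '+', '1']
```

This matters for Python-like languages, where indentation carries block structure.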

Pretraining Corpus

The base model was pretrained on a mixture of code and technical text, including:

  • GitHub source code collected from public repositories
  • code-related documents
  • StarCoder2-extra-style code and technical data
  • technical documentation
  • arXiv-style scientific text
  • mathematical problem data
  • additional programming-related datasets

The goal of this stage was to build a code-aware backbone that can represent both natural language queries and source code.

Code-specialized Continued Pretraining

After the initial pretraining stage, the model was further adapted to source code using a code-oriented continued pretraining objective.

This stage used line-level masking, where code lines were masked while whitespace tokens were ignored during masking. The goal was to encourage the model to learn stronger representations of code-bearing tokens, identifiers, expressions, and line-level code semantics.
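A minimal sketch of what line-level masking with whitespace exclusion could look like; the token names and masking ratio are assumptions, not the actual training recipe:

```python
import random

MASK = "[MASK]"
WHITESPACE_TOKENS = {"<indent4>", "<newline>"}  # hypothetical whitespace specials

def mask_lines(lines, mask_ratio=0.5, seed=0):
    """Line-level masking sketch: whole lines are selected for masking,
    and within a selected line only non-whitespace tokens are replaced."""
    rng = random.Random(seed)
    out = []
    for line in lines:
        if rng.random() < mask_ratio:
            out.append([t if t in WHITESPACE_TOKENS else MASK for t in line])
        else:
            out.append(list(line))
    return out

code = [["def", "f", "(", ")", ":"], ["<indent4>", "return", "1"]]
masked = mask_lines(code, mask_ratio=1.0)
# whitespace tokens survive masking; code-bearing tokens become [MASK]
```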

Dense Retrieval Training

Before training the final late-interaction model, a dense retriever was trained using Sentence Transformers.

This dense model was trained on CodeSearchNet / Cornstack-style retrieval data. It maps each query or code document into a single 768-dimensional vector and scores pairs with cosine similarity.

This dense retriever served as an intermediate retrieval model before the final ColBERT-style late-interaction training stage.
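Single-vector scoring of this kind reduces to cosine similarity between pooled embeddings. A toy NumPy illustration with 2-d stand-ins for the 768-d vectors (the actual encoder is the Sentence Transformers model described above):

```python
import numpy as np

def cosine_scores(query_vec, doc_vecs):
    """Single-vector dense retrieval: one pooled vector per text,
    scored against the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q

q = np.array([1.0, 0.0])
docs = np.array([[1.0, 0.0],   # same direction as the query
                 [0.0, 1.0],   # orthogonal
                 [1.0, 1.0]])  # partially aligned
scores = cosine_scores(q, docs)
best = int(np.argmax(scores))  # -> 0
```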

Hard Negative Mining

Hard negatives were generated from the Owl code retrieval corpus. A strong external embedding model, Qwen3-Embedding-0.6B, was used to mine difficult negatives, which were then used to train the late-interaction model.
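The mining step can be sketched as ranking the corpus with the embedding model and keeping the most similar non-positive documents. A toy illustration, with precomputed 2-d vectors standing in for Qwen3-Embedding-0.6B embeddings:

```python
import numpy as np

def mine_hard_negatives(query_emb, corpus_embs, positive_idx, k=2):
    """Hard-negative mining sketch: rank the corpus by embedding similarity
    and keep the top-k non-positive documents as hard negatives."""
    sims = corpus_embs @ query_emb
    order = np.argsort(-sims)
    return [int(i) for i in order if i != positive_idx][:k]

q = np.array([1.0, 0.2])
corpus = np.array([[1.0, 0.2],   # 0: the positive itself
                   [0.9, 0.1],   # 1: very similar -> hard negative
                   [0.0, 1.0],   # 2: unrelated
                   [0.8, 0.3]])  # 3: similar -> hard negative
negs = mine_hard_negatives(q, corpus, positive_idx=0, k=2)  # -> [1, 3]
```

Negatives that are near-misses like these force the late-interaction model to learn finer distinctions than random negatives would.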

PyLate / ColBERT Late-interaction Training

The final model was trained as a ColBERT-style late-interaction retriever using PyLate.

Unlike single-vector dense retrievers, this model represents each query and document as a sequence of token-level vectors. The relevance score is computed with MaxSim, allowing fine-grained matching between natural language query tokens and code tokens.

This helps the model capture local matches such as:

  • API names
  • function names
  • identifiers
  • code expressions
  • syntax-adjacent semantic matches
  • natural language comments and descriptions
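The MaxSim operator itself is simple: for each query token vector, take the maximum similarity over all document token vectors, then sum over query tokens. A minimal NumPy sketch (not the PyLate implementation):

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    """Late-interaction MaxSim: for each query token vector, take the
    maximum similarity over all document token vectors, then sum."""
    sims = query_vecs @ doc_vecs.T  # (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())

# Toy example with unit 4-d token vectors.
q = np.eye(2, 4)                      # 2 query tokens
d = np.array([[1.0, 0, 0, 0],         # 3 document tokens
              [0, 0, 1.0, 0],
              [0, 1.0, 0, 0]])
score = maxsim(q, d)  # each query token finds an exact match -> 2.0
```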

Final Cornstack Refinement

Finally, the model was further trained for 3000 steps on Cornstack-style data.

This final stage was intended to:

  • increase hard negative diversity
  • improve retrieval robustness
  • improve JavaScript retrieval performance
  • further adapt the late-interaction model to high-quality code retrieval pairs

Intended Use

This model is intended for:

  • natural-language-to-code retrieval
  • semantic code search
  • function-level code search
  • chunk-level code retrieval
  • reranking candidates from a first-stage retriever
  • local repository search
  • code assistant retrieval backends

Example queries:

function that validates an email address
parse json configuration file
create HTTP server middleware
sort list of objects by key
read csv file and return dataframe
calculate cosine similarity between vectors

Out-of-Scope Use

This model is not intended for:

  • code generation
  • instruction following
  • general chat
  • formal program verification
  • security auditing
  • license compliance checking
  • proving semantic equivalence between programs

The model retrieves code that is semantically similar to a query, but it does not guarantee that the retrieved code is correct, secure, licensed appropriately, or production-ready.

Usage

Install PyLate:

pip install -U pylate

Indexing Documents

from pylate import indexes, models

model = models.ColBERT(
    model_name_or_path="Shuu12121/OwlColBERT-v1",
)

index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,
)

documents_ids = ["1", "2", "3"]

documents = [
    "def validate_email(email): ...",
    "def load_json_config(path): ...",
    "def create_server(host, port): ...",
]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,
    show_progress_bar=True,
)

index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

Loading an Existing Index

from pylate import indexes

index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
)

Retrieval

from pylate import indexes, models, retrieve

model = models.ColBERT(
    model_name_or_path="Shuu12121/OwlColBERT-v1",
)

index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
)

retriever = retrieve.ColBERT(index=index)

queries_embeddings = model.encode(
    ["function that validates an email address"],
    batch_size=32,
    is_query=True,
    show_progress_bar=True,
)

results = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,
)

print(results)

Reranking

from pylate import rank, models

model = models.ColBERT(
    model_name_or_path="Shuu12121/OwlColBERT-v1",
)

queries = [
    "function that validates an email address",
]

documents = [
    [
        "def validate_email(email): ...",
        "def parse_json(path): ...",
        "def train_model(dataset): ...",
    ]
]

documents_ids = [
    ["email_validator", "json_parser", "trainer"]
]

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)

print(reranked_documents)

Evaluation

The model was evaluated on CodeSearchNetRetrieval using an MTEB-style retrieval evaluation setup.

This evaluation covers six programming languages:

  • Python
  • JavaScript
  • Go
  • Ruby
  • Java
  • PHP

The main metric is nDCG@10.
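For reference, nDCG@10 with binary relevance can be computed as follows (a standard formulation, not tied to the exact MTEB implementation):

```python
import math

def ndcg_at_10(relevances):
    """nDCG@10 for binary relevance labels: DCG of the ranked list
    divided by the DCG of the ideal ordering, truncated at rank 10."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:10]))
    ideal = sorted(relevances, reverse=True)
    return dcg(relevances) / dcg(ideal)

# One relevant document ranked second: DCG = 1/log2(3), ideal DCG = 1
score = ndcg_at_10([0, 1, 0, 0])  # ~0.631
```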

Summary

| Metric | Average |
| --- | --- |
| Accuracy@1 | 0.8445 |
| nDCG@10 | 0.9031 |
| MRR@10 | 0.8869 |
| MAP@100 | 0.8873 |
| Recall@10 | 0.9518 |

Per-language Results

| Language | Accuracy@1 | Recall@10 | nDCG@10 | MRR@10 | MAP@100 |
| --- | --- | --- | --- | --- | --- |
| Python | 0.895 | 0.987 | 0.9476 | 0.9343 | 0.9349 |
| JavaScript | 0.747 | 0.881 | 0.8187 | 0.7982 | 0.8010 |
| Go | 0.923 | 0.992 | 0.9630 | 0.9531 | 0.9533 |
| Ruby | 0.792 | 0.926 | 0.8640 | 0.8436 | 0.8451 |
| Java | 0.888 | 0.968 | 0.9313 | 0.9192 | 0.9197 |
| PHP | 0.822 | 0.957 | 0.8937 | 0.8729 | 0.8736 |
| Average | 0.8445 | 0.9518 | 0.9031 | 0.8869 | 0.8873 |

Full nDCG Results

| Language | nDCG@1 | nDCG@3 | nDCG@5 | nDCG@10 | nDCG@20 | nDCG@100 | nDCG@1000 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Python | 0.8950 | 0.9424 | 0.9457 | 0.9476 | 0.9494 | 0.9500 | 0.9501 |
| JavaScript | 0.7470 | 0.8036 | 0.8131 | 0.8187 | 0.8268 | 0.8305 | 0.8351 |
| Go | 0.9230 | 0.9586 | 0.9624 | 0.9630 | 0.9633 | 0.9640 | 0.9642 |
| Ruby | 0.7920 | 0.8508 | 0.8591 | 0.8640 | 0.8673 | 0.8725 | 0.8743 |
| Java | 0.8880 | 0.9239 | 0.9280 | 0.9313 | 0.9323 | 0.9345 | 0.9352 |
| PHP | 0.8220 | 0.8785 | 0.8874 | 0.8937 | 0.8949 | 0.8972 | 0.8996 |
| Average | 0.8445 | 0.8930 | 0.8993 | 0.9031 | 0.9057 | 0.9081 | 0.9097 |

Observations

The model performs particularly well on Go, Python, and Java, achieving nDCG@10 scores of 0.9630, 0.9476, and 0.9313, respectively.

The weakest language is JavaScript, with nDCG@10 of 0.8187. This suggests that JavaScript may benefit from additional language-specific hard negatives, more diverse JavaScript training data, or further refinement on JavaScript-heavy retrieval datasets.

Overall, the model achieves 0.9031 average nDCG@10 on multilingual CodeSearchNetRetrieval, indicating strong performance for natural-language-to-code retrieval.

Training Details

Training Dataset

The model was trained on code retrieval pairs with multiple hard negatives per query.

The training data included:

  • query
  • document
  • negative_0 ... negative_99
  • negative_scores

Hard negatives were mined from the Owl code retrieval corpus using Qwen3-Embedding-0.6B.
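Put together, one training record might look like the following; only the field names come from the list above, and the contents are purely illustrative:

```python
# Hypothetical training record; only the field names come from the card.
example = {
    "query": "function that validates an email address",
    "document": "def validate_email(email): ...",
    **{f"negative_{i}": f"def unrelated_{i}(): ..." for i in range(100)},
    "negative_scores": [0.0] * 100,  # illustrative; one score per mined negative
}
```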

Loss

pylate.losses.cached_contrastive.CachedContrastive

Important Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Learning rate | 3e-6 |
| Max steps | 3000 |
| Batch size per device | 120 |
| Weight decay | 0.01 |
| Scheduler | cosine |
| Precision | bf16 |
| Gradient checkpointing | true |
| Optimizer | adamw_torch_fused |

Training Logs

The following table reports validation scores during the final 3000-step Cornstack refinement stage. These values were used for monitoring training progress and are not the final CodeSearchNetRetrieval test results reported above.

| Step | Python | Go | Java | JavaScript | PHP | Ruby |
| --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.9446 | 0.9211 | 0.8530 | 0.8031 | 0.8018 | 0.9055 |
| 500 | 0.9440 | 0.9256 | 0.8509 | 0.8122 | 0.8001 | 0.9026 |
| 1000 | 0.9454 | 0.9278 | 0.8495 | 0.8166 | 0.8002 | 0.9042 |
| 1500 | 0.9448 | 0.9275 | 0.8498 | 0.8172 | 0.8008 | 0.9080 |
| 2000 | 0.9447 | 0.9271 | 0.8504 | 0.8195 | 0.7994 | 0.9065 |
| 2500 | 0.9447 | 0.9272 | 0.8505 | 0.8175 | 0.7992 | 0.9072 |
| 3000 | 0.9454 | 0.9271 | 0.8495 | 0.8175 | 0.7998 | 0.9068 |

Recommendations

For best results:

  • chunk long files before indexing
  • keep function-level or class-level boundaries when possible
  • preserve code formatting where possible
  • include file paths or surrounding context if they are useful for retrieval
  • use a first-stage retriever when indexing very large repositories
  • use this model as a reranker when latency is important
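For Python sources, function-level chunking can be done with the standard library's `ast` module. This is a simple sketch; production chunkers typically also handle classes, methods, overlap, and surrounding context:

```python
import ast

def function_chunks(source: str):
    """Split a Python file into function-level chunks, preserving each
    function's original formatting (a simple sketch using `ast`)."""
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

src = "def a():\n    return 1\n\ndef b():\n    return 2\n"
chunks = function_chunks(src)  # two chunks, one per function
```

Each chunk can then be encoded with `model.encode(..., is_query=False)` and indexed as its own document.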

Limitations

  • The model is optimized for code retrieval and may not perform well as a general-purpose text embedding model.
  • Performance may vary across programming languages and repository styles.
  • Very short or ambiguous queries may produce unstable results.
  • Long files should be chunked before indexing.
  • The model retrieves semantically related code but does not verify correctness or security.
  • The evaluation is focused on CodeSearchNetRetrieval; performance on other code intelligence tasks may differ.
  • The model may inherit biases, artifacts, or licensing constraints from GitHub-collected code, technical documents, and retrieval datasets used during training.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 5.3.0
  • PyLate: 1.4.0
  • Transformers: 4.56.2
  • PyTorch: 2.8.0+cu128
  • Accelerate: 1.12.0
  • Datasets: 3.6.0
  • Tokenizers: 0.22.1

Citation

If you use this model, please cite the relevant libraries and methods.

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}

PyLate

@inproceedings{DBLP:conf/cikm/ChaffinS25,
  author       = {Antoine Chaffin and Rapha{\"{e}}l Sourty},
  title        = {PyLate: Flexible Training and Retrieval for Late Interaction Models},
  booktitle    = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
  year         = {2025},
  url          = {https://github.com/lightonai/pylate},
  doi          = {10.1145/3746252.3761608},
}

CachedContrastive

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}