OwlColBERT 🦉 Code Retrieval Model
This is a code-specialized ColBERT-style late-interaction retrieval model built with PyLate.
The model is based on the NightOwl / OwlColBERT model family, whose backbone was trained from random initialization with a custom code-aware tokenizer and code-heavy pretraining corpus.
The final retrieval model was built through a multi-stage pipeline: code-specialized pretraining, dense retrieval training, hard negative mining with Qwen3-Embedding-0.6B, PyLate late-interaction training, and final Cornstack refinement.
It is designed for semantic code search, especially natural-language-to-code retrieval.
Given a natural language query such as "function that validates an email address", the model retrieves relevant code snippets, functions, or chunks by scoring query-code pairs with the MaxSim late-interaction operator.
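To make the scoring concrete, here is a minimal pure-Python sketch of the MaxSim operator with toy 2-D token vectors (the actual model uses 256-dimensional token embeddings):

```python
import math

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style MaxSim: for each query token vector, take the best
    cosine similarity over all document token vectors, then sum."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    return sum(max(cos(q, d) for d in doc_vecs) for q in query_vecs)

# Toy token embeddings: the query shares directions with doc_a but not doc_b.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
doc_b = [[-1.0, 0.0], [0.0, -1.0]]
assert maxsim_score(query, doc_a) > maxsim_score(query, doc_b)
```

Because each query token independently finds its best-matching document token, MaxSim rewards local matches (an identifier, an API name) even when the rest of the document differs.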
Highlights
- Architecture: ModernBERT-based ColBERT model
- Backbone: NightOwl / OwlColBERT code-specialized model family
- Tokenizer: custom code-aware tokenizer with whitespace-related special tokens
- Pretraining: trained from random initialization on code-heavy and technical corpora
- Continued pretraining: code-specialized line-level masking
- Retrieval type: multi-vector / late interaction
- Interaction: MaxSim
- Embedding dimension: 256
- Document length: up to 2048 tokens
- Query length: up to 512 tokens
- Training loss: CachedContrastive
- Hard negative mining: Qwen3-Embedding-0.6B
- Final refinement: Cornstack, 3000 steps
- Evaluation: CodeSearchNetRetrieval
- Average nDCG@10 on CodeSearchNetRetrieval: 0.9031
Model Details
Model Description
- Model Type: PyLate ColBERT / late-interaction retrieval model
- Base Model: Shuu12121/OwlColBERT-v0
- Backbone Family: NightOwl / ModernBERT
- Document Max Length: 2048 tokens
- Query Max Length: 512 tokens
- Output Dimensionality: 256 dimensions
- Similarity Function: MaxSim
- Training Objective: CachedContrastive
- Primary Domain: source code and code-related text
- Main Task: natural-language-to-code retrieval
Architecture
```
ColBERT(
  (0): Transformer({'max_seq_length': 2047, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Dense({'in_features': 768, 'out_features': 256, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
```
Pretraining and Model Construction
This model is part of the NightOwl / OwlColBERT model family. The backbone was not simply initialized from an off-the-shelf pretrained language model. Instead, it was built with a code-specialized tokenizer and trained from random initialization.
Custom Code-aware Tokenizer
The backbone model uses a custom tokenizer designed for source code.
Whitespace-related tokens were added as special tokens to reduce undesirable token merging around indentation and formatting. This is useful for source code because whitespace, indentation, and formatting often contain structural information, especially in languages such as Python.
Because the tokenizer was changed, the model was trained from random initialization rather than directly reusing the weights of an existing pretrained checkpoint.
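As an illustration of the idea (the actual token inventory of the OwlColBERT tokenizer is not listed here, so the `<indent4>` and `<newline>` names below are hypothetical), a pre-tokenization step that keeps indentation and newlines as standalone tokens might look like:

```python
import re

# Hypothetical special tokens; the real tokenizer's names may differ.
SPECIALS = {"    ": "<indent4>", "\n": "<newline>"}

def pretokenize(code: str):
    """Split code so that indentation and newlines survive as standalone
    tokens instead of being merged into neighbouring subwords."""
    out = []
    for piece in re.split(r"(\n|    )", code):
        if piece:
            out.append(SPECIALS.get(piece, piece))
    return out

tokens = pretokenize("def f():\n    return 1\n")
assert "<indent4>" in tokens and tokens.count("<newline>") == 2
```

Keeping whitespace as explicit tokens means the model can attend to block structure directly, which is especially relevant for indentation-sensitive languages such as Python.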
Pretraining Corpus
The base model was pretrained on a mixture of code and technical text, including:
- GitHub source code collected from public repositories
- code-related documents
- StarCoder2-extra-style code and technical data
- technical documentation
- arXiv-style scientific text
- mathematical problem data
- additional programming-related datasets
The goal of this stage was to build a code-aware backbone that can represent both natural language queries and source code.
Code-specialized Continued Pretraining
After the initial pretraining stage, the model was further adapted to source code using a code-oriented continued pretraining objective.
This stage used line-level masking, where code lines were masked while whitespace tokens were ignored during masking. The goal was to encourage the model to learn stronger representations of code-bearing tokens, identifiers, expressions, and line-level code semantics.
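A toy sketch of this masking scheme, assuming already pre-tokenized code lines and hypothetical whitespace token names:

```python
import random

MASK = "[MASK]"
WHITESPACE = {"<indent>", "<newline>", " "}  # hypothetical whitespace tokens

def line_level_mask(lines, mask_prob=0.5, seed=0):
    """Illustrative line-level MLM: within each code line, mask content
    tokens at random while never masking whitespace tokens."""
    rng = random.Random(seed)
    masked = []
    for line in lines:
        masked.append([
            MASK if tok not in WHITESPACE and rng.random() < mask_prob else tok
            for tok in line
        ])
    return masked

lines = [["def", " ", "add", "(", "a", ",", "b", ")", ":"],
         ["<indent>", "return", " ", "a", "+", "b"]]
out = line_level_mask(lines)
# Whitespace positions are always left untouched.
assert all(o == t for ol, nl in zip(lines, out)
           for o, t in zip(ol, nl) if o in WHITESPACE)
```

By excluding whitespace from the masking budget, the training signal concentrates on identifiers, operators, and expressions rather than on trivially predictable formatting tokens.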
Dense Retrieval Training
Before training the final late-interaction model, a dense retriever was trained using Sentence Transformers.
This dense model was trained on CodeSearchNet / Cornstack-style retrieval data. It maps each query or code document into a single 768-dimensional vector and scores pairs with cosine similarity.
This dense retriever served as an intermediate retrieval model before the final ColBERT-style late-interaction training stage.
Hard Negative Mining
Hard negatives were mined from the Owl code retrieval corpus using a strong external embedding model, Qwen3-Embedding-0.6B. These hard negatives were then used to train the late-interaction model.
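The mining logic can be sketched as follows; the real pipeline scores candidates with Qwen3-Embedding-0.6B embeddings, whereas this toy version uses 2-D vectors and dot-product similarity:

```python
def mine_hard_negatives(query_vec, corpus, gold_id, k=2):
    """Rank the corpus by embedding similarity to the query and keep the
    top-k non-gold documents as hard negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    ranked = sorted(corpus.items(), key=lambda kv: dot(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked if doc_id != gold_id][:k]

corpus = {
    "gold":      [1.0, 0.0],
    "near_miss": [0.9, 0.1],  # similar but wrong -> a hard negative
    "random":    [0.0, 1.0],
}
negs = mine_hard_negatives([1.0, 0.0], corpus, gold_id="gold")
assert negs[0] == "near_miss"
```

Negatives that are nearly as similar as the gold document force the retriever to learn fine-grained distinctions instead of relying on coarse topical overlap.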
PyLate / ColBERT Late-interaction Training
The final model was trained as a ColBERT-style late-interaction retriever using PyLate.
Unlike single-vector dense retrievers, this model represents each query and document as a sequence of token-level vectors. The relevance score is computed with MaxSim, allowing fine-grained matching between natural language query tokens and code tokens.
This helps the model capture local matches such as:
- API names
- function names
- identifiers
- code expressions
- syntax-adjacent semantic matches
- natural language comments and descriptions
Final Cornstack Refinement
Finally, the model was further trained for 3000 steps on Cornstack-style data.
This final stage was intended to:
- increase hard negative diversity
- improve retrieval robustness
- improve JavaScript retrieval performance
- further adapt the late-interaction model to high-quality code retrieval pairs
Intended Use
This model is intended for:
- natural-language-to-code retrieval
- semantic code search
- function-level code search
- chunk-level code retrieval
- reranking candidates from a first-stage retriever
- local repository search
- code assistant retrieval backends
Example queries:

```
function that validates an email address
parse json configuration file
create HTTP server middleware
sort list of objects by key
read csv file and return dataframe
calculate cosine similarity between vectors
```
Out-of-Scope Use
This model is not intended for:
- code generation
- instruction following
- general chat
- formal program verification
- security auditing
- license compliance checking
- proving semantic equivalence between programs
The model retrieves code that is semantically similar to a query, but it does not guarantee that the retrieved code is correct, secure, licensed appropriately, or production-ready.
Usage
Install PyLate:

```bash
pip install -U pylate
```
Indexing Documents
```python
from pylate import indexes, models

model = models.ColBERT(
    model_name_or_path="Shuu12121/OwlColBERT-v1",
)

index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,
)

documents_ids = ["1", "2", "3"]
documents = [
    "def validate_email(email): ...",
    "def load_json_config(path): ...",
    "def create_server(host, port): ...",
]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,
    show_progress_bar=True,
)

index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```
Loading an Existing Index
```python
from pylate import indexes

index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
)
```
Retrieval
```python
from pylate import indexes, models, retrieve

model = models.ColBERT(
    model_name_or_path="Shuu12121/OwlColBERT-v1",
)

index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
)

retriever = retrieve.ColBERT(index=index)

queries_embeddings = model.encode(
    ["function that validates an email address"],
    batch_size=32,
    is_query=True,
    show_progress_bar=True,
)

results = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,
)

print(results)
```
Reranking
```python
from pylate import rank, models

model = models.ColBERT(
    model_name_or_path="Shuu12121/OwlColBERT-v1",
)

queries = [
    "function that validates an email address",
]

documents = [
    [
        "def validate_email(email): ...",
        "def parse_json(path): ...",
        "def train_model(dataset): ...",
    ]
]

documents_ids = [
    ["email_validator", "json_parser", "trainer"]
]

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)

print(reranked_documents)
```
Evaluation
The model was evaluated on CodeSearchNetRetrieval using an MTEB-style retrieval evaluation setup.
This evaluation covers six programming languages:
- Python
- JavaScript
- Go
- Ruby
- Java
- PHP
The main metric is nDCG@10.
Summary
| Metric | Average |
|---|---|
| Accuracy@1 | 0.8445 |
| nDCG@10 | 0.9031 |
| MRR@10 | 0.8869 |
| MAP@100 | 0.8873 |
| Recall@10 | 0.9518 |
Per-language Results
| Language | Accuracy@1 | Recall@10 | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|---|---|
| Python | 0.895 | 0.987 | 0.9476 | 0.9343 | 0.9349 |
| JavaScript | 0.747 | 0.881 | 0.8187 | 0.7982 | 0.8010 |
| Go | 0.923 | 0.992 | 0.9630 | 0.9531 | 0.9533 |
| Ruby | 0.792 | 0.926 | 0.8640 | 0.8436 | 0.8451 |
| Java | 0.888 | 0.968 | 0.9313 | 0.9192 | 0.9197 |
| PHP | 0.822 | 0.957 | 0.8937 | 0.8729 | 0.8736 |
| Average | 0.8445 | 0.9518 | 0.9031 | 0.8869 | 0.8873 |
Full nDCG Results
| Language | nDCG@1 | nDCG@3 | nDCG@5 | nDCG@10 | nDCG@20 | nDCG@100 | nDCG@1000 |
|---|---|---|---|---|---|---|---|
| Python | 0.8950 | 0.9424 | 0.9457 | 0.9476 | 0.9494 | 0.9500 | 0.9501 |
| JavaScript | 0.7470 | 0.8036 | 0.8131 | 0.8187 | 0.8268 | 0.8305 | 0.8351 |
| Go | 0.9230 | 0.9586 | 0.9624 | 0.9630 | 0.9633 | 0.9640 | 0.9642 |
| Ruby | 0.7920 | 0.8508 | 0.8591 | 0.8640 | 0.8673 | 0.8725 | 0.8743 |
| Java | 0.8880 | 0.9239 | 0.9280 | 0.9313 | 0.9323 | 0.9345 | 0.9352 |
| PHP | 0.8220 | 0.8785 | 0.8874 | 0.8937 | 0.8949 | 0.8972 | 0.8996 |
| Average | 0.8445 | 0.8930 | 0.8993 | 0.9031 | 0.9057 | 0.9081 | 0.9097 |
Observations
The model performs particularly well on Go, Python, and Java, achieving nDCG@10 scores of 0.9630, 0.9476, and 0.9313, respectively.
The weakest language is JavaScript, with nDCG@10 of 0.8187. This suggests that JavaScript may benefit from additional language-specific hard negatives, more diverse JavaScript training data, or further refinement on JavaScript-heavy retrieval datasets.
Overall, the model achieves 0.9031 average nDCG@10 on multilingual CodeSearchNetRetrieval, indicating strong performance for natural-language-to-code retrieval.
Training Details
Training Dataset
The model was trained on code retrieval pairs with multiple hard negatives per query.
The training data included:
- `query`
- `document`
- `negative_0` ... `negative_99`
- `negative_scores`
Hard negatives were mined from the Owl code retrieval corpus using Qwen3-Embedding-0.6B.
Loss
`pylate.losses.cached_contrastive.CachedContrastive`
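CachedContrastive applies the gradient-caching technique of Gao et al. (cited below) on top of a standard in-batch contrastive (InfoNCE) objective, chunking the batch so large effective batch sizes fit in memory. A minimal per-query sketch of that underlying loss, with the caching machinery omitted:

```python
import math

def info_nce(pos_score, neg_scores, temperature=0.05):
    """Contrastive loss for one query: softmax cross-entropy in which the
    positive document competes against the negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]  # -log p(positive)

# A well-separated positive yields a near-zero loss; a confusable one does not.
assert info_nce(0.9, [0.1, 0.2]) < info_nce(0.9, [0.85, 0.88])
```

In training, the negatives for each query include both the mined hard negatives and the other in-batch documents, so larger batches directly increase the number of contrastive comparisons.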
Important Hyperparameters
| Hyperparameter | Value |
|---|---|
| Learning rate | 3e-6 |
| Max steps | 3000 |
| Batch size per device | 120 |
| Weight decay | 0.01 |
| Scheduler | cosine |
| Precision | bf16 |
| Gradient checkpointing | true |
| Optimizer | adamw_torch_fused |
Training Logs
The following table reports validation scores during the final 3000-step Cornstack refinement stage. These values were used for monitoring training progress and are not the final CodeSearchNetRetrieval test results reported above.
| Step | Python | Go | Java | JavaScript | PHP | Ruby |
|---|---|---|---|---|---|---|
| 100 | 0.9446 | 0.9211 | 0.8530 | 0.8031 | 0.8018 | 0.9055 |
| 500 | 0.9440 | 0.9256 | 0.8509 | 0.8122 | 0.8001 | 0.9026 |
| 1000 | 0.9454 | 0.9278 | 0.8495 | 0.8166 | 0.8002 | 0.9042 |
| 1500 | 0.9448 | 0.9275 | 0.8498 | 0.8172 | 0.8008 | 0.9080 |
| 2000 | 0.9447 | 0.9271 | 0.8504 | 0.8195 | 0.7994 | 0.9065 |
| 2500 | 0.9447 | 0.9272 | 0.8505 | 0.8175 | 0.7992 | 0.9072 |
| 3000 | 0.9454 | 0.9271 | 0.8495 | 0.8175 | 0.7998 | 0.9068 |
Recommendations
For best results:
- chunk long files before indexing
- keep function-level or class-level boundaries when possible
- preserve code formatting where possible
- include file paths or surrounding context if they are useful for retrieval
- use a first-stage retriever when indexing very large repositories
- use this model as a reranker when latency is important
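For Python sources, the function-level chunking suggested above can be sketched with the standard-library `ast` module (other languages would need their own parsers, e.g. tree-sitter grammars):

```python
import ast

def function_chunks(source: str):
    """Split a Python file into top-level function- and class-level chunks,
    so each indexed document is one logical unit."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(source, node))
    return chunks

src = '''\
def validate_email(email):
    return "@" in email

class Server:
    def start(self):
        pass
'''
chunks = function_chunks(src)
assert len(chunks) == 2 and chunks[0].startswith("def validate_email")
```

Each chunk can then be passed to `model.encode(..., is_query=False)` and indexed as its own document.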
Limitations
- The model is optimized for code retrieval and may not perform well as a general-purpose text embedding model.
- Performance may vary across programming languages and repository styles.
- Very short or ambiguous queries may produce unstable results.
- Long files should be chunked before indexing.
- The model retrieves semantically related code but does not verify correctness or security.
- The evaluation is focused on CodeSearchNetRetrieval; performance on other code intelligence tasks may differ.
- The model may inherit biases, artifacts, or licensing constraints from GitHub-collected code, technical documents, and retrieval datasets used during training.
Framework Versions
- Python: 3.10.12
- Sentence Transformers: 5.3.0
- PyLate: 1.4.0
- Transformers: 4.56.2
- PyTorch: 2.8.0+cu128
- Accelerate: 1.12.0
- Datasets: 3.6.0
- Tokenizers: 0.22.1
Citation
If you use this model, please cite the relevant libraries and methods.
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
```
PyLate
```bibtex
@inproceedings{DBLP:conf/cikm/ChaffinS25,
  author = {Antoine Chaffin and Rapha{\"{e}}l Sourty},
  title = {PyLate: Flexible Training and Retrieval for Late Interaction Models},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
  year = {2025},
  url = {https://github.com/lightonai/pylate},
  doi = {10.1145/3746252.3761608},
}
```
CachedContrastive
```bibtex
@misc{gao2021scaling,
  title = {Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
  author = {Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
  year = {2021},
  eprint = {2101.06983},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
}
```