CARMANIA

Overview

CARMANIA is a self-supervised genomic language model that augments standard next-token (NT) prediction with a Transition-Matrix (TM) loss. This auxiliary loss aligns the model's predicted token transitions with the empirical bigram statistics of each input sequence, helping the model capture higher-order dependencies and learn organism-specific sequence structure.
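The TM objective can be sketched roughly as follows. This is a hypothetical illustration, not CARMANIA's actual training code: the function name, the row-averaging of predicted distributions, and the choice of KL divergence are all assumptions; the paper's formulation may differ.

```python
import numpy as np

def tm_loss(pred_probs, tokens, eps=1e-8):
    """Hypothetical sketch of a Transition-Matrix (TM) loss.

    pred_probs: (L, 4) array, the model's next-nucleotide distribution
                at each position.
    tokens:     (L,) int array, the input sequence (0..3 nucleotides).
    Returns a KL divergence between the empirical bigram transition
    matrix of the sequence and the model's average predicted one.
    """
    pred_tm = np.zeros((4, 4))
    emp_tm = np.zeros((4, 4))
    # Row n of the predicted matrix: average predicted next-token
    # distribution over all positions whose current token is n.
    for n in range(4):
        mask = tokens[:-1] == n
        if mask.any():
            pred_tm[n] = pred_probs[:-1][mask].mean(axis=0)
    # Empirical bigram counts, row-normalized.
    for a, b in zip(tokens[:-1], tokens[1:]):
        emp_tm[a, b] += 1
    emp_tm /= np.clip(emp_tm.sum(axis=1, keepdims=True), 1, None)
    # Row-wise KL(empirical || predicted), averaged over rows.
    kl = emp_tm * (np.log(emp_tm + eps) - np.log(pred_tm + eps))
    return kl.sum(axis=1).mean()
```

When the model's predictions match the sequence's own bigram statistics exactly, this loss goes to zero.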

The model is designed for DNA sequence modeling and has shown strong performance on both in-domain and out-of-domain genomic tasks.


πŸ“š Pretraining Dataset

This model is trained on the Scorpio Gene-Taxa dataset, developed by Refahi et al. (2025). It includes:

  • Species: 2,046 bacterial and archaeal genomes
  • Gene types: 497 distinct genes
  • Total base pairs: ~580 million
  • Training fragments: 547,523 DNA segments
  • Fragment length: 4,000 bp (padded if shorter)

This dataset is designed to preserve evolutionary and functional diversity across microbial taxa.


🧬 Tokenization & Transition Matrix

  • Tokenizer: Single-nucleotide (A, T, C, G) level tokenization to retain fine-grained features such as SNPs.
  • Transition Matrix: For each input, we compute a normalized 4Γ—4 bigram transition matrix, where each row represents a probability distribution over the next nucleotide. This matrix serves as ground truth for the TM loss and guides the model to learn biologically meaningful dependencies.
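As an illustration, the empirical 4Γ—4 matrix for a sequence can be computed like this (a minimal sketch; the exact normalization and nucleotide ordering in CARMANIA's training code may differ):

```python
import numpy as np

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def transition_matrix(seq):
    """Row-normalized 4x4 bigram transition matrix for a DNA string.

    Row i is the probability distribution over the nucleotide that
    follows nucleotide i in this particular sequence.
    """
    counts = np.zeros((4, 4))
    for a, b in zip(seq, seq[1:]):
        counts[NUC[a], NUC[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for nucleotides that never occur.
    return counts / np.clip(row_sums, 1, None)

tm = transition_matrix("ACGTACGT")
```

For the repeating sequence above, each nucleotide is always followed by the same successor, so each observed row is a one-hot distribution.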


⚠️ Requirement: FlashAttention

This release is optimized to run with FlashAttention on Ampere/Ada/Hopper GPUs (e.g. A100, RTX 3090, H100).

If you have an A100 or another supported GPU, install FlashAttention:

pip install flash-attn --no-build-isolation

πŸ”§ Usage

from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained(
    "MsAlEhR/carmania-4k-scp-gene-taxa",
    trust_remote_code=True,
    torch_dtype=torch.float16,   # FlashAttention requires fp16/bf16
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "MsAlEhR/carmania-4k-scp-gene-taxa",
    trust_remote_code=True,
    model_max_length=4000,
)

inputs = tokenizer("ACGTAGGCTA", return_tensors="pt").to("cuda")

outputs = model(**inputs)
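To turn per-token hidden states into a single embedding per sequence, a common approach is mask-aware mean pooling. The sketch below uses a dummy tensor in place of the model output; `last_hidden_state` is an assumed output field (standard for Hugging Face encoder models), so check the remote code for the actual return type.

```python
import torch

def mean_pool(hidden, attention_mask):
    """Mean-pool token embeddings, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).float()  # (B, L, 1)
    summed = (hidden * mask).sum(dim=1)          # (B, D)
    counts = mask.sum(dim=1).clamp(min=1)        # (B, 1)
    return summed / counts

# Dummy stand-in for `outputs.last_hidden_state` (assumed field name).
hidden = torch.randn(1, 10, 256)
attention_mask = torch.ones(1, 10, dtype=torch.long)
emb = mean_pool(hidden, attention_mask)          # (1, 256)
```

With real model outputs, pass `outputs.last_hidden_state` and `inputs["attention_mask"]` instead of the dummy tensors.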
Model size: 83.9M parameters (Safetensors, F32).
