🏷️ SABER: Saudi Semantic Embedding Model (v0.1)


🧩 Summary

SABER-v0.1 (Saudi Arabic BERT Embeddings for Retrieval) is a state-of-the-art Saudi dialect semantic embedding model, fine-tuned from SA-BERT using MultipleNegativesRankingLoss (MNRL) and Matryoshka Representation Learning over a large, high-quality Saudi Triplet Dataset spanning 21 real-life Saudi domains.

SABER transforms a standard Masked Language Model (MLM) into a powerful semantic encoder capable of capturing deep contextual meaning across Najdi, Hijazi, Gulf-influenced, and mixed Saudi dialects. The model achieves state-of-the-art results on both long-paragraph STS evaluation and triplet margin separation, significantly outperforming strong baselines such as ATM2, GATE, LaBSE, mE5-base, MarBERT, and MiniLM.

🏗️ Architecture & Build Pipeline

SABER utilizes a rigorous two-stage optimization pipeline: first, we adapted MARBERT-V2 via Masked Language Modeling (MLM) on 500k Saudi sentences to create the domain-specialized SA-BERT, followed by deep semantic optimization using MultipleNegativesRankingLoss (MNRL) and Matryoshka Representation Learning on curated triplets to produce the final state-of-the-art embedding model.

SABER Training Pipeline
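
For illustration, the first stage (domain-adaptive MLM) can be sketched as follows with Hugging Face Transformers. This is a minimal sketch, not the released training script: the corpus file name, sequence length, and hyperparameters are assumptions.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from the public MARBERTv2 checkpoint (stage 1 of the pipeline).
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERTv2")

# Hypothetical corpus file: one Saudi-dialect sentence per line (the card mentions ~500k sentences).
corpus = load_dataset("text", data_files={"train": "saudi_sentences.txt"})["train"]
corpus = corpus.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
                    batched=True, remove_columns=["text"])

# Standard 15% token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="sa-bert-mlm", per_device_train_batch_size=32,
                         num_train_epochs=1, fp16=True, logging_steps=500)

Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator).train()
```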

SABER is designed for:

  • Semantic search
  • Retrieval-Augmented Generation (RAG)
  • Clustering
  • Intent detection
  • Semantic similarity
  • Document & paragraph embedding
  • Ranking and re-ranking systems
  • Multi-domain Saudi-language applications

This release is v0.1 — the first public version of SABER.

📌 Model Details

  • Model Name: SABER (Saudi Semantic Embedding)
  • Version: v0.1
  • Base Model: SA-BERT-V1 (MARBERTv2 further pre-trained on Saudi data)
  • Language: Arabic (Saudi Dialects: Najdi, Hijazi, Gulf)
  • Task: Sentence Embeddings, Semantic Similarity, Retrieval
  • Training Objective: MNRL + Matryoshka Loss
  • Embedding Dimension: 768
  • License: CC BY-NC 4.0 (see Commercial Use below)
  • Maintainer: Omartificial-Intelligence-Space

🧠 Motivation

Saudi dialect NLP remains an underdeveloped space. Most embeddings struggle with dialectal variation, idiomatic expressions, and multi-sentence reasoning. SABER was designed to fill this gap by:

  1. Training specifically on Saudi-dialect triplet data.
  2. Leveraging modern contrastive learning.
  3. Creating robust embeddings suitable for production and research.

This model is the result of extensive evaluation across STS, triplets, and domain-specific tests.


⚠️ Limitations

  1. Regional Scope: Performance may degrade on Levantine, Egyptian, or Maghrebi dialects.
  2. Scope: Embeddings focus on semantic similarity, not syntax or classification.
  3. Input Length: Long multi-document retrieval requires chunking (see the sketch below).
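
BERT-family encoders truncate inputs beyond their maximum sequence length, so long documents should be split into chunks before encoding. Below is a minimal sketch with a naive word-window chunker; the window and overlap sizes are arbitrary illustrative choices:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

def chunk_text(text, max_words=100, overlap=20):
    """Split a long document into overlapping word windows (illustrative only)."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

long_document = "نص طويل بالعامية السعودية عن مواضيع متعددة. " * 200  # placeholder long text
chunks = chunk_text(long_document)
chunk_embeddings = model.encode(chunks)  # one vector per chunk, indexed and retrieved individually
print(len(chunks), chunk_embeddings.shape)
```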

📚 Training Data

SABER was trained on Omartificial-Intelligence-Space/SaudiDialect-Triplet-21, which contains:

  • 2,964 triplets (Anchor, Positive, Negative)
  • 21 domains, including:
    • Travel, Food, Shopping, Work & Office, Education, Culture, Weather, Sports, Technology, Medical, Government, Social Events, Anthropology, etc.
  • Mixed Saudi dialect sentences (Najdi + Hijazi + Gulf)
  • Real-world conversational phrasing
  • Carefully curated positive/negative pairs

The dataset includes natural variations in:

  • Word choice
  • Dialect morphology
  • Sentence structure
  • Discourse context
  • Multi-sentence reasoning
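
A quick way to inspect this data, assuming the dataset can be loaded from the Hugging Face Hub under the repository name above; the split name and triplet column names are assumptions:

```python
from datasets import load_dataset

# Repo id from the section above; split and column names (anchor/positive/negative) are assumed.
triplets = load_dataset("Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train")
print(triplets)     # expected: ~2,964 rows spanning 21 domains
print(triplets[0])  # a single (anchor, positive, negative) example
```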

🔧 Training Methodology

SABER was fine-tuned using:

  1. MultipleNegativesRankingLoss (MNRL)

    • Transforms the embedding space so similar pairs cluster tightly.
    • Each batch uses in-batch negatives, dramatically improving separation.
  2. Matryoshka Representation Learning

    • Ensures embeddings remain meaningful across different vector truncation sizes.
  3. Triplet Ranking Optimization

    • Anchor–Positive similarity maximized.
    • Anchor–Negative similarity minimized.
    • Margin-based structure preserved.
  4. Optimizer & Hyperparameters

| Hyperparameter    | Value             |
|-------------------|-------------------|
| Batch Size        | 16                |
| Epochs            | 3                 |
| Loss              | MNRL + Matryoshka |
| Precision         | FP16              |
| Negative Sampling | In-batch          |
| Gradient Clip     | Stable defaults   |
| Warmup Ratio      | 0.1               |
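
A minimal sketch of this fine-tuning setup using the sentence-transformers trainer and the hyperparameters above. The SA-BERT checkpoint id, the dataset column layout, and the remaining trainer arguments are assumptions, not the released training script:

```python
from datasets import load_dataset
from sentence_transformers import (SentenceTransformer, SentenceTransformerTrainer,
                                   SentenceTransformerTrainingArguments)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# Stage-2 starting point: the domain-adapted SA-BERT encoder (checkpoint id is an assumption).
model = SentenceTransformer("Omartificial-Intelligence-Space/SA-BERT-V1")

# Triplet dataset; anchor/positive/negative column layout is assumed.
train_dataset = load_dataset("Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train")

# MNRL uses in-batch negatives; MatryoshkaLoss wraps it (dims as in the released loss config).
loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model), matryoshka_dims=[768])

args = SentenceTransformerTrainingArguments(
    output_dir="saber-v0.1",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    warmup_ratio=0.1,
    fp16=True,
)

SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss).train()
```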

🧪 Evaluation

SABER was evaluated on two benchmarks:

A) STS Evaluation (Saudi Paragraph-Level Dataset)

Dataset: 1,000 sentence pairs generated in Saudi dialect, with gold similarity scores on a 0–5 scale.

| Metric   | Score  |
|----------|--------|
| Pearson  | 0.9189 |
| Spearman | 0.9045 |
| MAE      | 1.69   |
| MSE      | 3.82   |

These results surpass ATM2, GATE, LaBSE, MarBERT, mE5-base, and MiniLM.
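
For reference, a minimal sketch of how these correlations can be computed, assuming parallel lists of sentence pairs with gold 0–5 scores; the two pairs below are illustrative stand-ins for the full 1,000-sample set:

```python
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Illustrative pairs with hypothetical gold scores (0-5 scale); the real benchmark has 1,000 pairs.
sents1 = ["ودي أسافر للرياض الأسبوع الجاي", "الجو اليوم حار مرة"]
sents2 = ["أفكر أروح الرياض قريب", "درجة الحرارة مرتفعة اليوم"]
gold   = [4.5, 4.0]

emb1 = model.encode(sents1, convert_to_tensor=True)
emb2 = model.encode(sents2, convert_to_tensor=True)
pred = util.cos_sim(emb1, emb2).diagonal().cpu().tolist()  # pairwise cosine similarities

print("Pearson: ", pearsonr(gold, pred)[0])
print("Spearman:", spearmanr(gold, pred)[0])
```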

B) Triplet Evaluation

Triplets were derived from the STS data by treating pairs with score ≥ 3 as positives and pairs with score ≤ 1 as negatives.

| Metric         | Score  |
|----------------|--------|
| Basic Accuracy | 0.9899 |
| Margin > 0.05  | 0.9845 |
| Margin > 0.10  | 0.9781 |
| Margin > 0.20  | 0.9609 |

Excellent separation across strict thresholds.
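
A sketch of how these margin accuracies can be computed: a triplet counts as correct at margin m when the cosine similarity of (anchor, positive) exceeds that of (anchor, negative) by more than m. The single triplet below is illustrative only:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Illustrative (anchor, positive, negative) triplets.
triplets = [
    ("ودي أحجز تذكرة للرياض", "ناوي أشتري تذكرة طيران للرياض", "وش أفضل مطعم بجدة؟"),
]

anchors, positives, negatives = map(list, zip(*triplets))
a = model.encode(anchors, convert_to_tensor=True)
p = model.encode(positives, convert_to_tensor=True)
n = model.encode(negatives, convert_to_tensor=True)

sim_ap = util.cos_sim(a, p).diagonal()  # anchor-positive similarities
sim_an = util.cos_sim(a, n).diagonal()  # anchor-negative similarities

for margin in (0.0, 0.05, 0.10, 0.20):
    accuracy = ((sim_ap - sim_an) > margin).float().mean().item()
    print(f"accuracy @ margin {margin:.2f}: {accuracy:.4f}")
```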


🔍 Usage Example

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the model
model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Define sentences (Saudi dialect)
s1 = "ودي أسافر للرياض الأسبوع الجاي"
s2 = "أفكر أروح الرياض قريب عشان مشوار مهم"

# Encode
e1 = model.encode([s1])
e2 = model.encode([s2])

# Calculate similarity
sim = cosine_similarity(e1, e2)[0][0]
print("Cosine Similarity:", sim)
```

Training Details

Training Dataset

  • Dataset: csv
  • Size: 2,964 training samples
  • Columns: text1 and text2
  • Approximate statistics based on the first 1000 samples:

    |         | text1                                             | text2                                             |
    |---------|---------------------------------------------------|---------------------------------------------------|
    | type    | string                                            | string                                            |
    | details | min: 5 tokens, mean: 10.36 tokens, max: 22 tokens | min: 4 tokens, mean: 10.28 tokens, max: 19 tokens |

  • Samples:

    | text1 | text2 |
    |-------|-------|
    | هل فيه رحلات بحرية للأطفال في جدة؟ | ودي أعرف عن جولات بحرية للأطفال في جدة |
    | ودي أحجز تذكرة طيران للرياض الأسبوع الجاي | ناوي أشتري تذكرة للرياض الأسبوع الجاي |
    | عطوني أفضل فندق قريب من مطار جدة | أبي فندق قريب من المطار |
  • Loss: MatryoshkaLoss with these parameters:

    ```json
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768
        ],
        "matryoshka_weights": [
            1
        ],
        "n_dims_per_step": -1
    }
    ```
    

📌 Commercial Use

Commercial use of this model is not permitted under the CC BY-NC 4.0 license.
For commercial licensing, partnerships, or enterprise use, please contact:

📩 [email protected]

Citation

If you use this model in academic work, please cite:

```bibtex
@inproceedings{nacar-saber-2025,
    title = "Saudi Arabic Embedding Model for Semantic Similarity and Retrieval",
    author = "Nacar, Omer",
    year = "2025",
    url = "https://huggingface.co/Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B",
}
```

Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

MatryoshkaLoss

```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```