🏷️ SABER: Saudi Semantic Embedding Model (v0.1)


🧩 Summary

SABER-v0.1 (Saudi Arabic BERT Embeddings for Retrieval) is a state-of-the-art Saudi dialect semantic embedding model, fine-tuned from SA-BERT using MultipleNegativesRankingLoss (MNRL) and Matryoshka Representation Learning over a large, high-quality Saudi Triplet Dataset spanning 21 real-life Saudi domains.

SABER transforms a standard Masked Language Model (MLM) into a powerful semantic encoder capable of capturing deep contextual meaning across Najdi, Hijazi, Gulf-influenced, and mixed Saudi dialects. The model achieves state-of-the-art results on both long-paragraph STS evaluation and triplet margin separation, significantly outperforming strong baselines such as ATM2, GATE, LaBSE, mE5-base, MarBERT, and MiniLM.

🏗️ Architecture & Build Pipeline

SABER utilizes a rigorous two-stage optimization pipeline: first, we adapted MARBERT-V2 via Masked Language Modeling (MLM) on 500k Saudi sentences to create the domain-specialized SA-BERT, followed by deep semantic optimization using MultipleNegativesRankingLoss (MNRL) and Matryoshka Representation Learning on curated triplets to produce the final state-of-the-art embedding model.

SABER Training Pipeline
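
For illustration, the first stage (domain-adaptive MLM) can be sketched as follows with Hugging Face Transformers. This is a minimal sketch, not the released training script: the corpus file name, sequence length, and hyperparameters are assumptions.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from the public MARBERTv2 checkpoint (stage 1 of the pipeline).
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERTv2")

# Hypothetical corpus file: one Saudi-dialect sentence per line (the card mentions ~500k sentences).
corpus = load_dataset("text", data_files={"train": "saudi_sentences.txt"})["train"]
corpus = corpus.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
                    batched=True, remove_columns=["text"])

# Standard 15% token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="sa-bert-mlm", per_device_train_batch_size=32,
                         num_train_epochs=1, fp16=True, logging_steps=500)

Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator).train()
```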

SABER is designed for:

  • Semantic search
  • Retrieval-Augmented Generation (RAG)
  • Clustering
  • Intent detection
  • Semantic similarity
  • Document & paragraph embedding
  • Ranking and re-ranking systems
  • Multi-domain Saudi-language applications

This release is v0.1 — the first public version of SABER.

📌 Model Details

  • Model Name: SABER (Saudi Semantic Embedding)
  • Version: v0.1
  • Base Model: SA-BERT-V1 (MARBERTv2 further pre-trained on Saudi data)
  • Language: Arabic (Saudi Dialects: Najdi, Hijazi, Gulf)
  • Task: Sentence Embeddings, Semantic Similarity, Retrieval
  • Training Objective: MNRL + Matryoshka Loss
  • Embedding Dimension: 768
  • License: CC BY-NC 4.0 (see Commercial Use below)
  • Maintainer: Omartificial-Intelligence-Space

🧠 Motivation

Saudi dialect NLP remains an underdeveloped space. Most embeddings struggle with dialectal variation, idiomatic expressions, and multi-sentence reasoning. SABER was designed to fill this gap by:

  1. Training specifically on Saudi-dialect triplet data.
  2. Leveraging modern contrastive learning.
  3. Creating robust embeddings suitable for production and research.

This model is the result of extensive evaluation across STS, triplets, and domain-specific tests.


⚠️ Limitations

  1. Regional Scope: Performance may degrade on Levantine, Egyptian, or Maghrebi dialects.
  2. Scope: Embeddings focus on semantic similarity, not syntax or classification.
  3. Input Length: Long multi-document retrieval requires chunking (see the sketch below).
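
BERT-family encoders truncate inputs beyond their maximum sequence length, so long documents should be split into chunks before encoding. Below is a minimal sketch with a naive word-window chunker; the window and overlap sizes are arbitrary illustrative choices:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

def chunk_text(text, max_words=100, overlap=20):
    """Split a long document into overlapping word windows (illustrative only)."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

long_document = "نص طويل بالعامية السعودية عن مواضيع متعددة. " * 200  # placeholder long text
chunks = chunk_text(long_document)
chunk_embeddings = model.encode(chunks)  # one vector per chunk, indexed and retrieved individually
print(len(chunks), chunk_embeddings.shape)
```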

📚 Training Data

SABER was trained on Omartificial-Intelligence-Space/SaudiDialect-Triplet-21, which contains:

  • 2,964 triplets (Anchor, Positive, Negative)
  • 21 domains, including:
    • Travel, Food, Shopping, Work & Office, Education, Culture, Weather, Sports, Technology, Medical, Government, Social Events, Anthropology, etc.
  • Mixed Saudi dialect sentences (Najdi + Hijazi + Gulf)
  • Real-world conversational phrasing
  • Carefully curated positive/negative pairs

The dataset includes natural variations in:

  • Word choice
  • Dialect morphology
  • Sentence structure
  • Discourse context
  • Multi-sentence reasoning
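
A quick way to inspect this data, assuming the dataset can be loaded from the Hugging Face Hub under the repository name above; the split name and triplet column names are assumptions:

```python
from datasets import load_dataset

# Repo id from the section above; split and column names (anchor/positive/negative) are assumed.
triplets = load_dataset("Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train")
print(triplets)     # expected: ~2,964 rows spanning 21 domains
print(triplets[0])  # a single (anchor, positive, negative) example
```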

🔧 Training Methodology

SABER was fine-tuned using:

  1. MultipleNegativesRankingLoss (MNRL)

    • Transforms the embedding space so similar pairs cluster tightly.
    • Each batch uses in-batch negatives, dramatically improving separation.
  2. Matryoshka Representation Learning

    • Ensures embeddings remain meaningful across different vector truncation sizes.
  3. Triplet Ranking Optimization

    • Anchor–Positive similarity maximized.
    • Anchor–Negative similarity minimized.
    • Margin-based structure preserved.
  4. Optimizer & Hyperparameters

| Hyperparameter    | Value             |
|-------------------|-------------------|
| Batch Size        | 16                |
| Epochs            | 3                 |
| Loss              | MNRL + Matryoshka |
| Precision         | FP16              |
| Negative Sampling | In-batch          |
| Gradient Clip     | Stable defaults   |
| Warmup Ratio      | 0.1               |
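
A minimal sketch of this fine-tuning setup using the sentence-transformers trainer and the hyperparameters above. The SA-BERT checkpoint id, the dataset column layout, and the remaining trainer arguments are assumptions, not the released training script:

```python
from datasets import load_dataset
from sentence_transformers import (SentenceTransformer, SentenceTransformerTrainer,
                                   SentenceTransformerTrainingArguments)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# Stage-2 starting point: the domain-adapted SA-BERT encoder (checkpoint id is an assumption).
model = SentenceTransformer("Omartificial-Intelligence-Space/SA-BERT-V1")

# Triplet dataset; anchor/positive/negative column layout is assumed.
train_dataset = load_dataset("Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train")

# MNRL uses in-batch negatives; MatryoshkaLoss wraps it (dims as in the released loss config).
loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model), matryoshka_dims=[768])

args = SentenceTransformerTrainingArguments(
    output_dir="saber-v0.1",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    warmup_ratio=0.1,
    fp16=True,
)

SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss).train()
```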

🧪 Evaluation

SABER was evaluated on two benchmarks:

A) STS Evaluation (Saudi Paragraph-Level Dataset)

Dataset: 1,000 sentence pairs generated in Saudi dialect, with gold similarity scores on a 0–5 scale.

| Metric   | Score  |
|----------|--------|
| Pearson  | 0.9189 |
| Spearman | 0.9045 |
| MAE      | 1.69   |
| MSE      | 3.82   |

These results surpass ATM2, GATE, LaBSE, MarBERT, mE5-base, and MiniLM.
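
For reference, a minimal sketch of how these correlations can be computed, assuming parallel lists of sentence pairs with gold 0–5 scores; the two pairs below are illustrative stand-ins for the full 1,000-sample set:

```python
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Illustrative pairs with hypothetical gold scores (0-5 scale); the real benchmark has 1,000 pairs.
sents1 = ["ودي أسافر للرياض الأسبوع الجاي", "الجو اليوم حار مرة"]
sents2 = ["أفكر أروح الرياض قريب", "درجة الحرارة مرتفعة اليوم"]
gold   = [4.5, 4.0]

emb1 = model.encode(sents1, convert_to_tensor=True)
emb2 = model.encode(sents2, convert_to_tensor=True)
pred = util.cos_sim(emb1, emb2).diagonal().cpu().tolist()  # pairwise cosine similarities

print("Pearson: ", pearsonr(gold, pred)[0])
print("Spearman:", spearmanr(gold, pred)[0])
```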

B) Triplet Evaluation

Triplets were derived from the STS data by treating pairs with score ≥ 3 as positives and pairs with score ≤ 1 as negatives.

| Metric         | Score  |
|----------------|--------|
| Basic Accuracy | 0.9899 |
| Margin > 0.05  | 0.9845 |
| Margin > 0.10  | 0.9781 |
| Margin > 0.20  | 0.9609 |

Excellent separation across strict thresholds.
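
A sketch of how these margin accuracies can be computed: a triplet counts as correct at margin m when the cosine similarity of (anchor, positive) exceeds that of (anchor, negative) by more than m. The single triplet below is illustrative only:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Illustrative (anchor, positive, negative) triplets.
triplets = [
    ("ودي أحجز تذكرة للرياض", "ناوي أشتري تذكرة طيران للرياض", "وش أفضل مطعم بجدة؟"),
]

anchors, positives, negatives = map(list, zip(*triplets))
a = model.encode(anchors, convert_to_tensor=True)
p = model.encode(positives, convert_to_tensor=True)
n = model.encode(negatives, convert_to_tensor=True)

sim_ap = util.cos_sim(a, p).diagonal()  # anchor-positive similarities
sim_an = util.cos_sim(a, n).diagonal()  # anchor-negative similarities

for margin in (0.0, 0.05, 0.10, 0.20):
    accuracy = ((sim_ap - sim_an) > margin).float().mean().item()
    print(f"accuracy @ margin {margin:.2f}: {accuracy:.4f}")
```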


🔍 Usage Example

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the model
model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Define sentences (Saudi dialect)
s1 = "ودي أسافر للرياض الأسبوع الجاي"
s2 = "أفكر أروح الرياض قريب عشان مشوار مهم"

# Encode
e1 = model.encode([s1])
e2 = model.encode([s2])

# Calculate similarity
sim = cosine_similarity(e1, e2)[0][0]
print("Cosine Similarity:", sim)
```

Training Details

Training Dataset

  • Dataset: csv
  • Size: 2,964 training samples
  • Columns: text1 and text2
  • Approximate statistics based on the first 1000 samples:

    |         | text1                                             | text2                                             |
    |---------|---------------------------------------------------|---------------------------------------------------|
    | type    | string                                            | string                                            |
    | details | min: 5 tokens, mean: 10.36 tokens, max: 22 tokens | min: 4 tokens, mean: 10.28 tokens, max: 19 tokens |

  • Samples:

    | text1 | text2 |
    |-------|-------|
    | هل فيه رحلات بحرية للأطفال في جدة؟ | ودي أعرف عن جولات بحرية للأطفال في جدة |
    | ودي أحجز تذكرة طيران للرياض الأسبوع الجاي | ناوي أشتري تذكرة للرياض الأسبوع الجاي |
    | عطوني أفضل فندق قريب من مطار جدة | أبي فندق قريب من المطار |
  • Loss: MatryoshkaLoss with these parameters:

    ```json
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768
        ],
        "matryoshka_weights": [
            1
        ],
        "n_dims_per_step": -1
    }
    ```
    

📌 Commercial Use

Commercial use of this model is not permitted under the CC BY-NC 4.0 license.
For commercial licensing, partnerships, or enterprise use, please contact:

📩 [email protected]

Citation

If you use this model in academic work, please cite:

```bibtex
@inproceedings{nacar-saber-2025,
    title = "Saudi Arabic Embedding Model for Semantic Similarity and Retrieval",
    author = "Nacar, Omer",
    year = "2025",
    url = "https://huggingface.co/Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B",
}
```

Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

MatryoshkaLoss

```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```