Kazakh Hybrid Duplicate Detector

Overview

This repository contains a hybrid text similarity model for the Kazakh language, designed to detect duplicate and near-duplicate texts, including paraphrases.

The model combines lexical, statistical, and semantic similarity signals. At the reported decision threshold it reaches a precision of 1.00 and an F1-score of 0.848, and it is cheaper to run than transformer-based models.

Model Architecture

The hybrid similarity score is computed as a weighted fusion of:

MinHash Jaccard similarity over word n-grams
Latent Semantic Analysis (LSA) using TF-IDF + Truncated SVD
Latent Dirichlet Allocation (LDA) topic similarity
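
As a rough illustration of the first component, the sketch below computes the exact Jaccard similarity over word n-gram sets in plain Python. MinHash (for example via datasketch) estimates this same quantity from compact signatures rather than computing it exactly. The n-gram size of 2, lowercasing, and whitespace tokenization are assumptions made for the example, not the repository's documented settings, and the function names are hypothetical.

```python
def word_ngrams(text, n=2):
    """Return the set of word n-grams of a text (whitespace tokenization is assumed)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Texts shorter than n words contribute a single, shorter tuple.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=2):
    """Exact Jaccard similarity between the word n-gram sets of two texts."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # convention for this sketch: two empty texts count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


original = "бүгін ауа райы өте жақсы"
modified = "бүгін ауа райы өте суық"
print(jaccard(original, modified))  # 3 shared bigrams out of 5 distinct -> 0.6
```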

Final score:

S = w_jaccard · Jaccard + w_lsa · Cosine_LSA + w_lda · Cosine_LDA
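
The fusion formula can be sketched in a few lines of Python. The weight values below are hypothetical and chosen only so the arithmetic is easy to follow, since this card does not state the actual weights. Treating a score greater than or equal to the threshold as a duplicate is also an assumption.

```python
def hybrid_score(jaccard, cos_lsa, cos_lda, weights):
    """Weighted fusion: S = w_jaccard*Jaccard + w_lsa*Cosine_LSA + w_lda*Cosine_LDA."""
    return (weights["jaccard"] * jaccard
            + weights["lsa"] * cos_lsa
            + weights["lda"] * cos_lda)


# Hypothetical weights for illustration only; the real values are not given here.
example_weights = {"jaccard": 0.5, "lsa": 0.25, "lda": 0.25}

score = hybrid_score(jaccard=0.5, cos_lsa=1.0, cos_lda=0.5, weights=example_weights)

# 0.7 is the decision threshold reported under Evaluation Results;
# using ">=" rather than ">" is an assumption of this sketch.
is_duplicate = score >= 0.7
print(score, is_duplicate)  # 0.625 False
```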

Training Data

The model was evaluated using the KazakhTextDuplicates dataset:

Dataset: Arailym-tleubayeva/KazakhTextDuplicates
Each sample consists of an original text and its modified (paraphrased) version.

Evaluation Results

At a decision threshold of 0.7, the model achieves:

Precision: 1.00
Recall: 0.736
F1-score: 0.848

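At this threshold, pairs flagged as duplicates are almost always true duplicates, while about 26% of true duplicate pairs are missed. Inference remains inexpensive compared to transformer-based models.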
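
The reported F1-score is the harmonic mean of the reported precision and recall. The short check below reproduces it from those two numbers; the helper name is hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 1.00, 0.736  # values reported at threshold 0.7
f1 = f1_from_precision_recall(precision, recall)
print(round(f1, 3))  # 0.848
```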

Intended Use

Plagiarism detection
Duplicate content detection
Text similarity analysis for Kazakh NLP
Educational and research applications

Limitations

Designed for Kazakh language only
Requires preprocessing consistent with training setup
Not a generative model

Implementation

The model is implemented in Python using:

scikit-learn
datasketch
NumPy / Pandas

Authors

Svitlana Biloshchytska¹²*, Arailym Tleubayeva³*, Oleksandr Kuchanskyi¹⁴⁵*,
Andrii Biloshchytskyi¹², Yurii Andrashko⁶, Sapar Toxanov⁷,
Aidos Mukhatayev⁸, Saltanat Sharipova⁹

¹ Department of Computational and Data Science, Astana IT University, Kazakhstan
² Kyiv National University of Construction and Architecture, Ukraine
³ Department of Computer Engineering, Astana IT University, Kazakhstan
⁴ Uzhhorod National University, Ukraine
⁵ Igor Sikorsky Kyiv Polytechnic Institute, Ukraine
⁶ Uzhhorod National University, Ukraine
⁷ Astana IT University, Kazakhstan
⁸ Astana IT University, Kazakhstan
⁹ Astana IT University, Kazakhstan

Citation

Biloshchytska, S., Tleubayeva, A., Kuchanskyi, O., Biloshchytskyi, A., Andrashko, Y., Toxanov, S., Mukhatayev, A., & Sharipova, S. (2025). Text Similarity Detection in Agglutinative Languages: A Case Study of Kazakh Using Hybrid N-Gram and Semantic Models. Applied Sciences, 15(12), 6707. https://doi.org/10.3390/app15126707

How to Download

Using huggingface_hub

from huggingface_hub import hf_hub_download

repo_id = "Arailym-tleubayeva/kazakh-hybrid-duplicate-detector"

# Each call downloads one file (or reuses the local cache) and returns its local path.
tfidf_path = hf_hub_download(repo_id, "tfidf_vectorizer.joblib")  # TF-IDF vectorizer
lsa_path   = hf_hub_download(repo_id, "lsa_svd.joblib")           # Truncated SVD (LSA)
lda_path   = hf_hub_download(repo_id, "lda_model.joblib")         # LDA topic model
meta_path  = hf_hub_download(repo_id, "meta.json")                # model metadata
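
Once the files are downloaded, they can presumably be loaded as sketched below. This assumes the `.joblib` files are scikit-learn objects serialized with joblib (as the file extensions suggest) and that `meta.json` is ordinary JSON. The helper names are hypothetical, and the contents of `meta.json` are not documented in this card, so the sketch only parses the file without assuming any keys.

```python
import json


def load_meta(meta_path):
    """Parse meta.json and return its contents; no particular keys are assumed."""
    with open(meta_path, encoding="utf-8") as fh:
        return json.load(fh)


def load_components(tfidf_path, lsa_path, lda_path):
    """Load the three serialized components (assumes joblib-serialized objects)."""
    import joblib  # imported lazily so load_meta works without joblib installed

    return {
        "tfidf": joblib.load(tfidf_path),
        "lsa": joblib.load(lsa_path),
        "lda": joblib.load(lda_path),
    }


# Usage, with the paths returned by hf_hub_download:
# components = load_components(tfidf_path, lsa_path, lda_path)
# meta = load_meta(meta_path)
# print(meta)
```

Inspecting the parsed `meta.json` first is a reasonable way to find out which settings the saved components expect before computing any scores.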
