Kazakh Hybrid Duplicate Detector
Overview
This repository contains a hybrid text similarity model for the Kazakh language, designed to detect duplicate and near-duplicate texts, including paraphrases.
The model combines lexical, statistical, and semantic approaches. It aims for high accuracy at a lower computational cost than transformer-based models.
Model Architecture
The hybrid similarity score is computed as a weighted fusion of:
- MinHash Jaccard similarity over word n-grams
- Latent Semantic Analysis (LSA) using TF-IDF + Truncated SVD
- Latent Dirichlet Allocation (LDA) topic similarity
Final score:
S = w_jaccard · Jaccard + w_lsa · Cosine_LSA + w_lda · Cosine_LDA
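The fusion above can be sketched in a few lines of plain Python. This is only an illustration: the weights `0.4 / 0.3 / 0.3`, the bigram size, and the helper names `word_ngrams`, `jaccard_similarity`, and `hybrid_score` are assumptions, not values or functions taken from this repository. The Jaccard function here is the exact set-based version, which MinHash approximates; the LSA and LDA cosine values in the usage lines are made-up inputs.

```python
def word_ngrams(text, n=2):
    """Return the set of word n-grams (as tuples) of a whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|; MinHash estimates this value."""
    union = a | b
    if not union:
        return 0.0  # both sets empty: treat as no evidence of overlap
    return len(a & b) / len(union)


def hybrid_score(jaccard, cos_lsa, cos_lda, w_jaccard=0.4, w_lsa=0.3, w_lda=0.3):
    """S = w_jaccard * Jaccard + w_lsa * Cosine_LSA + w_lda * Cosine_LDA.

    The default weights are placeholders chosen for this example only.
    """
    return w_jaccard * jaccard + w_lsa * cos_lsa + w_lda * cos_lda


# Usage: the two sentences share 2 of 5 distinct bigrams, so Jaccard = 0.4.
a = word_ngrams("мен бүгін кітап оқыдым")
b = word_ngrams("мен бүгін кітап сатып алдым")
j = jaccard_similarity(a, b)

# Hypothetical LSA / LDA cosine similarities for the same pair.
score = hybrid_score(j, cos_lsa=0.9, cos_lda=0.8)
is_duplicate = score >= 0.7  # 0.7 is the decision threshold reported below
```

If the weights sum to 1 and each component lies in [0, 1], the fused score also stays in [0, 1], which makes a fixed threshold straightforward to interpret.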
Training Data
The model was evaluated using the KazakhTextDuplicates dataset:
- Dataset: Arailym-tleubayeva/KazakhTextDuplicates
- Each sample consists of an original text and its modified (paraphrased) version.
Evaluation Results
At a decision threshold of 0.7, the model achieves:
- Precision: 1.00
- Recall: 0.736
- F1-score: 0.848
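A precision of 1.00 means that the pairs flagged as duplicates were, up to rounding, all true duplicates, while about 26% of the duplicate pairs scored below the threshold and were missed. Inference cost remains low.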
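As a quick consistency check, the reported F1-score follows from the reported precision and recall through the standard harmonic-mean formula. The snippet below only redoes that arithmetic; it does not rerun the evaluation.

```python
precision = 1.00
recall = 0.736

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))  # 0.848
```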
Intended Use
- Plagiarism detection
- Duplicate content detection
- Text similarity analysis for Kazakh NLP
- Educational and research applications
Limitations
- Designed for the Kazakh language only
- Requires preprocessing consistent with the training setup
- Not a generative model
Implementation
The model is implemented in Python using:
- scikit-learn
- datasketch
- NumPy / Pandas
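To show how the LSA component can be assembled from these libraries, here is a minimal scikit-learn sketch of TF-IDF followed by Truncated SVD and cosine similarity. The toy corpus, the default `TfidfVectorizer` settings, and `n_components=2` are assumptions made for the example; the configuration actually used for the released model is not described in this README.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: the first two sentences overlap, the third is unrelated.
corpus = [
    "мен бүгін кітап оқыдым",
    "мен бүгін кітап сатып алдым",
    "ол кеше мектепке барды",
]

# Step 1: sparse TF-IDF document-term matrix.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

# Step 2: project documents into a low-dimensional latent space (LSA).
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)

# Step 3: pairwise cosine similarity between the latent document vectors.
sims = cosine_similarity(Z)

print(sims[0, 1], sims[0, 2])  # overlapping pair vs. unrelated pair
```

On a corpus this small the latent space is very coarse, so the values are only illustrative; a realistic setup would fit the vectorizer and SVD on a much larger collection.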
Authors
Svitlana Biloshchytska¹²*, Arailym Tleubayeva³*, Oleksandr Kuchanskyi¹⁴⁵*,
Andrii Biloshchytskyi¹², Yurii Andrashko⁶, Sapar Toxanov⁷,
Aidos Mukhatayev⁸, Saltanat Sharipova⁹
¹ Department of Computational and Data Science, Astana IT University, Kazakhstan
² Kyiv National University of Construction and Architecture, Ukraine
³ Department of Computer Engineering, Astana IT University, Kazakhstan
⁴ Uzhhorod National University, Ukraine
⁵ Igor Sikorsky Kyiv Polytechnic Institute, Ukraine
⁶ Uzhhorod National University, Ukraine
⁷ Astana IT University, Kazakhstan
⁸ Astana IT University, Kazakhstan
⁹ Astana IT University, Kazakhstan
Citation
Biloshchytska, S., Tleubayeva, A., Kuchanskyi, O., Biloshchytskyi, A., Andrashko, Y., Toxanov, S., Mukhatayev, A., & Sharipova, S. (2025). Text Similarity Detection in Agglutinative Languages: A Case Study of Kazakh Using Hybrid N-Gram and Semantic Models. Applied Sciences, 15(12), 6707. https://doi.org/10.3390/app15126707
How to Download
Using huggingface_hub
from huggingface_hub import hf_hub_download
# Repository that hosts the serialized model artifacts
repo_id = "Arailym-tleubayeva/kazakh-hybrid-duplicate-detector"
# Each call downloads one file (or reuses the local cache) and returns its local path
tfidf_path = hf_hub_download(repo_id, "tfidf_vectorizer.joblib")
lsa_path = hf_hub_download(repo_id, "lsa_svd.joblib")
lda_path = hf_hub_download(repo_id, "lda_model.joblib")
meta_path = hf_hub_download(repo_id, "meta.json")
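Once the files are downloaded, they still need to be loaded into memory. The sketch below assumes that the `.joblib` files were written with `joblib.dump` (as the extension suggests) and that `meta.json` is ordinary JSON; the helper name `load_artifacts` is hypothetical, and the keys stored in `meta.json` are not documented here, so the dictionary is returned as-is.

```python
import json

import joblib


def load_artifacts(tfidf_path, lsa_path, lda_path, meta_path):
    """Load the three serialized components and the metadata file.

    Assumes the .joblib files were created with joblib.dump and that
    meta.json is plain JSON; neither detail is confirmed by the README.
    """
    tfidf = joblib.load(tfidf_path)
    lsa = joblib.load(lsa_path)
    lda = joblib.load(lda_path)
    with open(meta_path, encoding="utf-8") as f:
        meta = json.load(f)
    return tfidf, lsa, lda, meta


# Usage, with the paths returned by hf_hub_download above:
# tfidf, lsa, lda, meta = load_artifacts(tfidf_path, lsa_path, lda_path, meta_path)
```

Because joblib files are pickle-based, they should only be loaded from sources you trust, and ideally with a scikit-learn version compatible with the one used to save them.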