Kazakh Hybrid Duplicate Detector

Overview

This repository contains a hybrid text similarity model for the Kazakh language, designed to detect duplicate and near-duplicate texts, including paraphrases.

The model combines lexical, statistical, and semantic similarity signals. On the evaluation set it reaches an F1-score of 0.848 at a decision threshold of 0.7, using only classical components that are computationally cheaper than transformer-based models.

Model Architecture

The hybrid similarity score is computed as a weighted fusion of:

  • MinHash Jaccard similarity over word n-grams
  • Latent Semantic Analysis (LSA) using TF-IDF + Truncated SVD
  • Latent Dirichlet Allocation (LDA) topic similarity
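The first component can be illustrated with a small, dependency-free sketch of MinHash over word n-grams. This is not the repository's code, which uses datasketch. The helper names, the bigram size, the number of hash functions, and the example sentences are all assumptions chosen for illustration.

```python
import hashlib


def word_ngrams(text, n=2):
    """Return the set of word n-grams of a whitespace-tokenized text (n=2 is an assumption)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(shingle, seed):
    """64-bit hash of a shingle; the seed selects one of the hash functions."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(shingles, num_perm=128):
    """Keep the minimum hash value under each of num_perm seeded hash functions."""
    if not shingles:
        return []
    return [min(_hash(s, seed) for s in shingles) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of positions where two signatures agree estimates Jaccard similarity."""
    if not sig_a or not sig_b:
        return 0.0
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Illustrative sentences that differ only in the last word.
a = word_ngrams("Бүгін ауа райы өте жақсы")
b = word_ngrams("Бүгін ауа райы өте жылы")
exact = len(a & b) / len(a | b)  # exact Jaccard over the bigram sets
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact:.2f} approx={approx:.2f}")
```

The estimate converges to the exact Jaccard value as the number of hash functions grows, which is what lets MinHash compare long texts through short, fixed-size signatures.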

Final score:

S = w_jaccard · Jaccard + w_lsa · Cosine_LSA + w_lda · Cosine_LDA
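The fusion formula can be sketched in a few lines of Python. The function names and the weight values below are hypothetical placeholders, since this card does not state the weights that were actually used. The LSA and LDA inputs are assumed to be the per-document vectors produced by the respective models.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_score(jaccard, lsa_a, lsa_b, lda_a, lda_b, weights=(0.4, 0.3, 0.3)):
    """S = w_jaccard * Jaccard + w_lsa * Cosine_LSA + w_lda * Cosine_LDA.

    The default weights are illustrative assumptions, not the published values.
    """
    w_jaccard, w_lsa, w_lda = weights
    return (
        w_jaccard * jaccard
        + w_lsa * cosine(lsa_a, lsa_b)
        + w_lda * cosine(lda_a, lda_b)
    )


# Toy vectors standing in for LSA embeddings and LDA topic distributions.
score = hybrid_score(
    jaccard=0.6,
    lsa_a=[0.2, 0.9], lsa_b=[0.25, 0.85],
    lda_a=[0.7, 0.3], lda_b=[0.6, 0.4],
)
print(f"hybrid score: {score:.3f}")
```

When the weights sum to 1 and the component similarities lie in [0, 1], the fused score also stays in [0, 1], which keeps a fixed decision threshold meaningful.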

Training Data

The model was evaluated using the KazakhTextDuplicates dataset:

  • Dataset: Arailym-tleubayeva/KazakhTextDuplicates
  • Each sample consists of an original text and its modified (paraphrased) version.

Evaluation Results

At a decision threshold of 0.7, the model achieves:

  • Precision: 1.00
  • Recall: 0.736
  • F1-score: 0.848

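At this threshold, pairs flagged as duplicates were correct in effectively all cases (precision 1.00), while about 26% of true duplicates scored below the threshold and were missed (recall 0.736). Because scoring relies only on MinHash, TF-IDF with SVD, and LDA, inference stays inexpensive compared to transformer-based models.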
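For readers who want to reproduce this kind of evaluation on their own pairs, the metrics can be computed as in the sketch below. The function name, the toy scores, and the choice of `>=` for the threshold comparison are assumptions, not details taken from the authors' evaluation code.

```python
def precision_recall_f1(scores, labels, threshold=0.7):
    """Compute precision, recall, and F1 for 'score >= threshold means duplicate'."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: hybrid scores and whether each pair is a true duplicate.
scores = [0.95, 0.82, 0.65, 0.40, 0.10]
labels = [True, True, True, False, False]
print(precision_recall_f1(scores, labels))

# Consistency check on the reported numbers: F1 is the harmonic mean of
# precision and recall, so 1.00 and 0.736 give approximately 0.848.
reported_f1 = 2 * 1.00 * 0.736 / (1.00 + 0.736)
print(round(reported_f1, 3))
```

Raising the threshold trades recall for precision, so a lower threshold may suit applications where missing a duplicate costs more than reviewing a false alarm.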

Intended Use

  • Plagiarism detection
  • Duplicate content detection
  • Text similarity analysis for Kazakh NLP
  • Educational and research applications

Limitations

  • Designed for Kazakh language only
  • Requires preprocessing consistent with training setup
  • Not a generative model

Implementation

The model is implemented in Python using:

  • scikit-learn
  • datasketch
  • NumPy / Pandas

Authors

Svitlana Biloshchytska¹²*, Arailym Tleubayeva³*, Oleksandr Kuchanskyi¹⁴⁵*,
Andrii Biloshchytskyi¹², Yurii Andrashko⁶, Sapar Toxanov⁷,
Aidos Mukhatayev⁸, Saltanat Sharipova⁹

¹ Department of Computational and Data Science, Astana IT University, Kazakhstan
² Kyiv National University of Construction and Architecture, Ukraine
³ Department of Computer Engineering, Astana IT University, Kazakhstan
⁴ Uzhhorod National University, Ukraine
⁵ Igor Sikorsky Kyiv Polytechnic Institute, Ukraine
⁶ Uzhhorod National University, Ukraine
⁷ Astana IT University, Kazakhstan
⁸ Astana IT University, Kazakhstan
⁹ Astana IT University, Kazakhstan

Citation

Biloshchytska, S., Tleubayeva, A., Kuchanskyi, O., Biloshchytskyi, A., Andrashko, Y., Toxanov, S., Mukhatayev, A., & Sharipova, S. (2025). Text Similarity Detection in Agglutinative Languages: A Case Study of Kazakh Using Hybrid N-Gram and Semantic Models. Applied Sciences, 15(12), 6707. https://doi.org/10.3390/app15126707

How to Download

Using huggingface_hub

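from huggingface_hub import hf_hub_download

repo_id = "Arailym-tleubayeva/kazakh-hybrid-duplicate-detector"

# Each call downloads one file into the local Hugging Face cache
# and returns the path to the cached copy.
tfidf_path = hf_hub_download(repo_id, "tfidf_vectorizer.joblib")  # TF-IDF vectorizer
lsa_path   = hf_hub_download(repo_id, "lsa_svd.joblib")           # Truncated SVD (LSA)
lda_path   = hf_hub_download(repo_id, "lda_model.joblib")         # LDA topic model
meta_path  = hf_hub_download(repo_id, "meta.json")                # model metadata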
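
Once the files are downloaded, they can be loaded along the lines of the sketch below. It assumes that the `.joblib` files were saved with joblib and that `meta.json` is plain JSON. The helper name `load_artifacts` and the dictionary keys are hypothetical, and the structure of `meta.json` is not documented here.

```python
import json

import joblib


def load_artifacts(tfidf_path, lsa_path, lda_path, meta_path):
    """Load the fitted components and metadata from local file paths."""
    with open(meta_path, encoding="utf-8") as fh:
        meta = json.load(fh)
    return {
        "tfidf": joblib.load(tfidf_path),  # assumed: fitted TF-IDF vectorizer
        "lsa": joblib.load(lsa_path),      # assumed: fitted Truncated SVD
        "lda": joblib.load(lda_path),      # assumed: fitted LDA model
        "meta": meta,                      # contents not documented in this card
    }
```

A call such as `load_artifacts(tfidf_path, lsa_path, lda_path, meta_path)` would take the paths returned by `hf_hub_download` above. Since joblib files are pickle-based, load them only from sources you trust, ideally with a scikit-learn version compatible with the one used for saving.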