Vietnamese Name Gender Classification

Model name: vn-gender-name-classification
Model type: Classical ML (TF-IDF + Logistic Regression + Linear SVM)
Task: Gender classification from Vietnamese full names
Author: Le Hong Duc - [email protected] - https://www.linkedin.com/in/hongduc96/


1. Overview

This model predicts binary gender (Nam / Nữ – male/female) from Vietnamese full names. It is trained on a dataset of Vietnamese names labeled with binary gender, using character n-gram TF-IDF features and two base models:

  • Logistic Regression (LR)
  • Linear SVM

This model is designed to be:

  • Lightweight – classical ML, no GPUs required
  • Easy to integrate – scikit-learn pipelines saved as .joblib
  • Practical – includes a clear production decision rule and an ambiguity flag

Overall accuracy on the held-out test set is ~96.5%.

2. Training data

  • Source: a collection of Vietnamese full names with gender labels, gathered from a public website
  • Labels: Giới tính ("gender") ∈ {Nam, Nữ} (binary male/female)
  • Size:
    • Training: ~50,000 labeled names
    • Test: ~15,000 names (held-out evaluation set)

2.1. Preprocessing

For each name:

  • Normalize whitespace (strip, collapse multiple spaces).
  • Lowercase the text, but keep Vietnamese accents (diacritics).
  • Create an additional variant name_wo_surname by:
    • Splitting on spaces.
    • Dropping the first token (surname).
    • Keeping middle + given name.

Example, for "Nguyễn Hoàng Phúc":

  • full_name_clean = "nguyễn hoàng phúc"
  • name_wo_surname = "hoàng phúc"

Both forms are used for training to make the model more robust to how names are provided.
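
The preprocessing steps above can be sketched in a few lines of Python. The function names normalize_name_keep_accent and drop_surname are the ones this card uses in section 3.3, but the bodies below are only an assumption about how they might be written. In particular, returning a one-token name unchanged is a guess; the card does not say what happens in that case.

```python
import re


def normalize_name_keep_accent(full_name_raw: str) -> str:
    """Strip, collapse runs of whitespace, and lowercase; diacritics are kept."""
    return re.sub(r"\s+", " ", full_name_raw.strip()).lower()


def drop_surname(full_name_clean: str) -> str:
    """Drop the first token (surname) and keep middle + given name."""
    tokens = full_name_clean.split(" ")
    # Assumption: a one-token name has no surname to drop, so return it unchanged.
    return " ".join(tokens[1:]) if len(tokens) > 1 else full_name_clean


full_name_clean = normalize_name_keep_accent("  Nguyễn   Hoàng  Phúc ")
name_wo_surname = drop_surname(full_name_clean)
print(full_name_clean)   # nguyễn hoàng phúc
print(name_wo_surname)   # hoàng phúc
```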

2.2. Evaluation

Results on the held-out test set (15,046 names):

  • Accuracy: ~96.5%
  • Proportion of names marked ambiguous: ~2.6%

Class          Precision  Recall  F1-score  Support
Nam            0.956      0.964   0.960     6637
Nữ             0.972      0.965   0.968     8409
Accuracy                          0.965     15046
Macro avg      0.964      0.965   0.964     15046
Weighted avg   0.965      0.965   0.965     15046

Confusion matrix (Nam/Nữ) – PROD:

Actual \ Predicted   Nam    Nữ
Nam                  6401   236
Nữ                   298    8111

  • Nam: 6,401 correctly predicted vs. 236 misclassified as “Nữ”.
  • Nữ: 8,111 correctly predicted vs. 298 misclassified as “Nam”.

Number of names marked ambiguous by PROD: 395 / 15046 (2.63%)
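
As a consistency check, the per-class metrics can be recomputed from the four confusion-matrix counts alone. The sketch below uses only the numbers reported in this section; the helper name prf is made up for illustration.

```python
# Counts copied from the confusion matrix (rows = actual, columns = predicted).
nam_as_nam, nam_as_nu = 6401, 236
nu_as_nam, nu_as_nu = 298, 8111

total = nam_as_nam + nam_as_nu + nu_as_nam + nu_as_nu


def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (nam_as_nam + nu_as_nu) / total
# For "Nam", false positives are Nữ names predicted as Nam, and vice versa.
nam = prf(tp=nam_as_nam, fp=nu_as_nam, fn=nam_as_nu)
nu = prf(tp=nu_as_nu, fp=nam_as_nu, fn=nu_as_nam)

print(f"accuracy = {accuracy:.3f}")
print("Nam: P={:.3f} R={:.3f} F1={:.3f}".format(*nam))
print("Nữ:  P={:.3f} R={:.3f} F1={:.3f}".format(*nu))
print(f"ambiguous = {395 / total:.2%}")
```

Rounded to three decimals, these values agree with the precision, recall, F1 and accuracy figures reported above.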

3. Model architecture

3.1. Features

Both LR and SVM use the same character-level TF-IDF features:

  • TfidfVectorizer with:
    • analyzer="char_wb" (character n-grams within word boundaries)
    • ngram_range=(2, 6) (bigrams to 6-grams)
    • min_df=2 (ignore n-grams that appear in fewer than 2 documents)

This allows the model to learn patterns from substrings of names, which is very effective for Vietnamese given names.
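
To make the char_wb setting concrete, here is a simplified pure-Python approximation of how such an analyzer cuts a name into n-grams. It is an illustration only, not scikit-learn's code: the function name char_wb_ngrams is hypothetical, and TF-IDF weighting and the min_df filter are left out.

```python
def char_wb_ngrams(text: str, min_n: int = 2, max_n: int = 6) -> list[str]:
    """Approximate a char_wb analyzer: n-grams never span two words."""
    grams = []
    for word in text.split():
        padded = f" {word} "  # one space on each side marks the word boundary
        # Only sizes that fit inside the padded word produce n-grams.
        for n in range(min_n, min(max_n, len(padded)) + 1):
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


grams = char_wb_ngrams("hoàng phúc")
print(len(grams))
print([g for g in grams if "phúc" in g])
```

Because each word is padded separately, no n-gram mixes characters from "hoàng" and "phúc", while the leading and trailing spaces let the model tell a prefix or suffix apart from the same letters in the middle of a word.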

3.2. Base models

Two scikit-learn pipelines are trained:

  • model_lr.joblib
    • TfidfVectorizer + LogisticRegression(max_iter=1000, n_jobs=-1)
    • Supports predict_proba for probability estimates.
  • model_svm.joblib
    • TfidfVectorizer + LinearSVC
    • Strong linear classifier, but exposes no predict_proba, so it contributes only class labels (no probabilities).
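
A minimal sketch of how two such pipelines could be built, fitted and saved with scikit-learn is shown below. The eight names are made-up toy data, the file name model_lr_toy.joblib is hypothetical, and the actual training script may differ. The vectorizer settings are the ones listed in section 3.1.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy data for illustration only; the real training set has ~50,000 names.
names = [
    "nguyễn văn hùng", "trần văn nam", "lê văn dũng", "phạm văn minh",
    "nguyễn thị lan", "trần thị hoa", "lê thị mai", "phạm thị hương",
]
labels = ["Nam"] * 4 + ["Nữ"] * 4


def make_vectorizer() -> TfidfVectorizer:
    # Settings taken from section 3.1.
    return TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 6), min_df=2)


# The card also sets n_jobs=-1 on LogisticRegression; it is left out here
# because it makes no difference on toy data.
model_lr = make_pipeline(make_vectorizer(), LogisticRegression(max_iter=1000))
model_svm = make_pipeline(make_vectorizer(), LinearSVC())
model_lr.fit(names, labels)
model_svm.fit(names, labels)

query = ["đỗ thị thu"]
print(model_lr.classes_)               # column order of predict_proba
print(model_lr.predict_proba(query))   # one row of class probabilities
print(model_svm.predict(query))        # class label only

# A fitted pipeline round-trips through joblib as a single object.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_lr_toy.joblib")
    joblib.dump(model_lr, path)
    reloaded = joblib.load(path)
```

Saving the vectorizer and classifier together in one pipeline means callers pass raw (preprocessed) strings to predict and never handle the feature matrix themselves.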

3.3. Production ensemble logic (simplified)

For an input full name full_name_raw:

  1. Preprocessing

    • full_name_clean = normalize_name_keep_accent(full_name_raw)
    • name_wo_surname = drop_surname(full_name_clean)
  2. SVM predictions

    • SVM_FULL = model_svm.predict([full_name_clean])
    • SVM_WO = model_svm.predict([name_wo_surname])
  3. LR combined probabilities

    • Get probabilities from LR on both text variants:
      • p_Nam_full, p_Nu_full = LR(full_name_clean)
      • p_Nam_wo, p_Nu_wo = LR(name_wo_surname)
    • Average the Nam probability:
      • p_Nam_avg = 0.5 * (p_Nam_full + p_Nam_wo)
      • p_Nu_avg = 1.0 - p_Nam_avg
    • LR_label = "Nam" if p_Nam_avg >= p_Nu_avg else "Nữ"
    • LR_conf = max(p_Nam_avg, p_Nu_avg)
  4. Decision rule

    • If SVM_FULL == SVM_WO (SVMs agree):
      • final_label = SVM_FULL
      • Mark as ambiguous if LR_conf < lr_conf_clear (default 0.60).
    • If SVM_FULL != SVM_WO (SVMs disagree, hard case):
      • If LR_conf >= lr_conf_as_referee (default 0.70), follow LR: final_label = LR_label
      • Otherwise, default to SVM_WO: final_label = SVM_WO
      • Mark as ambiguous = True.

The final output includes:

  • final_label – production gender prediction ("Nam" or "Nữ")
  • LR_conf – LR combined confidence
  • is_ambiguous – whether this name is considered ambiguous by the ensemble
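
Steps 3 and 4 can be expressed as one small function. The sketch below assumes the two SVM labels and the two LR "Nam" probabilities have already been computed; the names decide and Decision are hypothetical, while the thresholds and their defaults are those given above.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    final_label: str
    LR_conf: float
    is_ambiguous: bool


def decide(svm_full: str, svm_wo: str, p_nam_full: float, p_nam_wo: float,
           lr_conf_clear: float = 0.60,
           lr_conf_as_referee: float = 0.70) -> Decision:
    # Step 3: average the LR "Nam" probability over both name variants.
    p_nam_avg = 0.5 * (p_nam_full + p_nam_wo)
    p_nu_avg = 1.0 - p_nam_avg
    lr_label = "Nam" if p_nam_avg >= p_nu_avg else "Nữ"
    lr_conf = max(p_nam_avg, p_nu_avg)

    # Step 4a: SVMs agree, so keep their label and flag only low LR confidence.
    if svm_full == svm_wo:
        return Decision(svm_full, lr_conf, lr_conf < lr_conf_clear)

    # Step 4b: SVMs disagree. LR referees if confident enough; otherwise
    # fall back to SVM_WO. Either way the name is flagged as ambiguous.
    final_label = lr_label if lr_conf >= lr_conf_as_referee else svm_wo
    return Decision(final_label, lr_conf, True)


print(decide("Nam", "Nam", 0.90, 0.80))   # SVMs agree, LR confident
print(decide("Nam", "Nữ", 0.20, 0.30))    # SVMs disagree, LR referees
print(decide("Nam", "Nữ", 0.55, 0.45))    # SVMs disagree, LR unsure
```

In the three calls, the first keeps "Nam" without an ambiguity flag, the second follows LR to "Nữ", and the third falls back to SVM_WO ("Nữ"); the last two are both flagged ambiguous because the SVMs disagree.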

Intended use & limitations

Intended use

  • Analytics and statistical exploration of Vietnamese name data.
  • Educational demos of classical ML methods on Vietnamese text.
  • Lightweight baseline for experiments involving Vietnamese full names.

Limitations

  • The model predicts only binary gender (Nam / Nữ) from naming patterns.
  • It does not capture the full diversity of gender identities or self-determined gender.
  • It may be inaccurate or biased for:
    • Rare or gender-neutral names
    • Non-Vietnamese names, abbreviations, nicknames
    • Names from under-represented regions or time periods in the training data

This model should not be used for high-stakes decisions that affect individual rights, access to services, or opportunities, or in any context where misclassification could cause harm or discrimination.
