Vietnamese Name Gender Classification

Model name: vn-gender-name-classification
Model type: Classical ML (TF-IDF + Logistic Regression + Linear SVM)
Task: Gender classification from Vietnamese full names
Author: Le Hong Duc - [email protected] - https://www.linkedin.com/in/hongduc96/

1. Overview

This model predicts binary gender (Nam / Nữ – male/female) from Vietnamese full names. It is trained on a dataset of Vietnamese names labeled with binary gender, using character n-gram TF-IDF features and two base models:

- Logistic Regression (LR)
- Linear SVM

This model is designed to be:
- Lightweight – classical ML, no GPUs required
- Easy to integrate – scikit-learn pipelines saved as .joblib
- Practical – includes a clear production decision rule and an ambiguity flag

Model accuracy: ~96.5% on the held-out test set

2. Training data

Source: a collection of Vietnamese full names with gender labels, taken from a public website
Labels: Giới tính ∈ {Nam, Nữ} (binary male/female)
Size:
- Training: ~50,000 labeled names
- Test: ~15,000 names (held-out evaluation set)

2.1. Preprocessing

For each name:

Normalize whitespace (strip, collapse multiple spaces).
Lowercase the text, but keep Vietnamese accents (diacritics).
Create an additional variant name_wo_surname by:
- Splitting on spaces.
- Dropping the first token (surname).
- Keeping middle + given name.

Example: "Nguyễn Hoàng Phúc" →
- full_name_clean = "nguyễn hoàng phúc"
- name_wo_surname = "hoàng phúc"

Both forms are used for training to make the model more robust to how names are provided.
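The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names normalize_name_keep_accent and drop_surname come from the ensemble description in section 3.3, but their bodies are assumptions, as is the choice to return a single-token name unchanged.

```python
import re


def normalize_name_keep_accent(full_name_raw: str) -> str:
    """Strip, collapse runs of whitespace, and lowercase; diacritics are kept."""
    collapsed = re.sub(r"\s+", " ", full_name_raw.strip())
    return collapsed.lower()


def drop_surname(full_name_clean: str) -> str:
    """Drop the first token (surname) and keep middle + given name."""
    tokens = full_name_clean.split(" ")
    # Assumption: a one-token name has no separate surname, so keep it as is.
    return " ".join(tokens[1:]) if len(tokens) > 1 else full_name_clean


full_name_clean = normalize_name_keep_accent("  Nguyễn   Hoàng Phúc ")
name_wo_surname = drop_surname(full_name_clean)
print(full_name_clean)   # nguyễn hoàng phúc
print(name_wo_surname)   # hoàng phúc
```

Because str.lower() only changes case, accented letters such as "ễ" and "ú" survive normalization.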

2.2. Evaluation

Results on the held-out test set (~15K names):

Accuracy: ~96.5%

Proportion of names marked ambiguous: ~2.6%

Class	Precision	Recall	F1-score	Support
Nam	0.956	0.964	0.960	6637
Nữ	0.972	0.965	0.968	8409
Accuracy			0.965	15046
Macro avg	0.964	0.965	0.964	15046
Weighted avg	0.965	0.965	0.965	15046

Confusion matrix (Nam/Nữ) for the production ensemble (PROD):

Actual \ Predicted	Nam	Nữ
Nam	6401	236
Nữ	298	8111

Nam: 6,401 correctly predicted vs. 236 misclassified as “Nữ”.
Nữ: 8,111 correctly predicted vs. 298 misclassified as “Nam”.

Number of names marked ambiguous by PROD: 395 / 15,046 (2.63%)
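The per-class figures in the report can be re-derived from the confusion matrix. The sketch below does that arithmetic in plain Python; the variable and function names are illustrative only.

```python
# Confusion matrix from the report: cm[actual][predicted]
cm = {
    "Nam": {"Nam": 6401, "Nữ": 236},
    "Nữ": {"Nam": 298, "Nữ": 8111},
}
classes = list(cm)
total = sum(cm[a][p] for a in classes for p in classes)


def class_metrics(label):
    """Return (precision, recall, f1) for one class."""
    tp = cm[label][label]
    predicted = sum(cm[a][label] for a in classes)  # column sum
    actual = sum(cm[label].values())                # row sum (support)
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = sum(cm[c][c] for c in classes) / total
ambiguous_pct = 100 * 395 / total

for c in classes:
    p, r, f = class_metrics(c)
    print(c, round(p, 3), round(r, 3), round(f, 3))
print("accuracy", round(accuracy, 3), "ambiguous %", round(ambiguous_pct, 2))
```

Rounded to three decimals, these values match the table: for example, Nam precision is 6401 / (6401 + 298) ≈ 0.956.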

3. Model architecture

3.1. Features

Both LR and SVM use the same character-level TF-IDF features:

TfidfVectorizer with:
- analyzer="char_wb" (character n-grams within word boundaries)
- ngram_range=(2, 6) (bigrams to 6-grams)
- min_df=2 (ignore n-grams that appear in fewer than 2 documents)

This allows the model to learn patterns from substrings of names, which is very effective for Vietnamese given names.
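To see what these settings produce, the following sketch fits the vectorizer on the two cleaned variants of the example name from section 2.1. It assumes scikit-learn is installed; the two-document corpus is only for illustration and is far too small for real training.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same settings as listed above.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 6), min_df=2)

# Illustrative corpus: full name and surname-dropped variant.
docs = ["nguyễn hoàng phúc", "hoàng phúc"]
X = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_  # maps each kept n-gram to its column index

# char_wb pads every word with a space on both sides and never lets an
# n-gram span two words, so " phúc " is a valid 6-gram.
print(" phúc " in vocab)  # present in both documents, so it is kept
print("nguy" in vocab)    # present in only one document, dropped by min_df=2
print(X.shape)            # (number of documents, number of kept n-grams)
```

With min_df=2 on this tiny corpus, only n-grams from "hoàng" and "phúc" survive, which shows how the surname-dropped variant reinforces the given-name substrings.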

3.2. Base models

Two scikit-learn pipelines are trained:

model_lr.joblib
- TfidfVectorizer → LogisticRegression(max_iter=1000, n_jobs=-1)
- Supports predict_proba for probability estimates.
model_svm.joblib
- TfidfVectorizer → LinearSVC
- Strong linear classifier, but LinearSVC has no predict_proba: it returns class labels (and decision_function margins), not probabilities.
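A minimal sketch of how two such pipelines might be built and queried with scikit-learn follows. The toy names and labels are hypothetical stand-ins for the real training set, and make_vectorizer is a helper introduced here for illustration. The original passes n_jobs=-1 to LogisticRegression; it is left out here to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def make_vectorizer():
    # Hypothetical helper: one fresh vectorizer per pipeline.
    return TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 6), min_df=2)


# Hypothetical toy data (full names plus surname-dropped variants).
names = ["nguyễn văn nam", "văn nam", "trần văn hùng", "văn hùng",
         "lê thị lan", "thị lan", "phạm thị hoa", "thị hoa"]
labels = ["Nam", "Nam", "Nam", "Nam", "Nữ", "Nữ", "Nữ", "Nữ"]

model_lr = make_pipeline(make_vectorizer(), LogisticRegression(max_iter=1000))
model_svm = make_pipeline(make_vectorizer(), LinearSVC())
model_lr.fit(names, labels)
model_svm.fit(names, labels)

# LR exposes probabilities; columns follow model_lr.classes_.
proba = model_lr.predict_proba(["văn nam"])
# LinearSVC returns a label only.
svm_label = model_svm.predict(["văn nam"])[0]
print(list(model_lr.classes_), proba, svm_label)

# Saving to the file names used in this card could look like:
# import joblib; joblib.dump(model_lr, "model_lr.joblib")
```

Keeping the vectorizer inside each pipeline means a saved .joblib file carries its own fitted vocabulary, so callers pass raw cleaned strings rather than feature matrices.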

3.3. Production ensemble logic (simplified)

For an input full name full_name_raw:

Preprocessing
- full_name_clean = normalize_name_keep_accent(full_name_raw)
- name_wo_surname = drop_surname(full_name_clean)
SVM predictions
- SVM_FULL = model_svm.predict([full_name_clean])
- SVM_WO = model_svm.predict([name_wo_surname])
LR combined probabilities
- Get probabilities from LR on both text variants:
  - p_Nam_full, p_Nu_full = LR(full_name_clean)
  - p_Nam_wo, p_Nu_wo = LR(name_wo_surname)
- Average the Nam probability:
  - p_Nam_avg = 0.5 * (p_Nam_full + p_Nam_wo)
  - p_Nu_avg = 1.0 - p_Nam_avg
- LR_label = "Nam" if p_Nam_avg >= p_Nu_avg else "Nữ"
- LR_conf = max(p_Nam_avg, p_Nu_avg)
Decision rule

If SVM_FULL == SVM_WO (SVMs agree):
- final_label = SVM_FULL
- Mark as ambiguous if LR_conf < lr_conf_clear (default 0.60).
If SVM_FULL != SVM_WO (SVMs disagree, hard case):
- If LR_conf >= lr_conf_as_referee (default 0.70), follow LR:
  final_label = LR_label
- Otherwise, default to SVM_WO:
  final_label = SVM_WO
- Mark as ambiguous = True.

The final output includes:
final_label – production gender prediction ("Nam" or "Nữ")
LR_conf – LR combined confidence
is_ambiguous – whether this name is considered ambiguous by the ensemble
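The decision rule can be written as a small pure function. The sketch below takes the two SVM labels and the two LR "Nam" probabilities as inputs, so it runs without the trained models. The function name decide and the dictionary output are assumptions made for illustration, and it reads the rule as marking every SVM disagreement as ambiguous.

```python
def decide(svm_full, svm_wo, p_nam_full, p_nam_wo,
           lr_conf_clear=0.60, lr_conf_as_referee=0.70):
    """Combine SVM labels and LR probabilities into the final prediction."""
    # Average the LR probability of "Nam" over both text variants.
    p_nam_avg = 0.5 * (p_nam_full + p_nam_wo)
    p_nu_avg = 1.0 - p_nam_avg
    lr_label = "Nam" if p_nam_avg >= p_nu_avg else "Nữ"
    lr_conf = max(p_nam_avg, p_nu_avg)

    if svm_full == svm_wo:
        # SVMs agree: trust them, flag only if LR is unsure.
        final_label = svm_full
        is_ambiguous = lr_conf < lr_conf_clear
    else:
        # SVMs disagree: LR acts as referee when confident enough,
        # otherwise fall back to the surname-dropped SVM prediction.
        final_label = lr_label if lr_conf >= lr_conf_as_referee else svm_wo
        is_ambiguous = True

    return {"final_label": final_label, "LR_conf": lr_conf,
            "is_ambiguous": is_ambiguous}


print(decide("Nam", "Nam", 0.90, 0.80))  # SVMs agree, LR confident
print(decide("Nam", "Nữ", 0.60, 0.50))   # SVMs disagree, LR unsure
```

In the second call the averaged LR confidence is 0.55, below the 0.70 referee threshold, so the result falls back to SVM_WO and is flagged as ambiguous.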

Intended use & limitations

Intended use

Analytics and statistical exploration of Vietnamese name data.
Educational demos of classical ML methods on Vietnamese text.
Lightweight baseline for experiments involving Vietnamese full names.

Limitations

The model predicts only binary gender (Nam / Nữ) from naming patterns.
It does not capture the full diversity of gender identities or self-determined gender.
It may be inaccurate or biased for:
- Rare or gender-neutral names
- Non-Vietnamese names, abbreviations, nicknames
- Names from under-represented regions or time periods in the training data

This model should not be used for high-stakes decisions that affect individual rights, access to services, or opportunities, or in any context where misclassification could cause harm or discrimination.
