Vietnamese Name Gender Classification
Model name: vn-gender-name-classification
Model type: Classical ML (TF-IDF + Logistic Regression + Linear SVM)
Task: Gender classification from Vietnamese full names
Author: Le Hong Duc - [email protected] - https://www.linkedin.com/in/hongduc96/
1. Overview
This model predicts binary gender (Nam / Nữ – male/female) from Vietnamese full names. It is trained on a dataset of Vietnamese names labeled with binary gender, using character n-gram TF-IDF features and two base models:
- Logistic Regression (LR)
- Linear SVM
This model is designed to be:
- Lightweight – classical ML, no GPUs required
- Easy to integrate – scikit-learn pipelines saved as .joblib
- Practical – includes a clear production decision rule and an ambiguity flag
- Accurate – about 96.5% accuracy on the held-out test set
2. Training data
- Source: a collection of Vietnamese full names with gender labels, taken from a public website
- Labels: Giới tính (gender) ∈ {Nam, Nữ} (binary male/female)
- Size:
  - Training: ~50,000 labeled names
  - Test: ~15,000 names (held-out evaluation set)
2.1. Preprocessing
For each name:
- Normalize whitespace (strip, collapse multiple spaces).
- Lowercase the text, but keep Vietnamese accents (diacritics).
- Create an additional variant, name_wo_surname, by:
  - Splitting on spaces.
  - Dropping the first token (surname).
  - Keeping middle + given name.

Example: "Nguyễn Hoàng Phúc" →
- full_name_clean = "nguyễn hoàng phúc"
- name_wo_surname = "hoàng phúc"

Both forms are used for training to make the model more robust to how names are provided.
2.2. Evaluation
Results on the held-out test set (15,046 names):
- Accuracy: ~96.5%
- Proportion of names marked ambiguous: ~2.6%
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Nam | 0.956 | 0.964 | 0.960 | 6637 |
| Nữ | 0.972 | 0.965 | 0.968 | 8409 |
| Accuracy | | | 0.965 | 15046 |
| Macro avg | 0.964 | 0.965 | 0.964 | 15046 |
| Weighted avg | 0.965 | 0.965 | 0.965 | 15046 |
Confusion matrix (Nam/Nữ) for the production ensemble (PROD):
| Actual \ Predicted | Nam | Nữ |
|---|---|---|
| Nam | 6401 | 236 |
| Nữ | 298 | 8111 |
- Nam: 6,401 correctly predicted vs. 236 misclassified as “Nữ”.
- Nữ: 8,111 correctly predicted vs. 298 misclassified as “Nam”.
Number of names marked ambiguous by PROD: 395 / 15,046 (2.63%)
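As a cross-check, the per-class metrics in the report can be recomputed from the confusion matrix alone. The sketch below uses only the four counts and the ambiguous-name count given above; the helper names (precision, recall, f1) are illustrative and not part of the released code.

```python
# Confusion matrix from the table above: rows = actual, columns = predicted.
cm = {
    "Nam": {"Nam": 6401, "Nữ": 236},
    "Nữ": {"Nam": 298, "Nữ": 8111},
}
labels = list(cm)
total = sum(sum(row.values()) for row in cm.values())
accuracy = sum(cm[c][c] for c in labels) / total


def precision(label: str) -> float:
    """Correct predictions of `label` divided by all predictions of `label`."""
    predicted_as_label = sum(cm[actual][label] for actual in labels)
    return cm[label][label] / predicted_as_label


def recall(label: str) -> float:
    """Correct predictions of `label` divided by all actual `label` rows."""
    return cm[label][label] / sum(cm[label].values())


def f1(label: str) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(label), recall(label)
    return 2 * p * r / (p + r)


ambiguous_share = 395 / total

print(f"total = {total}, accuracy = {accuracy:.3f}")
for label in labels:
    print(f"{label}: P = {precision(label):.3f}, R = {recall(label):.3f}, F1 = {f1(label):.3f}")
print(f"ambiguous = {ambiguous_share:.2%}")
```

Rounded to three decimals, these values agree with the classification report.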
3. Model architecture
3.1. Features
Both LR and SVM use the same character-level TF-IDF features:
TfidfVectorizer with:
- analyzer="char_wb" (character n-grams within word boundaries)
- ngram_range=(2, 6) (bigrams to 6-grams)
- min_df=2 (ignore n-grams that appear in fewer than 2 documents)

This allows the model to learn patterns from substrings of names, which is very effective for Vietnamese given names.
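To make "character n-grams within word boundaries" concrete, here is a small pure-Python illustration of roughly how such n-grams are produced: each whitespace-separated word is padded with one space on each side, and n-grams never span two words. The helper char_wb_ngrams is hypothetical and only approximates the analyzer; the model itself relies on scikit-learn's TfidfVectorizer, which also applies TF-IDF weighting and the min_df filter.

```python
def char_wb_ngrams(text: str, min_n: int = 2, max_n: int = 6) -> list[str]:
    """Character n-grams taken inside each word, with one space of padding per side."""
    ngrams = []
    for word in text.split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            if n > len(padded):
                break  # the padded word is shorter than n: nothing more to extract
            ngrams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return ngrams


print(char_wb_ngrams("lan", 2, 3))
# [' l', 'la', 'an', 'n ', ' la', 'lan', 'an ']
```

The padded forms such as "an " mark a substring as a word ending, which is how suffixes of given names can become distinct features.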
3.2. Base models
Two scikit-learn pipelines are trained:
- model_lr.joblib
  - TfidfVectorizer → LogisticRegression(max_iter=1000, n_jobs=-1)
  - Supports predict_proba for probability estimates.
- model_svm.joblib
  - TfidfVectorizer → LinearSVC
  - Strong linear classifier, but outputs only class labels (no probabilities).
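A minimal sketch of how two such pipelines might be assembled with scikit-learn is shown below. The six names are a made-up toy sample for illustration only, not the real training data. The n_jobs=-1 setting from the configuration above is left out here because it only controls parallelism. The helper make_vectorizer is an assumption of this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def make_vectorizer() -> TfidfVectorizer:
    # Feature settings as listed in section 3.1.
    return TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 6), min_df=2)


# Tiny made-up sample (assumption), already lowercased.
names = [
    "nguyễn văn hùng", "trần văn dũng", "lê văn nam",
    "phạm thị lan", "nguyễn thị hoa", "trần thị mai",
]
labels = ["Nam", "Nam", "Nam", "Nữ", "Nữ", "Nữ"]

model_lr = make_pipeline(make_vectorizer(), LogisticRegression(max_iter=1000))
model_svm = make_pipeline(make_vectorizer(), LinearSVC())
model_lr.fit(names, labels)
model_svm.fit(names, labels)

query = ["hoàng văn minh"]
print(model_svm.predict(query))       # class label only
print(model_lr.classes_)              # column order used by predict_proba
print(model_lr.predict_proba(query))  # one probability per class
```

A fitted pipeline can be written to disk with joblib.dump and read back with joblib.load, which matches the .joblib file names above.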
3.3. Production ensemble logic (simplified)
For an input full name full_name_raw:
Preprocessing
- full_name_clean = normalize_name_keep_accent(full_name_raw)
- name_wo_surname = drop_surname(full_name_clean)
SVM predictions
- SVM_FULL = model_svm.predict([full_name_clean])
- SVM_WO = model_svm.predict([name_wo_surname])
LR combined probabilities
- Get probabilities from LR on both text variants:
  - p_Nam_full, p_Nu_full = LR(full_name_clean)
  - p_Nam_wo, p_Nu_wo = LR(name_wo_surname)
- Average the Nam probability:
  - p_Nam_avg = 0.5 * (p_Nam_full + p_Nam_wo)
  - p_Nu_avg = 1.0 - p_Nam_avg
- Derive the LR label and confidence:
  - LR_label = "Nam" if p_Nam_avg >= p_Nu_avg else "Nữ"
  - LR_conf = max(p_Nam_avg, p_Nu_avg)
Decision rule
- If SVM_FULL == SVM_WO (SVMs agree):
  - final_label = SVM_FULL
  - Mark as ambiguous if LR_conf < lr_conf_clear (default 0.60).
- If SVM_FULL != SVM_WO (SVMs disagree, hard case):
  - If LR_conf >= lr_conf_as_referee (default 0.70), follow LR: final_label = LR_label
  - Otherwise, default to SVM_WO: final_label = SVM_WO
  - Mark as ambiguous = True.

The final output includes:
- final_label – production gender prediction ("Nam" or "Nữ")
- LR_conf – LR combined confidence
- is_ambiguous – whether this name is considered ambiguous by the ensemble
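The decision rule can be written as a small pure function, which makes the thresholds easy to test in isolation. This is a sketch of the logic as described above, not the author's implementation. The names Prediction, combine_lr, and decide are hypothetical. The function takes the two SVM labels and the two LR "Nam" probabilities as plain values, and it reads the rule as marking every SVM disagreement ambiguous.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    final_label: str
    lr_conf: float
    is_ambiguous: bool


def combine_lr(p_nam_full: float, p_nam_wo: float) -> tuple[str, float]:
    """Average the LR 'Nam' probability over both name variants."""
    p_nam_avg = 0.5 * (p_nam_full + p_nam_wo)
    p_nu_avg = 1.0 - p_nam_avg
    lr_label = "Nam" if p_nam_avg >= p_nu_avg else "Nữ"
    return lr_label, max(p_nam_avg, p_nu_avg)


def decide(
    svm_full: str,
    svm_wo: str,
    p_nam_full: float,
    p_nam_wo: float,
    lr_conf_clear: float = 0.60,
    lr_conf_as_referee: float = 0.70,
) -> Prediction:
    lr_label, lr_conf = combine_lr(p_nam_full, p_nam_wo)
    if svm_full == svm_wo:
        # SVMs agree: keep their label, flag only when LR is unsure.
        return Prediction(svm_full, lr_conf, lr_conf < lr_conf_clear)
    # SVMs disagree: LR referees if confident enough, else fall back to SVM_WO.
    final_label = lr_label if lr_conf >= lr_conf_as_referee else svm_wo
    return Prediction(final_label, lr_conf, True)


print(decide("Nam", "Nam", 0.90, 0.80))  # agreement, confident LR
print(decide("Nam", "Nữ", 0.20, 0.30))   # disagreement, LR acts as referee
print(decide("Nam", "Nữ", 0.70, 0.60))   # disagreement, falls back to SVM_WO
```

In practice the two probabilities would come from the "Nam" column of model_lr.predict_proba, and the SVM labels from the first element of each model_svm.predict result.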
Intended use & limitations
Intended use
- Analytics and statistical exploration of Vietnamese name data.
- Educational demos of classical ML methods on Vietnamese text.
- Lightweight baseline for experiments involving Vietnamese full names.
Limitations
- The model predicts only binary gender (Nam / Nữ) from naming patterns.
- It does not capture the full diversity of gender identities or self-determined gender.
- It may be inaccurate or biased for:
  - Rare or gender-neutral names
  - Non-Vietnamese names, abbreviations, nicknames
  - Names from under-represented regions or time periods in the training data
This model should not be used for high-stakes decisions that affect individual rights, access to services, or opportunities, or in any context where misclassification could cause harm or discrimination.