# distilbert-base-uncased-name-classifier
This model is a fine-tuned version of distilbert/distilbert-base-uncased on the ele-sage/person-company-names-classification dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0225
- Accuracy: 0.9939
- Precision: 0.9979
- Recall: 0.9912
- F1: 0.9945
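For reference, metrics of this shape are typically produced by a `compute_metrics` callback passed to the `Trainer`. Below is a minimal sketch assuming scikit-learn; treating the binary average as the positive-class metric is an assumption, since the card does not state which label is positive.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair from the Trainer's evaluation loop.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # average="binary" reports metrics for the positive class; which label
    # counts as positive here is an assumption not recorded in the card.
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```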
## Model description
This model is a high-performance binary text classifier, fine-tuned from distilbert-base-uncased.
Its purpose is to distinguish between a person's name and a company/organization name with high accuracy.
## Direct Use
This model is intended for text classification: given a string, it returns a label indicating whether the string is a Person or a Company, along with a confidence score.
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="ele-sage/distilbert-base-uncased-name-classifier")

texts = [
    "Satya Nadella",
    "Global Innovations Inc.",
    "Martinez, Alonso",
]
results = classifier(texts)

# The pipeline output contains only 'label' and 'score', so pair each
# result with its input text when printing.
for text, result in zip(texts, results):
    print(f"Text: '{text}', Prediction: {result['label']}, Score: {result['score']:.4f}")
```
## Downstream Use
This model is a key component of a two-stage name processing pipeline. It is designed to be used as a fast, efficient "gatekeeper" to first identify person names before passing them to a more complex parsing model, such as ele-sage/distilbert-base-uncased-name-splitter.
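A minimal sketch of that two-stage setup is shown below. The splitter's task type (assumed here to be token classification) and its output format are assumptions; consult that model's card before relying on this.

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="ele-sage/distilbert-base-uncased-name-classifier",
)
# Assumption: the splitter is a token-classification model; check its
# model card for the actual task type and label schema.
splitter = pipeline(
    "token-classification",
    model="ele-sage/distilbert-base-uncased-name-splitter",
)

def process(name: str):
    prediction = classifier(name)[0]
    if prediction["label"] == "Person":  # label string as described in this card
        return {"type": "person", "parts": splitter(name)}
    return {"type": "company", "parts": None}
```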
## Out-of-Scope Use
- This model is not a general-purpose classifier. It is highly specialized for distinguishing persons from companies and will not perform well on other classification tasks (e.g., sentiment analysis).
## Bias, Risks, and Limitations
- Geographic & Cultural Bias: The training data is heavily biased towards North American (Canadian) person names and Quebec-based company names. The model will be less accurate when classifying names from other cultural or geographic origins.
- Ambiguity: Certain names can legitimately be both a person's name and a company's name (e.g., "Ford"). In these cases, the model makes a statistical guess based on its training data, which may not always align with the specific context; one mitigation is sketched after this list.
- Data Source: The person name data is derived from a Facebook data leak and contains noise. While a rigorous cleaning process was applied, the model may have learned from some spurious data.
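One way to handle the ambiguity noted above is to treat low-confidence predictions as "ambiguous" instead of trusting the argmax label. A minimal sketch, where the 0.9 cutoff is an illustrative value rather than a tuned threshold:

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="ele-sage/distilbert-base-uncased-name-classifier",
)

def classify_with_threshold(name: str, threshold: float = 0.9):
    # The 0.9 cutoff is illustrative, not tuned; calibrate it on your own data.
    result = classifier(name)[0]
    if result["score"] < threshold:
        return "Ambiguous"
    return result["label"]

print(classify_with_threshold("Ford"))  # a name that can be either class
```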
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 256
- eval_batch_size: 256
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 2
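The list above maps directly onto the standard `TrainingArguments`. A minimal sketch of an equivalent configuration; the `output_dir` and any evaluation/save cadence are assumptions, since the card does not record them:

```python
from transformers import TrainingArguments

# Reconstructed from the hyperparameter list above; output_dir is an
# assumption, and eval/save settings are omitted because the card does
# not record them.
training_args = TrainingArguments(
    output_dir="distilbert-base-uncased-name-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    num_train_epochs=2,
)
```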
### Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| 0.0323 | 0.1435 | 4000 | 0.0303 | 0.9915 | 0.9975 | 0.9872 | 0.9923 |
| 0.0297 | 0.2870 | 8000 | 0.0279 | 0.9923 | 0.9963 | 0.9899 | 0.9931 |
| 0.0283 | 0.4305 | 12000 | 0.0257 | 0.9929 | 0.9978 | 0.9895 | 0.9936 |
| 0.0229 | 0.5740 | 16000 | 0.0258 | 0.9932 | 0.9972 | 0.9905 | 0.9938 |
| 0.0263 | 0.7175 | 20000 | 0.0239 | 0.9934 | 0.9981 | 0.9901 | 0.9940 |
| 0.0256 | 0.8610 | 24000 | 0.0233 | 0.9935 | 0.9976 | 0.9908 | 0.9942 |
| 0.0230 | 1.0046 | 28000 | 0.0233 | 0.9936 | 0.9976 | 0.9909 | 0.9943 |
| 0.0214 | 1.1481 | 32000 | 0.0231 | 0.9937 | 0.9986 | 0.9902 | 0.9944 |
| 0.0207 | 1.2916 | 36000 | 0.0232 | 0.9938 | 0.9984 | 0.9905 | 0.9944 |
| 0.0215 | 1.4351 | 40000 | 0.0229 | 0.9938 | 0.9978 | 0.9910 | 0.9944 |
| 0.0206 | 1.5786 | 44000 | 0.0232 | 0.9938 | 0.9976 | 0.9913 | 0.9944 |
| 0.0197 | 1.7221 | 48000 | 0.0229 | 0.9939 | 0.9978 | 0.9912 | 0.9945 |
| 0.0216 | 1.8656 | 52000 | 0.0225 | 0.9939 | 0.9979 | 0.9912 | 0.9945 |
### Framework versions
- Transformers 4.57.1
- PyTorch 2.9.0+cu128
- Datasets 4.4.1
- Tokenizers 0.22.1