ele-sage committed on
Commit 371f899 · verified · 1 Parent(s): d0da7d1

distilbert-base-uncased-name-classifier
README.md CHANGED
@@ -1,147 +1,80 @@
  ---
- license: mit
- language:
- - en
- - fr
- base_model:
- - distilbert/distilbert-base-uncased
- datasets:
- - ele-sage/person-company-names-classification
  ---

- # Model Card for ele-sage/distilbert-base-uncased-name-classifier

- This model is a high-performance binary text classifier, fine-tuned from `distilbert-base-uncased`. Its purpose is to distinguish between a **person's name** and a **company/organization name** with high accuracy.

- This version has been specifically trained on an augmented dataset to be robust to various name formats, including `FirstName LastName`, `LastName, FirstName`, and the ambiguous `LastName FirstName`.

- ## Model Details

- ### Model Description

- - **Developed by:** ele-sage
- - **Model type:** `distilbert-for-sequence-classification`
- - **Language(s) (NLP):** English, French
- - **License:** MIT
- - **Finetuned from model:** `distilbert-base-uncased`

- ## Uses

- ### Direct Use

- This model is intended for text classification. Given a string, it returns a label indicating whether the string is a `Person` or a `Company`.

- ```python
- from transformers import pipeline

- # Load the classification pipeline
- classifier = pipeline("text-classification", model="ele-sage/distilbert-base-uncased-name-classifier")

- # The model was trained with LABEL_0 = Person, LABEL_1 = Company
- texts = [
-     "Satya Nadella",
-     "Global Innovations Inc.",
-     "Martinez, Alonso",  # Now correctly handled
- ]
- results = classifier(texts)

- # For clarity, map the labels (pipeline results contain only 'label' and 'score')
- label_map = {"LABEL_0": "Person", "LABEL_1": "Company"}
- for text, result in zip(texts, results):
-     print(f"Text: '{text}', Prediction: {label_map.get(result['label'])}, Score: {result['score']:.4f}")
- ```

- ### Downstream Use

- This model is a key component of a two-stage name-processing pipeline. It is designed to act as a fast, efficient "gatekeeper" that first identifies person names before passing them to a more complex parsing model, such as `ele-sage/distilbert-base-uncased-name-splitter`.

- ### Out-of-Scope Use

- - This model is not a general-purpose classifier. It is highly specialized for distinguishing persons from companies and will not perform well on other classification tasks (e.g., sentiment analysis).
- - It does **not** split or parse names. It only classifies the entire string.
-
- ## Bias, Risks, and Limitations
-
- - **Geographic & Cultural Bias:** The training data is heavily biased towards North American (Canadian) person names and Quebec-based company names. The model will be less accurate when classifying names from other cultural or geographic origins.
- - **Ambiguity:** Certain names can legitimately be both a person's name and a company's name (e.g., "Ford"). In these cases, the model makes a statistical guess based on its training data, which may not always align with the specific context.
- - **Data Source:** The person-name data is derived from a Facebook data leak and contains noise. While a rigorous cleaning process was applied, the model may have learned from some spurious data.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- ```python
- from transformers import pipeline
-
- # Load the pipeline
- classifier = pipeline("text-classification", model="ele-sage/distilbert-base-uncased-name-classifier")
-
- # Define a mapping for the labels to make the output readable
- label_map = {"LABEL_0": "Person", "LABEL_1": "Company"}
-
- def classify_name(name_str):
-     result = classifier(name_str)[0]
-     return label_map.get(result['label']), result['score']
-
- # --- Examples ---
- print(f"'Alonso Sarmiento Martinez' -> {classify_name('Alonso Sarmiento Martinez')}")
- print(f"'Microsoft Inc.' -> {classify_name('Microsoft Inc.')}")
- print(f"'Schwehr, Dave' -> {classify_name('Schwehr, Dave')}")
- print(f"'Ford' -> {classify_name('Ford')} (An ambiguous case)")
- ```
-
- ## Training Details
-
- ### Training Data
-
- The model was trained on the `ele-sage/person-company-names-classification` dataset, a custom-curated and balanced dataset of **7,892,165 examples** constructed from two primary sources:
-
- 1. **Person Names Source:** An AI-cleaned subset of a large CSV file of Canadian names, originally from a Facebook data leak.
- 2. **Company Names Source:** A filtered subset of the public data from the [Quebec Enterprise Register](https://www.registreentreprises.gouv.qc.ca/RQAnonymeGR/GR/GR03/GR03A2_22A_PIU_RecupDonnPub_PC/FichierDonneesOuvertes.aspx).
-
- ### Training Procedure
-
- #### Preprocessing & Curation
-
- The dataset was carefully curated to improve model robustness and real-world performance.
-
- 1. **Data Augmentation (Person Names):** To ensure the model could handle various formats, the person-name data was augmented into a **50/25/25 split**:
-    - **50%** was formatted as `FirstName LastName`.
-    - **25%** was formatted as the unambiguous `LastName, FirstName`.
-    - **25%** was formatted as the ambiguous `LastName FirstName`.
-    - These examples were assigned the label `0` (Person).
-
- 2. **Company Data Curation:**
-    - The dataset was filtered to remove extremely long company names (over 75 characters) that often contained legal descriptions. Numbered companies were kept as a strong signal.
-    - These examples were assigned the label `1` (Company).
-
- 3. **Final Dataset:** The augmented person data and the curated company data were combined and thoroughly shuffled.
-
- #### Training Hyperparameters
-
- - **Framework:** Transformers `Trainer`
- - **Training regime:** `bf16`
- - **Epochs:** 3
- - **Batch Size:** 1024
- - **Optimizer:** AdamW
- - **Learning Rate:** `2e-5`
- - **Warmup Steps:** 250
- - **Evaluation Strategy:** Every `1000` steps
-
- ## Evaluation
-
- ### Metrics
-
- The model's performance is evaluated using **Accuracy**, a suitable metric for this well-balanced binary classification task.
-
- - **Accuracy:** What percentage of names (both persons and companies) did the model classify correctly?
-
- ### Results
-
- The final model was selected based on the highest accuracy achieved on the validation set during training, ensuring the saved model represents the point of peak performance before overfitting began.
-
- | Metric | Value |
- | :--- | :--- |
- | **eval_accuracy** | **99.36%** |
- | **eval_loss** | **0.0236** |
-
- This result demonstrates a high degree of accuracy and confidence on the unseen validation data.
 
  ---
+ library_name: transformers
+ license: apache-2.0
+ base_model: distilbert/distilbert-base-uncased
+ tags:
+ - generated_from_trainer
+ metrics:
+ - accuracy
+ - precision
+ - recall
+ - f1
+ model-index:
+ - name: distilbert-base-uncased-name-classifier
+   results: []
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->

+ # distilbert-base-uncased-name-classifier

+ This model is a fine-tuned version of [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) on an unknown dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.0230
+ - Accuracy: 0.9937
+ - Precision: 0.9983
+ - Recall: 0.9904
+ - F1: 0.9943

+ ## Model description

+ More information needed

+ ## Intended uses & limitations

+ More information needed

+ ## Training and evaluation data

+ More information needed

+ ## Training procedure

+ ### Training hyperparameters

+ The following hyperparameters were used during training:
+ - learning_rate: 2e-05
+ - train_batch_size: 256
+ - eval_batch_size: 256
+ - seed: 42
+ - optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_steps: 1000
+ - num_epochs: 1

+ ### Training results

+ | Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision | Recall | F1 |
+ |:-------------:|:------:|:-----:|:---------------:|:--------:|:---------:|:------:|:------:|
+ | 0.0397 | 0.0718 | 2000 | 0.0405 | 0.9885 | 0.9981 | 0.9812 | 0.9896 |
+ | 0.0324 | 0.1435 | 4000 | 0.0303 | 0.9914 | 0.9970 | 0.9875 | 0.9923 |
+ | 0.0310 | 0.2153 | 6000 | 0.0295 | 0.9914 | 0.9938 | 0.9907 | 0.9923 |
+ | 0.0295 | 0.2870 | 8000 | 0.0271 | 0.9924 | 0.9970 | 0.9894 | 0.9932 |
+ | 0.0275 | 0.3588 | 10000 | 0.0262 | 0.9926 | 0.9964 | 0.9904 | 0.9934 |
+ | 0.0281 | 0.4305 | 12000 | 0.0256 | 0.9930 | 0.9981 | 0.9893 | 0.9937 |
+ | 0.0244 | 0.5023 | 14000 | 0.0272 | 0.9926 | 0.9991 | 0.9876 | 0.9933 |
+ | 0.0229 | 0.5740 | 16000 | 0.0254 | 0.9931 | 0.9970 | 0.9907 | 0.9938 |
+ | 0.0264 | 0.6458 | 18000 | 0.0248 | 0.9932 | 0.9986 | 0.9892 | 0.9939 |
+ | 0.0258 | 0.7175 | 20000 | 0.0237 | 0.9934 | 0.9983 | 0.9899 | 0.9941 |
+ | 0.0236 | 0.7893 | 22000 | 0.0234 | 0.9936 | 0.9982 | 0.9903 | 0.9943 |
+ | 0.0253 | 0.8610 | 24000 | 0.0231 | 0.9936 | 0.9979 | 0.9907 | 0.9943 |
+ | 0.0248 | 0.9328 | 26000 | 0.0230 | 0.9937 | 0.9983 | 0.9904 | 0.9943 |

+ ### Framework versions

+ - Transformers 4.57.1
+ - Pytorch 2.9.0+cu128
+ - Datasets 4.4.1
+ - Tokenizers 0.22.1
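The headline metrics above can be cross-checked for internal consistency, since F1 is the harmonic mean of precision and recall. A quick sketch in plain Python, using the final evaluation values from the card:

```python
# Consistency check: F1 = harmonic mean of precision and recall.
# Values are the final evaluation results reported above.
precision = 0.9983
recall = 0.9904

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9943, matching the reported F1
```

The reconstructed value agrees with the reported F1 of 0.9943 to four decimal places.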
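The hyperparameters above (linear schedule, 1,000 warmup steps, one epoch) imply a learning rate that ramps up to 2e-05 and then decays linearly to zero. A minimal sketch of that shape, assuming roughly 27,900 total optimizer steps — an extrapolation from the results table (step 26000 at epoch 0.9328), not a figure the card reports:

```python
# Sketch of a linear schedule with warmup, mirroring
# lr_scheduler_type=linear and lr_scheduler_warmup_steps=1000.
BASE_LR = 2e-05
WARMUP_STEPS = 1000
TOTAL_STEPS = 27900  # assumed, extrapolated from the results table

def lr_at(step):
    if step < WARMUP_STEPS:
        # linear ramp from 0 to BASE_LR over the warmup steps
        return BASE_LR * step / WARMUP_STEPS
    # linear decay from BASE_LR down to 0 at TOTAL_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

print(lr_at(500))    # mid-warmup: 1e-05
print(lr_at(1000))   # peak: 2e-05
print(lr_at(27900))  # end of training: 0.0
```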
config.json CHANGED
@@ -1,24 +1,32 @@
- {
-   "activation": "gelu",
-   "architectures": [
-     "DistilBertForSequenceClassification"
-   ],
-   "attention_dropout": 0.1,
-   "dim": 768,
-   "dropout": 0.1,
-   "dtype": "float32",
-   "hidden_dim": 3072,
-   "initializer_range": 0.02,
-   "max_position_embeddings": 512,
-   "model_type": "distilbert",
-   "n_heads": 12,
-   "n_layers": 6,
-   "pad_token_id": 0,
-   "problem_type": "single_label_classification",
-   "qa_dropout": 0.1,
-   "seq_classif_dropout": 0.2,
-   "sinusoidal_pos_embds": false,
-   "tie_weights_": true,
-   "transformers_version": "4.57.0",
-   "vocab_size": 30522
- }
+ {
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertForSequenceClassification"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "dtype": "float32",
+   "hidden_dim": 3072,
+   "id2label": {
+     "0": "PERSON",
+     "1": "COMPANY"
+   },
+   "initializer_range": 0.02,
+   "label2id": {
+     "COMPANY": 1,
+     "PERSON": 0
+   },
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "pad_token_id": 0,
+   "problem_type": "single_label_classification",
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "transformers_version": "4.57.1",
+   "vocab_size": 30522
+ }
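The practical effect of the new `id2label`/`label2id` entries is that classification output now carries readable `PERSON`/`COMPANY` labels instead of generic `LABEL_0`/`LABEL_1`, so the manual label map in the old README is no longer needed. A minimal sketch of the decoding step the mapping feeds into (plain Python with hypothetical logits, not actual model output):

```python
import math

# How the id2label mapping added to config.json is applied when decoding:
# argmax over the logits picks the class id, id2label names it.
id2label = {0: "PERSON", 1: "COMPANY"}

def decode(logits):
    # softmax to turn hypothetical logits into a confidence score
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return id2label[idx], probs[idx]

label, score = decode([3.2, -1.1])  # illustrative values
print(label)  # PERSON
```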
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:6ca4522dedcf53d36cec8ed0593ccd247ec0e622af88668a670c7c9ec7ac541f
+ oid sha256:348d244e9a1f1cf3da7e6668e29924033194e06b0e80c4ab90588d0f11ca9bd9
  size 267832560
runs/Dec07_11-33-21_elesage-pc/events.out.tfevents.1765125303.elesage-pc.29575.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8b5a0e2404510a608b3b32be9698b6ed7831a120e8a9bd29d5f408d675246389
+ size 34031
runs/Dec07_11-48-14_elesage-pc/events.out.tfevents.1765126195.elesage-pc.37789.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:666026917b9ea4c7d4fd129ba1c37c5d36ddcf178e404ef8c0f1678fc506c260
+ size 13646
runs/Dec07_19-14-01_elesage-pc/events.out.tfevents.1765152955.elesage-pc.189935.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f9b96b3155e7d9bafc72ee809af26ae42326b917969ac3badae0fe239bbe62db
+ size 41158
special_tokens_map.json CHANGED
@@ -1,7 +1,7 @@
- {
-   "cls_token": "[CLS]",
-   "mask_token": "[MASK]",
-   "pad_token": "[PAD]",
-   "sep_token": "[SEP]",
-   "unk_token": "[UNK]"
- }
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json CHANGED
@@ -2,7 +2,7 @@
   "version": "1.0",
   "truncation": {
     "direction": "Right",
-    "max_length": 512,
+    "max_length": 120,
     "strategy": "LongestFirst",
     "stride": 0
   },
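The `max_length` drop from 512 to 120 is safe for this task: the old model card notes that company names were capped at 75 characters, WordPiece produces at most one token per character (an assumption that holds for plain text without byte-level expansion), and sequence classification adds only the `[CLS]` and `[SEP]` specials. A rough back-of-the-envelope check:

```python
# Rough worst-case token budget under the new truncation limit.
# Assumption: at most one WordPiece token per input character.
MAX_NAME_CHARS = 75   # company-name length cap from the old model card
SPECIAL_TOKENS = 2    # [CLS] + [SEP]
NEW_MAX_LENGTH = 120

worst_case = MAX_NAME_CHARS + SPECIAL_TOKENS
print(worst_case <= NEW_MAX_LENGTH)  # True: even the longest name fits
```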
tokenizer_config.json CHANGED
@@ -1,56 +1,56 @@
- {
-   "added_tokens_decoder": {
-     "0": {
-       "content": "[PAD]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "100": {
-       "content": "[UNK]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "101": {
-       "content": "[CLS]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "102": {
-       "content": "[SEP]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "103": {
-       "content": "[MASK]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     }
-   },
-   "clean_up_tokenization_spaces": false,
-   "cls_token": "[CLS]",
-   "do_lower_case": true,
-   "extra_special_tokens": {},
-   "mask_token": "[MASK]",
-   "model_max_length": 512,
-   "pad_token": "[PAD]",
-   "sep_token": "[SEP]",
-   "strip_accents": null,
-   "tokenize_chinese_chars": true,
-   "tokenizer_class": "DistilBertTokenizer",
-   "unk_token": "[UNK]"
- }
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "DistilBertTokenizer",
+   "unk_token": "[UNK]"
+ }
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b73d22c3da4e85db531a48df0663f48e08adecaba4884f83f1140e57adf1465e
- size 5841
+ oid sha256:4cc87171a4c10cd5c65823f0406598b1e6064512435b585c4d33fd491196ffc1
+ size 5905