ele-sage committed on
Commit 371f899 · verified · 1 Parent(s): d0da7d1

distilbert-base-uncased-name-classifier
README.md CHANGED
@@ -1,147 +1,80 @@
  ---
- license: mit
- language:
- - en
- - fr
- base_model:
- - distilbert/distilbert-base-uncased
- datasets:
- - ele-sage/person-company-names-classification
  ---

- # Model Card for ele-sage/distilbert-base-uncased-name-classifier

- This model is a high-performance binary text classifier, fine-tuned from `distilbert-base-uncased`. Its purpose is to distinguish between a **person's name** and a **company/organization name** with high accuracy.

- This version has been specifically trained on an augmented dataset to be robust to various name formats, including `FirstName LastName`, `LastName, FirstName`, and the ambiguous `LastName FirstName`.

- ## Model Details

- ### Model Description

- - **Developed by:** ele-sage
- - **Model type:** `distilbert-for-sequence-classification`
- - **Language(s) (NLP):** English, French
- - **License:** MIT
- - **Finetuned from model:** `distilbert-base-uncased`

- ## Uses

- ### Direct Use

- This model is intended for text classification. Given a string, it returns a label indicating whether the string is a `Person` or a `Company`.

- ```python
- from transformers import pipeline

- # Load the classification pipeline
- classifier = pipeline("text-classification", model="ele-sage/distilbert-base-uncased-name-classifier")

- # The model was trained with LABEL_0 = Person, LABEL_1 = Company
- texts = [
-     "Satya Nadella",
-     "Global Innovations Inc.",
-     "Martinez, Alonso",  # Now correctly handled
- ]
- results = classifier(texts)

- # For clarity, map the labels (pipeline results contain only 'label' and 'score')
- label_map = {"LABEL_0": "Person", "LABEL_1": "Company"}
- for text, result in zip(texts, results):
-     print(f"Text: '{text}', Prediction: {label_map.get(result['label'])}, Score: {result['score']:.4f}")
- ```

- ### Downstream Use

- This model is a key component of a two-stage name-processing pipeline. It is designed to act as a fast, efficient "gatekeeper" that first identifies person names before passing them to a more complex parsing model, such as `ele-sage/distilbert-base-uncased-name-splitter`.

- ### Out-of-Scope Use

- - This model is not a general-purpose classifier. It is highly specialized for distinguishing persons from companies and will not perform well on other classification tasks (e.g., sentiment analysis).
- - It does **not** split or parse names. It only classifies the entire string.
-
- ## Bias, Risks, and Limitations
-
- - **Geographic & Cultural Bias:** The training data is heavily biased towards North American (Canadian) person names and Quebec-based company names. The model will be less accurate when classifying names from other cultural or geographic origins.
- - **Ambiguity:** Certain names can legitimately be both a person's name and a company's name (e.g., "Ford"). In these cases, the model makes a statistical guess based on its training data, which may not always align with the specific context.
- - **Data Source:** The person-name data is derived from a Facebook data leak and contains noise. While a rigorous cleaning process was applied, the model may have learned from some spurious data.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- ```python
- from transformers import pipeline
-
- # Load the pipeline
- classifier = pipeline("text-classification", model="ele-sage/distilbert-base-uncased-name-classifier")
-
- # Define a mapping for the labels to make the output readable
- label_map = {"LABEL_0": "Person", "LABEL_1": "Company"}
-
- def classify_name(name_str):
-     result = classifier(name_str)[0]
-     return label_map.get(result['label']), result['score']
-
- # --- Examples ---
- print(f"'Alonso Sarmiento Martinez' -> {classify_name('Alonso Sarmiento Martinez')}")
- print(f"'Microsoft Inc.' -> {classify_name('Microsoft Inc.')}")
- print(f"'Schwehr, Dave' -> {classify_name('Schwehr, Dave')}")
- print(f"'Ford' -> {classify_name('Ford')} (An ambiguous case)")
- ```
-
- ## Training Details
-
- ### Training Data
-
- The model was trained on the `ele-sage/person-company-names-classification` dataset, a custom-curated and balanced dataset of **7,892,165 examples** constructed from two primary sources:
-
- 1. **Person Names Source:** An AI-cleaned subset of a large CSV file of Canadian names, originally from a Facebook data leak.
- 2. **Company Names Source:** A filtered subset of the public data from the [Quebec Enterprise Register](https://www.registreentreprises.gouv.qc.ca/RQAnonymeGR/GR/GR03/GR03A2_22A_PIU_RecupDonnPub_PC/FichierDonneesOuvertes.aspx).
-
- ### Training Procedure
-
- #### Preprocessing & Curation
-
- The dataset was carefully curated to improve model robustness and real-world performance.
-
- 1. **Data Augmentation (Person Names):** To ensure the model could handle various formats, the person-name data was augmented into a **50/25/25 split**:
-    - **50%** was formatted as `FirstName LastName`.
-    - **25%** was formatted as the unambiguous `LastName, FirstName`.
-    - **25%** was formatted as the ambiguous `LastName FirstName`.
-    - These examples were assigned the label `0` (Person).
-
- 2. **Company Data Curation:**
-    - The dataset was filtered to remove extremely long company names (over 75 characters) that often contained legal descriptions. Numbered companies were kept as a strong signal.
-    - These examples were assigned the label `1` (Company).
-
- 3. **Final Dataset:** The augmented person data and the curated company data were combined and thoroughly shuffled.
-
- #### Training Hyperparameters
-
- - **Framework:** Transformers `Trainer`
- - **Training regime:** `bf16`
- - **Epochs:** 3
- - **Batch Size:** 1024
- - **Optimizer:** AdamW
- - **Learning Rate:** `2e-5`
- - **Warmup Steps:** 250
- - **Evaluation Strategy:** Every `1000` steps
-
- ## Evaluation
-
- ### Metrics
-
- The model's performance is evaluated using **Accuracy**, a suitable metric for this well-balanced binary classification task.
-
- - **Accuracy:** What percentage of names (both persons and companies) did the model classify correctly?
-
- ### Results
-
- The final model was selected based on the highest accuracy achieved on the validation set during training, ensuring the saved model represents the point of peak performance before overfitting began.
-
- | Metric | Value |
- | :--- | :--- |
- | **eval_accuracy** | **99.36%** |
- | **eval_loss** | **0.0236** |
-
- This result demonstrates a high degree of accuracy and confidence on the unseen validation data.
 
  ---
+ library_name: transformers
+ license: apache-2.0
+ base_model: distilbert/distilbert-base-uncased
+ tags:
+ - generated_from_trainer
+ metrics:
+ - accuracy
+ - precision
+ - recall
+ - f1
+ model-index:
+ - name: distilbert-base-uncased-name-classifier
+   results: []
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->

+ # distilbert-base-uncased-name-classifier

+ This model is a fine-tuned version of [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) on an unknown dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.0230
+ - Accuracy: 0.9937
+ - Precision: 0.9983
+ - Recall: 0.9904
+ - F1: 0.9943

+ ## Model description

+ More information needed

+ ## Intended uses & limitations

+ More information needed

+ ## Training and evaluation data

+ More information needed

+ ## Training procedure

+ ### Training hyperparameters

+ The following hyperparameters were used during training:
+ - learning_rate: 2e-05
+ - train_batch_size: 256
+ - eval_batch_size: 256
+ - seed: 42
+ - optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_steps: 1000
+ - num_epochs: 1

+ ### Training results

+ | Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision | Recall | F1 |
+ |:-------------:|:------:|:-----:|:---------------:|:--------:|:---------:|:------:|:------:|
+ | 0.0397 | 0.0718 | 2000 | 0.0405 | 0.9885 | 0.9981 | 0.9812 | 0.9896 |
+ | 0.0324 | 0.1435 | 4000 | 0.0303 | 0.9914 | 0.9970 | 0.9875 | 0.9923 |
+ | 0.0310 | 0.2153 | 6000 | 0.0295 | 0.9914 | 0.9938 | 0.9907 | 0.9923 |
+ | 0.0295 | 0.2870 | 8000 | 0.0271 | 0.9924 | 0.9970 | 0.9894 | 0.9932 |
+ | 0.0275 | 0.3588 | 10000 | 0.0262 | 0.9926 | 0.9964 | 0.9904 | 0.9934 |
+ | 0.0281 | 0.4305 | 12000 | 0.0256 | 0.9930 | 0.9981 | 0.9893 | 0.9937 |
+ | 0.0244 | 0.5023 | 14000 | 0.0272 | 0.9926 | 0.9991 | 0.9876 | 0.9933 |
+ | 0.0229 | 0.5740 | 16000 | 0.0254 | 0.9931 | 0.9970 | 0.9907 | 0.9938 |
+ | 0.0264 | 0.6458 | 18000 | 0.0248 | 0.9932 | 0.9986 | 0.9892 | 0.9939 |
+ | 0.0258 | 0.7175 | 20000 | 0.0237 | 0.9934 | 0.9983 | 0.9899 | 0.9941 |
+ | 0.0236 | 0.7893 | 22000 | 0.0234 | 0.9936 | 0.9982 | 0.9903 | 0.9943 |
+ | 0.0253 | 0.8610 | 24000 | 0.0231 | 0.9936 | 0.9979 | 0.9907 | 0.9943 |
+ | 0.0248 | 0.9328 | 26000 | 0.0230 | 0.9937 | 0.9983 | 0.9904 | 0.9943 |

+ ### Framework versions

+ - Transformers 4.57.1
+ - Pytorch 2.9.0+cu128
+ - Datasets 4.4.1
+ - Tokenizers 0.22.1
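The headline metrics above can be cross-checked for internal consistency, since F1 is the harmonic mean of precision and recall. A quick sketch in plain Python, using the final evaluation values from the card:

```python
# Consistency check: F1 = harmonic mean of precision and recall.
# Values are the final evaluation results reported above.
precision = 0.9983
recall = 0.9904

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9943, matching the reported F1
```

The reconstructed value agrees with the reported F1 of 0.9943 to four decimal places.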
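The hyperparameters above (linear schedule, 1,000 warmup steps, one epoch) imply a learning rate that ramps up to 2e-05 and then decays linearly to zero. A minimal sketch of that shape, assuming roughly 27,900 total optimizer steps — an extrapolation from the results table (step 26000 at epoch 0.9328), not a figure the card reports:

```python
# Sketch of a linear schedule with warmup, mirroring
# lr_scheduler_type=linear and lr_scheduler_warmup_steps=1000.
BASE_LR = 2e-05
WARMUP_STEPS = 1000
TOTAL_STEPS = 27900  # assumed, extrapolated from the results table

def lr_at(step):
    if step < WARMUP_STEPS:
        # linear ramp from 0 to BASE_LR over the warmup steps
        return BASE_LR * step / WARMUP_STEPS
    # linear decay from BASE_LR down to 0 at TOTAL_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

print(lr_at(500))    # mid-warmup: 1e-05
print(lr_at(1000))   # peak: 2e-05
print(lr_at(27900))  # end of training: 0.0
```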
config.json CHANGED
@@ -1,24 +1,32 @@
- {
-   "activation": "gelu",
-   "architectures": [
-     "DistilBertForSequenceClassification"
-   ],
-   "attention_dropout": 0.1,
-   "dim": 768,
-   "dropout": 0.1,
-   "dtype": "float32",
-   "hidden_dim": 3072,
-   "initializer_range": 0.02,
-   "max_position_embeddings": 512,
-   "model_type": "distilbert",
-   "n_heads": 12,
-   "n_layers": 6,
-   "pad_token_id": 0,
-   "problem_type": "single_label_classification",
-   "qa_dropout": 0.1,
-   "seq_classif_dropout": 0.2,
-   "sinusoidal_pos_embds": false,
-   "tie_weights_": true,
-   "transformers_version": "4.57.0",
-   "vocab_size": 30522
- }
+ {
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertForSequenceClassification"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "dtype": "float32",
+   "hidden_dim": 3072,
+   "id2label": {
+     "0": "PERSON",
+     "1": "COMPANY"
+   },
+   "initializer_range": 0.02,
+   "label2id": {
+     "COMPANY": 1,
+     "PERSON": 0
+   },
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "pad_token_id": 0,
+   "problem_type": "single_label_classification",
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "transformers_version": "4.57.1",
+   "vocab_size": 30522
+ }
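The practical effect of the new `id2label`/`label2id` entries is that classification output now carries readable `PERSON`/`COMPANY` labels instead of generic `LABEL_0`/`LABEL_1`, so the manual label map in the old README is no longer needed. A minimal sketch of the decoding step the mapping feeds into (plain Python with hypothetical logits, not actual model output):

```python
import math

# How the id2label mapping added to config.json is applied when decoding:
# argmax over the logits picks the class id, id2label names it.
id2label = {0: "PERSON", 1: "COMPANY"}

def decode(logits):
    # softmax to turn hypothetical logits into a confidence score
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return id2label[idx], probs[idx]

label, score = decode([3.2, -1.1])  # illustrative values
print(label)  # PERSON
```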
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:6ca4522dedcf53d36cec8ed0593ccd247ec0e622af88668a670c7c9ec7ac541f
+ oid sha256:348d244e9a1f1cf3da7e6668e29924033194e06b0e80c4ab90588d0f11ca9bd9
  size 267832560
runs/Dec07_11-33-21_elesage-pc/events.out.tfevents.1765125303.elesage-pc.29575.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8b5a0e2404510a608b3b32be9698b6ed7831a120e8a9bd29d5f408d675246389
+ size 34031
runs/Dec07_11-48-14_elesage-pc/events.out.tfevents.1765126195.elesage-pc.37789.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:666026917b9ea4c7d4fd129ba1c37c5d36ddcf178e404ef8c0f1678fc506c260
+ size 13646
runs/Dec07_19-14-01_elesage-pc/events.out.tfevents.1765152955.elesage-pc.189935.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f9b96b3155e7d9bafc72ee809af26ae42326b917969ac3badae0fe239bbe62db
+ size 41158
special_tokens_map.json CHANGED
@@ -1,7 +1,7 @@
- {
-   "cls_token": "[CLS]",
-   "mask_token": "[MASK]",
-   "pad_token": "[PAD]",
-   "sep_token": "[SEP]",
-   "unk_token": "[UNK]"
- }
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json CHANGED
@@ -2,7 +2,7 @@
   "version": "1.0",
   "truncation": {
     "direction": "Right",
-    "max_length": 512,
+    "max_length": 120,
     "strategy": "LongestFirst",
     "stride": 0
   },
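The `max_length` drop from 512 to 120 is safe for this task: the old model card notes that company names were capped at 75 characters, WordPiece produces at most one token per character (an assumption that holds for plain text without byte-level expansion), and sequence classification adds only the `[CLS]` and `[SEP]` specials. A rough back-of-the-envelope check:

```python
# Rough worst-case token budget under the new truncation limit.
# Assumption: at most one WordPiece token per input character.
MAX_NAME_CHARS = 75   # company-name length cap from the old model card
SPECIAL_TOKENS = 2    # [CLS] + [SEP]
NEW_MAX_LENGTH = 120

worst_case = MAX_NAME_CHARS + SPECIAL_TOKENS
print(worst_case <= NEW_MAX_LENGTH)  # True: even the longest name fits
```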
tokenizer_config.json CHANGED
@@ -1,56 +1,56 @@
- {
-   "added_tokens_decoder": {
-     "0": {
-       "content": "[PAD]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "100": {
-       "content": "[UNK]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "101": {
-       "content": "[CLS]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "102": {
-       "content": "[SEP]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "103": {
-       "content": "[MASK]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     }
-   },
-   "clean_up_tokenization_spaces": false,
-   "cls_token": "[CLS]",
-   "do_lower_case": true,
-   "extra_special_tokens": {},
-   "mask_token": "[MASK]",
-   "model_max_length": 512,
-   "pad_token": "[PAD]",
-   "sep_token": "[SEP]",
-   "strip_accents": null,
-   "tokenize_chinese_chars": true,
-   "tokenizer_class": "DistilBertTokenizer",
-   "unk_token": "[UNK]"
- }
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "DistilBertTokenizer",
+   "unk_token": "[UNK]"
+ }
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b73d22c3da4e85db531a48df0663f48e08adecaba4884f83f1140e57adf1465e
- size 5841
+ oid sha256:4cc87171a4c10cd5c65823f0406598b1e6064512435b585c4d33fd491196ffc1
+ size 5905