distilbert-base-uncased-name-classifier
- README.md +61 -128
- config.json +32 -24
- model.safetensors +1 -1
- runs/Dec07_11-33-21_elesage-pc/events.out.tfevents.1765125303.elesage-pc.29575.0 +3 -0
- runs/Dec07_11-48-14_elesage-pc/events.out.tfevents.1765126195.elesage-pc.37789.0 +3 -0
- runs/Dec07_19-14-01_elesage-pc/events.out.tfevents.1765152955.elesage-pc.189935.0 +3 -0
- special_tokens_map.json +7 -7
- tokenizer.json +1 -1
- tokenizer_config.json +56 -56
- training_args.bin +2 -2
README.md
CHANGED
@@ -1,147 +1,80 @@

Removed:

---
…
---

## Model Details

- **Model type:** `distilbert-for-sequence-classification`
- **Language(s) (NLP):** English, French
- **License:** MIT
- **Finetuned from model:** `distilbert-base-uncased`
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="ele-sage/distilbert-base-uncased-name-classifier")

# Example inputs; pair each prediction with its input text so it can be printed
names = ["Alonso Sarmiento Martinez", "Microsoft Inc."]
results = [{"text": name, **classifier(name)[0]} for name in names]

label_map = {"LABEL_0": "Person", "LABEL_1": "Company"}
for result in results:
    print(f"Text: '{result['text']}', Prediction: {label_map.get(result['label'])}, Score: {result['score']:.4f}")
```
This model is a key component of a two-stage name processing pipeline. It is designed to be used as a fast, efficient "gatekeeper" to first identify person names before passing them to a more complex parsing model, such as `ele-sage/distilbert-base-uncased-name-splitter`.
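The gatekeeper flow described above can be sketched in plain Python. The `route_name` helper and the stub models below are hypothetical (the real splitter's output format is not documented here); they only illustrate how the two checkpoints would compose:

```python
def route_name(name, classify, split, threshold=0.5):
    """Classify first; only confident person names reach the splitter."""
    label, score = classify(name)
    if label == "Person" and score >= threshold:
        return {"type": "person", "parts": split(name), "score": score}
    return {"type": "company", "parts": None, "score": score}

# Stand-ins for the two fine-tuned checkpoints (assumed behavior, for illustration)
def stub_classifier(name):
    is_company = any(tok in name.lower() for tok in ("inc", "ltd", "corp"))
    return ("Company", 0.99) if is_company else ("Person", 0.97)

def stub_splitter(name):
    first, _, last = name.partition(" ")
    return {"first": first, "last": last}

print(route_name("Dave Schwehr", stub_classifier, stub_splitter))
print(route_name("Microsoft Inc.", stub_classifier, stub_splitter))
```

In the real pipeline the two stubs would be replaced by `pipeline(...)` calls to the classifier and splitter checkpoints.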
### Bias, Risks, and Limitations

- **Geographic & Cultural Bias:** The training data is heavily biased towards North American (Canadian) person names and Quebec-based company names. The model will be less accurate when classifying names from other cultural or geographic origins.
- **Ambiguity:** Certain names can legitimately be both a person's name and a company's name (e.g., "Ford"). In these cases, the model makes a statistical guess based on its training data, which may not always align with the specific context.
- **Data Source:** The person name data is derived from a Facebook data leak and contains noise. While a rigorous cleaning process was applied, the model may have learned from some spurious data.
## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import pipeline

# Load the pipeline
classifier = pipeline("text-classification", model="ele-sage/distilbert-base-uncased-name-classifier")

# Define a mapping for the labels to make the output readable
label_map = {"LABEL_0": "Person", "LABEL_1": "Company"}

def classify_name(name_str):
    result = classifier(name_str)[0]
    return label_map.get(result['label']), result['score']

# --- Examples ---
print(f"'Alonso Sarmiento Martinez' -> {classify_name('Alonso Sarmiento Martinez')}")
print(f"'Microsoft Inc.' -> {classify_name('Microsoft Inc.')}")
print(f"'Schwehr, Dave' -> {classify_name('Schwehr, Dave')}")
print(f"'Ford' -> {classify_name('Ford')} (An ambiguous case)")
```
## Training Details

### Training Data

The model was trained on the `ele-sage/person-company-names-classification` dataset, a custom-curated and balanced dataset of **7,892,165 examples** constructed from two primary sources:

1. **Person Names Source:** An AI-cleaned subset of a large CSV file of Canadian names, originally from a Facebook data leak.
2. **Company Names Source:** A filtered subset of the public data from the [Quebec Enterprise Register](https://www.registreentreprises.gouv.qc.ca/RQAnonymeGR/GR/GR03/GR03A2_22A_PIU_RecupDonnPub_PC/FichierDonneesOuvertes.aspx).

### Training Procedure

#### Preprocessing & Curation

The dataset was carefully curated to improve model robustness and real-world performance.

1. **Data Augmentation (Person Names):** To ensure the model could handle various formats, the person name data was augmented into a **50/25/25 split**:
   - **50%** was formatted as `FirstName LastName`.
   - **25%** was formatted as the unambiguous `LastName, FirstName`.
   - **25%** was formatted as the ambiguous `LastName FirstName`.
   - These examples were assigned the label `0` (Person).
2. **Company Data Curation:**
   - The dataset was filtered to remove extremely long company names (over 75 characters) that often contained legal descriptions. Numbered companies were kept as a strong signal.
   - These examples were assigned the label `1` (Company).
3. **Final Dataset:** The augmented person data and the curated company data were combined and thoroughly shuffled.
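The 50/25/25 augmentation above can be sketched as follows; the function name and the per-record random draw are assumptions, only the format ratios and labels come from the text:

```python
import random

def augment_person_name(first, last, rng):
    """Format one person record according to the 50/25/25 split."""
    r = rng.random()
    if r < 0.50:
        return f"{first} {last}"      # FirstName LastName
    if r < 0.75:
        return f"{last}, {first}"     # unambiguous LastName, FirstName
    return f"{last} {first}"          # ambiguous LastName FirstName

rng = random.Random(42)
# Each augmented person example gets label 0; companies keep label 1
examples = [(augment_person_name("Alonso", "Martinez", rng), 0) for _ in range(3)]
examples.append(("9123-4567 Quebec Inc.", 1))
rng.shuffle(examples)
for text, label in examples:
    print(label, text)
```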
#### Training Hyperparameters

- **Framework:** Transformers `Trainer`
- **Training regime:** `bf16`
- **Epochs:** 3
- **Batch Size:** 1024
- **Optimizer:** AdamW
- **Learning Rate:** `2e-5`
- **Warmup Steps:** 250
- **Evaluation Strategy:** Every `1000` steps
## Evaluation

### Metrics

The model's performance is evaluated using **Accuracy**, which is a suitable metric for this well-balanced, binary classification task.

- **Accuracy:** What percentage of names (both persons and companies) did the model classify correctly?

### Results

The final model was selected based on the highest accuracy achieved on the validation set during training. This ensures the saved model represents the point of peak performance before overfitting began.

| Metric | Value |
| :--- | :--- |
| **eval_accuracy** | **99.36%** |
| **eval_loss** | **0.0236** |

This result demonstrates a high degree of accuracy and confidence on the unseen validation data.
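The accuracy above (and the precision/recall/F1 tracked elsewhere in this commit) can be computed with a dependency-free helper; treating label `1` (Company) as the positive class is an assumption:

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```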
Added:

---
library_name: transformers
license: apache-2.0
base_model: distilbert/distilbert-base-uncased
tags:
- generated_from_trainer
metrics:
- accuracy
- precision
- recall
- f1
model-index:
- name: distilbert-base-uncased-name-classifier
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# distilbert-base-uncased-name-classifier

This model is a fine-tuned version of [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) on an unknown dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0230
- Accuracy: 0.9937
- Precision: 0.9983
- Recall: 0.9904
- F1: 0.9943

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 256
- eval_batch_size: 256
- seed: 42
- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 1

### Training results

| Training Loss | Epoch  | Step  | Validation Loss | Accuracy | Precision | Recall | F1     |
|:-------------:|:------:|:-----:|:---------------:|:--------:|:---------:|:------:|:------:|
| 0.0397        | 0.0718 | 2000  | 0.0405          | 0.9885   | 0.9981    | 0.9812 | 0.9896 |
| 0.0324        | 0.1435 | 4000  | 0.0303          | 0.9914   | 0.9970    | 0.9875 | 0.9923 |
| 0.031         | 0.2153 | 6000  | 0.0295          | 0.9914   | 0.9938    | 0.9907 | 0.9923 |
| 0.0295        | 0.2870 | 8000  | 0.0271          | 0.9924   | 0.9970    | 0.9894 | 0.9932 |
| 0.0275        | 0.3588 | 10000 | 0.0262          | 0.9926   | 0.9964    | 0.9904 | 0.9934 |
| 0.0281        | 0.4305 | 12000 | 0.0256          | 0.9930   | 0.9981    | 0.9893 | 0.9937 |
| 0.0244        | 0.5023 | 14000 | 0.0272          | 0.9926   | 0.9991    | 0.9876 | 0.9933 |
| 0.0229        | 0.5740 | 16000 | 0.0254          | 0.9931   | 0.9970    | 0.9907 | 0.9938 |
| 0.0264        | 0.6458 | 18000 | 0.0248          | 0.9932   | 0.9986    | 0.9892 | 0.9939 |
| 0.0258        | 0.7175 | 20000 | 0.0237          | 0.9934   | 0.9983    | 0.9899 | 0.9941 |
| 0.0236        | 0.7893 | 22000 | 0.0234          | 0.9936   | 0.9982    | 0.9903 | 0.9943 |
| 0.0253        | 0.8610 | 24000 | 0.0231          | 0.9936   | 0.9979    | 0.9907 | 0.9943 |
| 0.0248        | 0.9328 | 26000 | 0.0230          | 0.9937   | 0.9983    | 0.9904 | 0.9943 |

### Framework versions

- Transformers 4.57.1
- Pytorch 2.9.0+cu128
- Datasets 4.4.1
- Tokenizers 0.22.1
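The `linear` scheduler with 1000 warmup steps behaves as sketched below (standard warmup-then-linear-decay shape); `total_steps` is an illustrative value, not taken from the training log:

```python
def linear_warmup_lr(step, base_lr=2e-5, warmup_steps=1000, total_steps=27000):
    """LR ramps linearly up to base_lr over warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

print(linear_warmup_lr(0), linear_warmup_lr(500), linear_warmup_lr(1000), linear_warmup_lr(27000))
```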
config.json
CHANGED
@@ -1,24 +1,32 @@

Removed:

```json
{
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "dtype": "float32",
  "hidden_dim": 3072,
  …
}
```

Added:

```json
{
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "dtype": "float32",
  "hidden_dim": 3072,
  "id2label": {
    "0": "PERSON",
    "1": "COMPANY"
  },
  "initializer_range": 0.02,
  "label2id": {
    "COMPANY": 1,
    "PERSON": 0
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.57.1",
  "vocab_size": 30522
}
```
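With `id2label`/`label2id` present in the config, `pipeline("text-classification")` reports `PERSON`/`COMPANY` rather than `LABEL_0`/`LABEL_1`. A dependency-free sketch of the mapping these keys drive:

```python
# Mirrors the id2label / label2id entries added to config.json
id2label = {0: "PERSON", 1: "COMPANY"}
label2id = {"COMPANY": 1, "PERSON": 0}

def decode(logits):
    """Argmax over class logits, then map the index to its readable label."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]

print(decode([2.3, -1.1]))   # index 0 -> PERSON
print(decode([-0.4, 1.7]))   # index 1 -> COMPANY
```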
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:348d244e9a1f1cf3da7e6668e29924033194e06b0e80c4ab90588d0f11ca9bd9
 size 267832560
runs/Dec07_11-33-21_elesage-pc/events.out.tfevents.1765125303.elesage-pc.29575.0
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8b5a0e2404510a608b3b32be9698b6ed7831a120e8a9bd29d5f408d675246389
+size 34031
runs/Dec07_11-48-14_elesage-pc/events.out.tfevents.1765126195.elesage-pc.37789.0
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:666026917b9ea4c7d4fd129ba1c37c5d36ddcf178e404ef8c0f1678fc506c260
+size 13646
runs/Dec07_19-14-01_elesage-pc/events.out.tfevents.1765152955.elesage-pc.189935.0
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f9b96b3155e7d9bafc72ee809af26ae42326b917969ac3badae0fe239bbe62db
+size 41158
special_tokens_map.json
CHANGED
@@ -1,7 +1,7 @@

Removed:

```json
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
```

Added:

```json
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
```
tokenizer.json
CHANGED
@@ -2,7 +2,7 @@
   "version": "1.0",
   "truncation": {
     "direction": "Right",
-    "max_length":
+    "max_length": 120,
     "strategy": "LongestFirst",
     "stride": 0
   },
tokenizer_config.json
CHANGED
@@ -1,56 +1,56 @@

Removed:

```json
{
  "added_tokens_decoder": {
    "0":   { "content": "[PAD]",  "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "100": { "content": "[UNK]",  "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "101": { "content": "[CLS]",  "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "102": { "content": "[SEP]",  "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "103": { "content": "[MASK]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "DistilBertTokenizer",
  "unk_token": "[UNK]"
}
```

Added:

```json
{
  "added_tokens_decoder": {
    "0":   { "content": "[PAD]",  "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "100": { "content": "[UNK]",  "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "101": { "content": "[CLS]",  "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "102": { "content": "[SEP]",  "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "103": { "content": "[MASK]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "DistilBertTokenizer",
  "unk_token": "[UNK]"
}
```
training_args.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:4cc87171a4c10cd5c65823f0406598b1e6064512435b585c4d33fd491196ffc1
+size 5905