# mmBERT Multilingual PII NER

A fine-tuned `jhu-clsp/mmBERT-base` model with a CRF layer for Personally Identifiable Information (PII) detection in multilingual dialogues across 11 languages.
## Model Description

This model performs token-level Named Entity Recognition (NER) to identify and classify PII entities in dialogue text. It was trained on synthetic multilingual conversational data annotated for de-identification.

- Architecture: mmBERT-base (ModernBERT) + CRF head
- Training: fine-tuned jointly on all 11 languages (multilingual training)
- Loss: CRF negative log-likelihood (see the `forward` implementation in the Usage section)
- Hyperparameters: lr=2e-05, batch_size=32, max_length=512, dropout=0.1, epochs=10
- Decoding: Viterbi decoding via the CRF layer (sketched below)
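Unlike a plain softmax head, the CRF scores entire label sequences, and Viterbi decoding returns the jointly most-likely tag path, which discourages invalid transitions such as an `I-` tag with no preceding `B-` tag. A minimal standalone sketch of the `pytorch-crf` calls used in the Usage section below (the label count of 39 is an assumption: 19 entity types × B-/I- plus O):

```python
import torch
from torchcrf import CRF

num_labels = 39  # assumed: 19 entity types x (B-, I-) + O
crf = CRF(num_labels, batch_first=True)

emissions = torch.randn(1, 12, num_labels)   # (batch, seq_len, num_labels)
mask = torch.ones(1, 12, dtype=torch.bool)
tags = torch.zeros(1, 12, dtype=torch.long)  # dummy gold labels

loss = -crf(emissions, tags, mask=mask, reduction="mean")  # training: CRF NLL
best_paths = crf.decode(emissions, mask=mask)              # inference: Viterbi paths
```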
## Supported Languages
| Code | Language |
|---|---|
| AR | Arabic |
| DE | German |
| EN | English |
| FI | Finnish |
| FR | French |
| HI | Hindi |
| IT | Italian |
| PL | Polish |
| PT | Portuguese |
| SP | Spanish |
| TR | Turkish |
## Entity Types

The model recognizes 19 PII entity types using BIO tagging (see the example after the table):
| Entity | Description |
|---|---|
| PERSON | Person names |
| PERSON_EMAIL | Email addresses |
| PERSON_SOCIAL_RELATION | Social relations (e.g., "my wife") |
| ORG | Organizations |
| LOC_CITY | Cities |
| LOC_COUNTRY | Countries |
| LOC_STREET | Street names |
| LOC_ZIP | ZIP/postal codes |
| LOC_HOUSENUMBER | House numbers |
| LOC_OTHER | Other locations |
| DATETIME | Dates and times |
| DATETIME_AGE | Ages |
| CODE | ID numbers, reference codes |
| CODE_PHONE | Phone numbers |
| CODE_URL | URLs |
| PROFESSION | Professions |
| PRODUCT | Product names |
| QUANTITY | Quantities |
| MISC | Miscellaneous PII |
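Under BIO tagging, the first token of an entity is labeled `B-<TYPE>`, continuation tokens `I-<TYPE>`, and everything else `O`. A word-level illustration (the model itself assigns labels to subword tokens):

```python
words  = ["My", "name", "is", "John",     "Smith",    "and", "I", "live", "in", "Berlin",     "."]
labels = ["O",  "O",    "O",  "B-PERSON", "I-PERSON", "O",   "O", "O",    "O",  "B-LOC_CITY", "O"]
```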
## Performance

Evaluated on held-out test sets per language (type-aware micro scores; the matching criteria are sketched after the table):
| Language | Lenient F1 | Lenient F2 | Exact F1 | Exact F2 |
|---|---|---|---|---|
| AR | 80.76 | 76.66 | 76.99 | 73.08 |
| DE | 91.66 | 90.71 | 90.54 | 89.60 |
| EN | 93.68 | 92.70 | 91.66 | 90.70 |
| FI | 87.70 | 86.77 | 85.65 | 84.73 |
| FR | 87.26 | 85.89 | 83.68 | 82.36 |
| HI | 84.94 | 82.91 | 81.26 | 79.31 |
| IT | 90.03 | 88.19 | 87.14 | 85.35 |
| PL | 89.33 | 89.45 | 86.17 | 86.29 |
| PT | 90.30 | 89.15 | 88.81 | 87.68 |
| SP | 91.39 | 90.76 | 89.62 | 89.00 |
| TR | 85.53 | 84.72 | 82.06 | 81.27 |
| AVG | 88.42 | 87.08 | 85.78 | 84.49 |
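The exact and lenient criteria are not spelled out above, so the sketch below assumes the common convention: exact matching requires identical span boundaries and entity type, while lenient matching also credits any predicted span that overlaps a gold span of the same type. Here `gold` and `pred` are hypothetical lists of `(start, end, type)` tuples; F2 (beta=2) weights recall over precision, which suits de-identification, where a missed entity is costlier than a false positive.

```python
def count_matches(gold, pred, lenient=False):
    """Assumed criteria: exact = same type + same boundaries;
    lenient = same type + any character overlap."""
    used, tp = set(), 0
    for ps, pe, pt in pred:
        for i, (gs, ge, gt) in enumerate(gold):
            if i in used or gt != pt:
                continue
            if (ps, pe) == (gs, ge) or (lenient and ps < ge and gs < pe):
                used.add(i)
                tp += 1
                break
    return tp

def f_beta(tp, n_pred, n_gold, beta=1.0):
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r) if p + r else 0.0
```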
## Usage

This model uses a custom CRF architecture and cannot be loaded directly with `AutoModelForTokenClassification`. Install the `pytorch-crf` package (`pip install pytorch-crf`), then use the custom `ModernBertCRF` class:
```python
import json

import torch
import torch.nn as nn
from huggingface_hub import snapshot_download
from torchcrf import CRF  # from the pytorch-crf package
from transformers import AutoModel, AutoTokenizer


class ModernBertCRF(nn.Module):
    def __init__(self, base_model_name, num_labels, id2label, label2id):
        super().__init__()
        self.num_labels = num_labels
        self.id2label = id2label
        self.label2id = label2id
        self.transformer = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.transformer.config.hidden_size
        self.classifier = nn.Linear(hidden_size, num_labels)
        self.dropout = nn.Dropout(0.1)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None, **kwargs):
        kwargs.pop("token_type_ids", None)  # ModernBERT does not use token type ids
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(outputs.last_hidden_state)
        emissions = self.classifier(sequence_output)
        if labels is not None:
            mask = attention_mask.bool()
            # The CRF cannot handle the -100 ignore index; map it to label 0
            labels_for_crf = labels.clone()
            labels_for_crf[labels_for_crf == -100] = 0
            loss = -self.crf(emissions, labels_for_crf, mask=mask, reduction="mean")
            return {"loss": loss, "logits": emissions}
        return {"logits": emissions}

    def decode(self, emissions, mask):
        return self.crf.decode(emissions, mask=mask)


# Load model: fetch the repo files (crf_config.json and the weights) locally
model_dir = snapshot_download("deryaerman/mmbert_multilingual_pii_ner")

with open(f"{model_dir}/crf_config.json") as f:
    config = json.load(f)

model = ModernBertCRF(
    base_model_name=config["base_model_name"],
    num_labels=config["num_labels"],
    id2label=config["id2label"],
    label2id=config["label2id"],
)
model.load_state_dict(torch.load(f"{model_dir}/pytorch_model.bin", map_location="cpu"))
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Inference
text = "My name is John Smith and I live in Berlin."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs.pop("token_type_ids", None)

with torch.no_grad():
    outputs = model(**inputs)

emissions = outputs["logits"]
mask = inputs["attention_mask"].bool()
predictions = model.decode(emissions, mask)  # list of label-id sequences (Viterbi)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, predictions[0]):
    label = config["id2label"][str(pred_id)]  # JSON keys are strings
    if label != "O":
        print(f"{token:20s} -> {label}")
```
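The loop above prints individual subword tokens. To recover readable entity mentions, the BIO labels can be merged back into character spans via the tokenizer's offset mapping; a sketch assuming a fast tokenizer (the merging logic is illustrative, not part of the released code):

```python
# Re-tokenize with offsets so predictions can be mapped back to the input text
enc = tokenizer(
    text, return_tensors="pt", truncation=True, max_length=512,
    return_offsets_mapping=True,
)
offsets = enc.pop("offset_mapping")[0].tolist()
enc.pop("token_type_ids", None)

with torch.no_grad():
    pred_ids = model.decode(model(**enc)["logits"], enc["attention_mask"].bool())[0]

spans, current = [], None
for (start, end), pid in zip(offsets, pred_ids):
    label = config["id2label"][str(pid)]
    if start == end:  # special tokens have empty offsets
        continue
    if label.startswith("B-"):
        if current:
            spans.append(current)
        current = [label[2:], start, end]  # open a new entity
    elif label.startswith("I-") and current and current[0] == label[2:]:
        current[2] = end  # extend the open entity
    else:
        if current:
            spans.append(current)
        current = None
if current:
    spans.append(current)

for ent_type, start, end in spans:
    print(f"{text[start:end]!r} -> {ent_type}")  # e.g. 'John Smith' -> PERSON
```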
## Training Data
The model was trained on synthetic multilingual dialogue data covering various domains (medical anamnesis, customer support, police reports, therapy sessions, etc.). The data was generated and annotated as part of a thesis project on multilingual PII de-identification.
## Limitations

- Trained on synthetic dialogue data; performance on real-world data may vary
- Optimized for dialogue/conversational text; may underperform on formal documents
- Arabic and Hindi show lower performance than the European languages
- Requires the `pytorch-crf` package for inference
## Citation

If you use this model, please cite:

```bibtex
@mastersthesis{erman2026multilingual,
  title={Multilingual De-Identification of Dialogue Data using Transformer-based NER},
  author={Erman, Derya},
  year={2026}
}
```