mmBERT Multilingual PII NER

A fine-tuned jhu-clsp/mmBERT-base model with a CRF layer for Personally Identifiable Information (PII) detection in multilingual dialogues across 11 languages.

Model Description

This model performs token-level Named Entity Recognition (NER) to identify and classify PII entities in dialogue text. It was trained on synthetic multilingual conversational data annotated for de-identification.

  • Architecture: mmBERT-base (ModernBERT) + CRF head
  • Training: Fine-tuned on all 11 languages jointly (multilingual training)
  • Loss: CRF negative log-likelihood (computed in the forward pass shown under Usage)
  • Hyperparameters: lr=2e-05, batch_size=32, max_length=512, dropout=0.1, epochs=10 (see the training sketch below)
  • Decoding: Viterbi decoding via the CRF layer
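
A minimal sketch of how these hyperparameters fit together in a fine-tuning loop; the optimizer is an assumption (the card does not name one), and ModernBertCRF is the class defined under Usage below:

import torch
from torch.optim import AdamW  # optimizer choice is an assumption

# Hyperparameters from the list above
LR, BATCH_SIZE, MAX_LENGTH, EPOCHS = 2e-5, 32, 512, 10

def fine_tune(model, train_loader):
    # train_loader yields batches of BATCH_SIZE tokenized dialogues with
    # input_ids, attention_mask, and BIO label ids (padded to MAX_LENGTH)
    optimizer = AdamW(model.parameters(), lr=LR)
    model.train()
    for _ in range(EPOCHS):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(**batch)["loss"]  # CRF negative log-likelihood
            loss.backward()
            optimizer.step()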

Supported Languages

Code Language
AR Arabic
DE German
EN English
FI Finnish
FR French
HI Hindi
IT Italian
PL Polish
PT Portuguese
SP Spanish
TR Turkish

Entity Types

The model recognizes 19 PII entity types using BIO tagging (a sketch of the resulting tag set follows the table):

Entity Description
PERSON Person names
PERSON_EMAIL Email addresses
PERSON_SOCIAL_RELATION Social relations (e.g., "my wife")
ORG Organizations
LOC_CITY Cities
LOC_COUNTRY Countries
LOC_STREET Street names
LOC_ZIP ZIP/postal codes
LOC_HOUSENUMBER House numbers
LOC_OTHER Other locations
DATETIME Dates and times
DATETIME_AGE Ages
CODE ID numbers, reference codes
CODE_PHONE Phone numbers
CODE_URL URLs
PROFESSION Professions
PRODUCT Product names
QUANTITY Quantities
MISC Miscellaneous PII
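
BIO tagging gives each entity type a B- (begin) and an I- (inside) tag, plus a shared O tag for non-PII tokens, for 39 tags in total. A quick sketch of the tag-set construction (the authoritative label order lives in crf_config.json):

ENTITY_TYPES = [
    "PERSON", "PERSON_EMAIL", "PERSON_SOCIAL_RELATION", "ORG",
    "LOC_CITY", "LOC_COUNTRY", "LOC_STREET", "LOC_ZIP",
    "LOC_HOUSENUMBER", "LOC_OTHER", "DATETIME", "DATETIME_AGE",
    "CODE", "CODE_PHONE", "CODE_URL", "PROFESSION", "PRODUCT",
    "QUANTITY", "MISC",
]

# "O" plus B-/I- variants of each type: 1 + 2 * 19 = 39 tags
labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
assert len(labels) == 39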

Performance

Evaluated on held-out test sets per language (type-aware micro-averaged scores; "lenient" credits partial span overlap, "exact" requires exact boundary matches):

Language Lenient F1 Lenient F2 Exact F1 Exact F2
AR 80.76 76.66 76.99 73.08
DE 91.66 90.71 90.54 89.60
EN 93.68 92.70 91.66 90.70
FI 87.70 86.77 85.65 84.73
FR 87.26 85.89 83.68 82.36
HI 84.94 82.91 81.26 79.31
IT 90.03 88.19 87.14 85.35
PL 89.33 89.45 86.17 86.29
PT 90.30 89.15 88.81 87.68
SP 91.39 90.76 89.62 89.00
TR 85.53 84.72 82.06 81.27
AVG 88.42 87.08 85.78 84.49

Usage

This model uses a custom CRF head and cannot be loaded directly with AutoModelForTokenClassification. Use the ModernBertCRF class below; it depends on the pytorch-crf package (which provides the torchcrf module) and on huggingface_hub for fetching the weights:

import json

import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from torchcrf import CRF  # provided by the pytorch-crf package
from transformers import AutoModel, AutoTokenizer

class ModernBertCRF(nn.Module):
    def __init__(self, base_model_name, num_labels, id2label, label2id):
        super().__init__()
        self.num_labels = num_labels
        self.id2label = id2label
        self.label2id = label2id
        self.transformer = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.transformer.config.hidden_size
        self.classifier = nn.Linear(hidden_size, num_labels)
        self.dropout = nn.Dropout(0.1)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None, **kwargs):
        kwargs.pop("token_type_ids", None)
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(outputs.last_hidden_state)
        emissions = self.classifier(sequence_output)
        if labels is not None:
            mask = attention_mask.bool()
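            # torchcrf cannot handle the -100 ignore index produced by HF
            # data collators, so remap those positions to a valid label id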
            labels_for_crf = labels.clone()
            labels_for_crf[labels_for_crf == -100] = 0
            loss = -self.crf(emissions, labels_for_crf, mask=mask, reduction='mean')  # negate the CRF log-likelihood to get a loss
            return {"loss": loss, "logits": emissions}
        else:
            return {"logits": emissions}

    def decode(self, emissions, mask):
        return self.crf.decode(emissions, mask=mask)

# Download the weights and CRF config from the Hub
repo_id = "deryaerman/mmbert_multilingual_pii_ner"
config_path = hf_hub_download(repo_id=repo_id, filename="crf_config.json")
weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")

with open(config_path) as f:
    config = json.load(f)

model = ModernBertCRF(
    base_model_name=config["base_model_name"],
    num_labels=config["num_labels"],
    id2label=config["id2label"],
    label2id=config["label2id"],
)
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Inference
text = "My name is John Smith and I live in Berlin."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs.pop("token_type_ids", None)  # the model does not use token type ids

with torch.no_grad():
    outputs = model(**inputs)
    emissions = outputs["logits"]
    mask = inputs["attention_mask"].bool()
    predictions = model.decode(emissions, mask)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, predictions[0]):
    label = config["id2label"][str(pred_id)]
    if label != "O":
        print(f"{token:20s} -> {label}")

Training Data

The model was trained on synthetic multilingual dialogue data covering various domains (medical anamnesis, customer support, police reports, therapy sessions, etc.). The data was generated and annotated as part of a thesis project on multilingual PII de-identification.

Limitations

  • Trained on synthetic dialogue data; performance on real-world data may vary
  • Optimized for dialogue/conversational text; may underperform on formal documents
  • Arabic and Hindi show lower performance compared to European languages
  • Requires the pytorch-crf package for inference

Citation

If you use this model, please cite:

@mastersthesis{erman2026multilingual,
  title={Multilingual De-Identification of Dialogue Data using Transformer-based NER},
  author={Erman, Derya},
  year={2026}
}