# mmBERT Multilingual PII NER

A fine-tuned `jhu-clsp/mmBERT-base` model with a CRF layer for Personally Identifiable Information (PII) detection in multilingual dialogues across 11 languages.
## Model Description

This model performs token-level Named Entity Recognition (NER) to identify and classify PII entities in dialogue text. It was trained on synthetic multilingual conversational data annotated for de-identification.

- Architecture: mmBERT-base (ModernBERT) + CRF head
- Training: fine-tuned jointly on all 11 languages (multilingual training)
- Loss: CRF negative log-likelihood (see the `forward` implementation in the Usage section)
- Hyperparameters: lr=2e-05, batch_size=32, max_length=512, dropout=0.1, epochs=10
- Decoding: Viterbi decoding via the CRF layer (sketched below)
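Unlike a plain softmax head, the CRF scores entire label sequences, and Viterbi decoding returns the jointly most-likely tag path, which discourages invalid transitions such as an `I-` tag with no preceding `B-` tag. A minimal standalone sketch of the `pytorch-crf` calls used in the Usage section below (the label count of 39 is an assumption: 19 entity types × B-/I- plus O):

```python
import torch
from torchcrf import CRF

num_labels = 39  # assumed: 19 entity types x (B-, I-) + O
crf = CRF(num_labels, batch_first=True)

emissions = torch.randn(1, 12, num_labels)   # (batch, seq_len, num_labels)
mask = torch.ones(1, 12, dtype=torch.bool)
tags = torch.zeros(1, 12, dtype=torch.long)  # dummy gold labels

loss = -crf(emissions, tags, mask=mask, reduction="mean")  # training: CRF NLL
best_paths = crf.decode(emissions, mask=mask)              # inference: Viterbi paths
```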
## Supported Languages
| Code | Language |
|---|---|
| AR | Arabic |
| DE | German |
| EN | English |
| FI | Finnish |
| FR | French |
| HI | Hindi |
| IT | Italian |
| PL | Polish |
| PT | Portuguese |
| SP | Spanish |
| TR | Turkish |
## Entity Types

The model recognizes 19 PII entity types using BIO tagging (see the example after the table):
| Entity | Description |
|---|---|
| PERSON | Person names |
| PERSON_EMAIL | Email addresses |
| PERSON_SOCIAL_RELATION | Social relations (e.g., "my wife") |
| ORG | Organizations |
| LOC_CITY | Cities |
| LOC_COUNTRY | Countries |
| LOC_STREET | Street names |
| LOC_ZIP | ZIP/postal codes |
| LOC_HOUSENUMBER | House numbers |
| LOC_OTHER | Other locations |
| DATETIME | Dates and times |
| DATETIME_AGE | Ages |
| CODE | ID numbers, reference codes |
| CODE_PHONE | Phone numbers |
| CODE_URL | URLs |
| PROFESSION | Professions |
| PRODUCT | Product names |
| QUANTITY | Quantities |
| MISC | Miscellaneous PII |
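Under BIO tagging, the first token of an entity is labeled `B-<TYPE>`, continuation tokens `I-<TYPE>`, and everything else `O`. A word-level illustration (the model itself assigns labels to subword tokens):

```python
words  = ["My", "name", "is", "John",     "Smith",    "and", "I", "live", "in", "Berlin",     "."]
labels = ["O",  "O",    "O",  "B-PERSON", "I-PERSON", "O",   "O", "O",    "O",  "B-LOC_CITY", "O"]
```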
## Performance

Evaluated on held-out test sets per language (type-aware micro scores; the matching criteria are sketched after the table):
| Language | Lenient F1 | Lenient F2 | Exact F1 | Exact F2 |
|---|---|---|---|---|
| AR | 80.76 | 76.66 | 76.99 | 73.08 |
| DE | 91.66 | 90.71 | 90.54 | 89.60 |
| EN | 93.68 | 92.70 | 91.66 | 90.70 |
| FI | 87.70 | 86.77 | 85.65 | 84.73 |
| FR | 87.26 | 85.89 | 83.68 | 82.36 |
| HI | 84.94 | 82.91 | 81.26 | 79.31 |
| IT | 90.03 | 88.19 | 87.14 | 85.35 |
| PL | 89.33 | 89.45 | 86.17 | 86.29 |
| PT | 90.30 | 89.15 | 88.81 | 87.68 |
| SP | 91.39 | 90.76 | 89.62 | 89.00 |
| TR | 85.53 | 84.72 | 82.06 | 81.27 |
| AVG | 88.42 | 87.08 | 85.78 | 84.49 |
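The exact and lenient criteria are not spelled out above, so the sketch below assumes the common convention: exact matching requires identical span boundaries and entity type, while lenient matching also credits any predicted span that overlaps a gold span of the same type. Here `gold` and `pred` are hypothetical lists of `(start, end, type)` tuples; F2 (beta=2) weights recall over precision, which suits de-identification, where a missed entity is costlier than a false positive.

```python
def count_matches(gold, pred, lenient=False):
    """Assumed criteria: exact = same type + same boundaries;
    lenient = same type + any character overlap."""
    used, tp = set(), 0
    for ps, pe, pt in pred:
        for i, (gs, ge, gt) in enumerate(gold):
            if i in used or gt != pt:
                continue
            if (ps, pe) == (gs, ge) or (lenient and ps < ge and gs < pe):
                used.add(i)
                tp += 1
                break
    return tp

def f_beta(tp, n_pred, n_gold, beta=1.0):
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r) if p + r else 0.0
```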
## Usage

This model uses a custom CRF architecture and cannot be loaded directly with `AutoModelForTokenClassification`. Install the `pytorch-crf` package (`pip install pytorch-crf`), then use the custom `ModernBertCRF` class:
```python
import json

import torch
import torch.nn as nn
from huggingface_hub import snapshot_download
from torchcrf import CRF  # from the pytorch-crf package
from transformers import AutoModel, AutoTokenizer


class ModernBertCRF(nn.Module):
    def __init__(self, base_model_name, num_labels, id2label, label2id):
        super().__init__()
        self.num_labels = num_labels
        self.id2label = id2label
        self.label2id = label2id
        self.transformer = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.transformer.config.hidden_size
        self.classifier = nn.Linear(hidden_size, num_labels)
        self.dropout = nn.Dropout(0.1)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None, **kwargs):
        kwargs.pop("token_type_ids", None)  # ModernBERT does not use token type ids
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(outputs.last_hidden_state)
        emissions = self.classifier(sequence_output)
        if labels is not None:
            mask = attention_mask.bool()
            # The CRF cannot handle the -100 ignore index; map it to label 0
            labels_for_crf = labels.clone()
            labels_for_crf[labels_for_crf == -100] = 0
            loss = -self.crf(emissions, labels_for_crf, mask=mask, reduction="mean")
            return {"loss": loss, "logits": emissions}
        return {"logits": emissions}

    def decode(self, emissions, mask):
        return self.crf.decode(emissions, mask=mask)


# Load model: fetch the repo files (crf_config.json and the weights) locally
model_dir = snapshot_download("deryaerman/mmbert_multilingual_pii_ner")

with open(f"{model_dir}/crf_config.json") as f:
    config = json.load(f)

model = ModernBertCRF(
    base_model_name=config["base_model_name"],
    num_labels=config["num_labels"],
    id2label=config["id2label"],
    label2id=config["label2id"],
)
model.load_state_dict(torch.load(f"{model_dir}/pytorch_model.bin", map_location="cpu"))
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Inference
text = "My name is John Smith and I live in Berlin."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs.pop("token_type_ids", None)

with torch.no_grad():
    outputs = model(**inputs)

emissions = outputs["logits"]
mask = inputs["attention_mask"].bool()
predictions = model.decode(emissions, mask)  # list of label-id sequences (Viterbi)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, predictions[0]):
    label = config["id2label"][str(pred_id)]  # JSON keys are strings
    if label != "O":
        print(f"{token:20s} -> {label}")
```
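The loop above prints individual subword tokens. To recover readable entity mentions, the BIO labels can be merged back into character spans via the tokenizer's offset mapping; a sketch assuming a fast tokenizer (the merging logic is illustrative, not part of the released code):

```python
# Re-tokenize with offsets so predictions can be mapped back to the input text
enc = tokenizer(
    text, return_tensors="pt", truncation=True, max_length=512,
    return_offsets_mapping=True,
)
offsets = enc.pop("offset_mapping")[0].tolist()
enc.pop("token_type_ids", None)

with torch.no_grad():
    pred_ids = model.decode(model(**enc)["logits"], enc["attention_mask"].bool())[0]

spans, current = [], None
for (start, end), pid in zip(offsets, pred_ids):
    label = config["id2label"][str(pid)]
    if start == end:  # special tokens have empty offsets
        continue
    if label.startswith("B-"):
        if current:
            spans.append(current)
        current = [label[2:], start, end]  # open a new entity
    elif label.startswith("I-") and current and current[0] == label[2:]:
        current[2] = end  # extend the open entity
    else:
        if current:
            spans.append(current)
        current = None
if current:
    spans.append(current)

for ent_type, start, end in spans:
    print(f"{text[start:end]!r} -> {ent_type}")  # e.g. 'John Smith' -> PERSON
```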
## Training Data
The model was trained on synthetic multilingual dialogue data covering various domains (medical anamnesis, customer support, police reports, therapy sessions, etc.). The data was generated and annotated as part of a thesis project on multilingual PII de-identification.
## Limitations

- Trained on synthetic dialogue data; performance on real-world data may vary
- Optimized for dialogue/conversational text; may underperform on formal documents
- Arabic and Hindi show lower performance than the European languages
- Requires the `pytorch-crf` package for inference
## Citation

If you use this model, please cite:

```bibtex
@mastersthesis{erman2026multilingual,
  title={Multilingual De-Identification of Dialogue Data using Transformer-based NER},
  author={Erman, Derya},
  year={2026}
}
```