---
library_name: transformers
language:
- en
- fr
- it
- es
- ru
- uk
- tt
- ar
- hi
- ja
- zh
- he
- am
- de
license: openrail++
datasets:
- textdetox/multilingual_toxicity_dataset
metrics:
- f1
base_model:
- cardiffnlp/twitter-xlm-roberta-large-2022
pipeline_tag: text-classification
---

## Multilingual Toxicity Classifier for 15 Languages (2025)

This model is [cardiffnlp/twitter-xlm-roberta-large-2022](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-large-2022) fine-tuned for binary toxicity classification on our updated (2025) dataset [textdetox/multilingual_toxicity_dataset](https://huggingface.co/datasets/textdetox/multilingual_toxicity_dataset).

The model now covers 15 languages from various language families:

| Language  | Code | F1 Score |
|-----------|------|----------|
| English   | en   | 0.9071   |
| Russian   | ru   | 0.9022   |
| Ukrainian | uk   | 0.9075   |
| German    | de   | 0.6528   |
| Spanish   | es   | 0.7430   |
| Arabic    | ar   | 0.6207   |
| Amharic   | am   | 0.6676   |
| Hindi     | hi   | 0.7171   |
| Chinese   | zh   | 0.6483   |
| Italian   | it   | 0.7597   |
| French    | fr   | 0.9114   |
| Hinglish  | hin  | 0.7051   |
| Hebrew    | he   | 0.8911   |
| Japanese  | ja   | 0.8725   |
| Tatar     | tt   | 0.6542   |

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('textdetox/twitter-xlmr-toxicity-classifier')
model = AutoModelForSequenceClassification.from_pretrained('textdetox/twitter-xlmr-toxicity-classifier')

# Tokenize a single sentence into a tensor of input IDs
batch = tokenizer.encode("You are amazing!", return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    output = model(batch)

# idx 0 for neutral, idx 1 for toxic
predicted_idx = output.logits.argmax(dim=-1).item()
```

## Citation

The model was prepared for evaluation in the [TextDetox 2025 Shared Task](https://pan.webis.de/clef25/pan25-web/text-detoxification.html). A citation will be added soon.
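
## Interpreting the output

The snippet in "How to use" returns raw logits rather than probabilities. The following is a minimal sketch of one way to turn a pair of logits into a label and a toxicity probability, assuming the index order stated in this card (0 for neutral, 1 for toxic). The helper names `softmax` and `decode`, the `threshold` parameter, and the example logits are all illustrative assumptions and are not part of the model or of the `transformers` API. It uses only the standard library, so it runs without downloading the model.

```python
import math

# Label order stated in the model card: index 0 = neutral, index 1 = toxic.
LABELS = ("neutral", "toxic")


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, threshold=0.5):
    """Return (label, toxic_probability) for one pair of logits.

    `threshold` is a hypothetical knob, not something the model card defines.
    """
    toxic_prob = softmax(logits)[1]
    label = LABELS[1] if toxic_prob >= threshold else LABELS[0]
    return label, toxic_prob


# Made-up logits for illustration only; real values could be taken from
# output.logits[0].tolist() in the "How to use" snippet.
print(decode([2.0, -1.0]))
print(decode([-0.5, 1.5]))
```

With PyTorch, the same probabilities can be obtained directly via `torch.softmax(output.logits, dim=-1)`. Raising the threshold above 0.5 trades recall for precision on the toxic class, which may be worth considering for languages with lower F1 scores in the table above.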