# huBERT-fine-tuned-villain (Checkpoint 3708)
This model is a fine-tuned version of SZTAKI-HLT/hubert-base-cc designed to detect "Villain" rhetoric targeting specific opposition parties (MSZP & DK) in Hungarian political media.
## Model Description
- Model type: BERT-based binary text classification
- Language: Hungarian
- Finetuned from model: SZTAKI-HLT/hubert-base-cc
- Label Mapping:
  - Label 1 (Villain): The sentence employs polarizing rhetoric to delegitimize or demonize MSZP/DK.
  - Label 0 (Non-Villain): The sentence is neutral, factual, or does not contain the rhetoric defined below.
## Label Definitions: What Is "Villain" Rhetoric?
The model was trained to identify "Villain" rhetoric based on a codebook containing two primary components: Illegitimacy and Immorality.
### 1. Illegitimacy Component
Sentences that portray MSZP/DK as illegitimate political actors, specifically containing claims that:
- Communist Legacy: They are upholding the legacy of the communist regime, constitute a continuation of discredited pre-2010 governments, or undermine the constitutional order.
- Plotting Unrest: They are planning street violence, destabilization, or schemes to topple the government via extra-parliamentary means.
- Serving Foreign Interests: They cooperate with or are controlled by foreign powers, EU institutions, or international NGOs (e.g., Soros, IMF) to undermine the Hungarian state.
- Anti-National Interest: They actively work against the "general will" or interests of the nation (e.g., on migration or border security).
### 2. Immorality Component
Sentences that portray MSZP/DK as immoral actors, specifically containing claims of:
- Lack of Principles: Vilifying them as liars, hypocrites, or devoid of morals.
- Corruption: Alleging participation in corrupt practices, abuse of power, or malpractice (often referencing the pre-2010 era).
- Aggression/Slander: Depicting them as insulting, slandering, or threatening other politicians or society.
- Violating Norms: Showing them as disrespectful toward established public morals or religious values.
## Training Data
The training dataset consists of a stratified sample of online news articles published between 2010 and 2024.
- Sampling Strategy: Sentences were selected if they contained specific keywords related to MSZP & DK (party names, abbreviations, and leader surnames within their tenure).
- Label Generation (Supervised Distillation): The training labels were generated with a dedicated pipeline to ensure validity. First, a GPT-4o model was fine-tuned on 800 human-annotated sentences (https://huggingface.co/datasets/Politics/hungary-mszp-dk-gpt4o) to align strictly with the codebook definitions. This validated "Teacher" model (81% F1 score on a held-out set) then predicted labels for the larger unlabelled corpus, and these synthetic labels were used to train this efficient BERT model.
- Class Imbalance: The model was trained using a Weighted Cross Entropy Loss to account for the imbalance between villain and non-villain classes in the training set.
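The keyword-based sampling step described above can be sketched as follows. The keyword list here is illustrative only (party names, abbreviations, a leader surname); the actual list used for the study is not reproduced in this card.

```python
import re

# Illustrative keywords: party names, abbreviations, and leader surnames.
# NOT the study's actual keyword list.
KEYWORDS = ["MSZP", "DK", "Demokratikus Koalíció", "Gyurcsány"]

# Word-boundary pattern so short abbreviations like "DK" do not
# match inside unrelated words.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b")

def select_sentences(sentences):
    """Keep only sentences mentioning at least one target keyword."""
    return [s for s in sentences if pattern.search(s)]

sentences = [
    "A DK képviselője felszólalt a parlamentben.",  # mentions DK -> kept
    "Az időjárás ma napos lesz.",                    # no keyword -> dropped
]
print(select_sentences(sentences))
```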
## Training Procedure
### Hyperparameters
The model was trained using the transformers Trainer with the following hyperparameters:
- Learning Rate: 5e-5
- Train Batch Size: 32
- Eval Batch Size: 64
- Seed: 42
- Optimizer: AdamW
- Num Epochs: 10
- Max Sequence Length: 512
- Loss Function: Weighted Cross Entropy (to penalize minority class errors more heavily).
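The card does not publish the exact class weights, but a common way to derive them is inverse-frequency weighting. A minimal sketch (the label counts below are made up):

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights for weighted cross entropy.

    The minority class gets a larger weight; weights are normalized
    to sum to the number of classes.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    raw = {c: n / counts[c] for c in sorted(counts)}
    scale = k / sum(raw.values())
    return [raw[c] * scale for c in sorted(raw)]

# Toy imbalanced label set: 8 non-villain (0), 2 villain (1).
weights = class_weights([0] * 8 + [1] * 2)
print(weights)  # the minority class (1) gets the larger weight
# Such weights would typically be passed to
# torch.nn.CrossEntropyLoss(weight=torch.tensor(weights)).
```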
## Evaluation Results
Evaluation was performed on a held-out ground truth dataset (Politics/hungary-mszp-dk-heldout) consisting of human-labeled sentences that were not seen by the Teacher GPT model or the Student BERT model during training.
Performance of Checkpoint 3708 (9th Checkpoint):
| Metric | Score | Note |
|---|---|---|
| Accuracy | 90.1% | Overall correct classification rate. |
| Positive Precision | 81.5% | When model predicts "Villain", it is correct 81.5% of the time. |
| Positive Recall | 68.3% | The model catches 68.3% of all actual "Villain" sentences. |
| Positive F1 | 74.3% | Harmonic mean of precision and recall for the target class. |
| Negative F1 | 93.9% | Performance on the non-villain class. |
Note: This checkpoint was selected for its high precision (81.5%), minimizing false positives, which is crucial for automated content analysis in social science research.
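For readers who want to relate such scores to raw confusion-matrix counts, a small helper illustrating the standard definitions (the counts below are toy values, not this model's actual confusion matrix):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts
    for the positive ("Villain") class."""
    precision = tp / (tp + fp)                          # of predicted positives, how many are correct
    recall = tp / (tp + fn)                             # of actual positives, how many are caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Toy counts, NOT the model's actual confusion matrix.
p, r, f1 = prf(tp=82, fp=18, fn=38)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```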
## How to use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "Politics/hungary-mszp-dk-villain"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# "The dollar-left has once again betrayed the homeland in Brussels."
text = "A dollárbaloldal ismét elárulta a hazát Brüsszelben."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
print(f"Prediction: {predicted_class_id}")
# Output: 1 (Villain) or 0 (Non-Villain)
```
## Limitations
- Domain Specificity: This model is trained on Hungarian political news from a specific context (2010-2024) regarding specific parties (MSZP/DK) and may not generalize well to other parties or domains.
- Error Propagation: The training data uses "silver" labels generated by a fine-tuned GPT-4o. Although the GPT-4o teacher was trained on human-annotated data, this model is essentially learning to approximate the teacher's application of the codebook, so any errors or biases in the teacher's predictions propagate to this BERT model.
- Human Labeling from English Translations: Although both the fine-tuned GPT-4o and the huBERT model operate on Hungarian sentences, the human annotation was performed on English translations, both for the ground-truth (held-out) set and for the GPT-4o fine-tuning set.
## Acknowledgments
I am deeply grateful to the University of Chicago Forum for Free Inquiry and Expression for their generous funding of this research.