Summary

fineweb-edu-topic-classifier is a fine-tuned ModernBERT-base model for multi-label subject classification of educational web text. Given a passage of text, it predicts which of 17 academic and professional subject categories apply.

Model Details

| Property | Value |
|---|---|
| Base model | answerdotai/ModernBERT-base |
| Architecture | ModernBertForSequenceClassification |
| Task | Multi-label classification |
| Number of labels | 17 |
| Max input length | 512 tokens |
| Hidden size | 768 |
| Attention heads | 12 |
| Transformer layers | 22 (alternating full and sliding-window attention) |
| Pooling | Mean pooling |
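
The classifier pools the 768-dimensional token embeddings by mean pooling over non-padding positions rather than taking a single [CLS] vector. A minimal sketch of that operation in plain Python (the actual model does this with masked tensor ops on batches):

```python
def mean_pool(token_embeddings, attention_mask):
    """Average per-token vectors, skipping padding positions.

    token_embeddings: list of per-token vectors (length = seq_len)
    attention_mask:   list of 0/1 ints (1 = real token, 0 = padding)
    """
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            count += 1
            for i, v in enumerate(vec):
                sums[i] += v
    # One pooled vector per sequence, fed to the classification head.
    return [s / count for s in sums]
```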

Labels

| Index | Field | Display Name |
|---|---|---|
| 0 | mathematics_statistics | Mathematics Statistics |
| 1 | computer_science_software_engineering | Computer Science Software Engineering |
| 2 | machine_learning_ai | Machine Learning AI |
| 3 | physical_sciences | Physical Sciences |
| 4 | life_sciences_biology | Life Sciences Biology |
| 5 | medicine_health | Medicine Health |
| 6 | engineering_technology | Engineering Technology |
| 7 | business_economics | Business Economics |
| 8 | law_government | Law Government |
| 9 | social_sciences | Social Sciences |
| 10 | history_geography | History Geography |
| 11 | philosophy_ethics | Philosophy Ethics |
| 12 | education_pedagogy | Education Pedagogy |
| 13 | language_writing | Language Writing |
| 14 | arts_humanities | Arts Humanities |
| 15 | environmental_science_energy | Environmental Science Energy |
| 16 | personal_finance_practical_life | Personal Finance Practical Life |
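
Because the task is multi-label, each of the 17 logits is passed through an independent sigmoid and every label above a threshold is predicted. A sketch of that decoding step, using the index-to-field mapping above (the 0.5 threshold is an assumption, not stated in the card):

```python
import math

# Index-to-field mapping, as listed in the table above.
ID2LABEL = {
    0: "mathematics_statistics",
    1: "computer_science_software_engineering",
    2: "machine_learning_ai",
    3: "physical_sciences",
    4: "life_sciences_biology",
    5: "medicine_health",
    6: "engineering_technology",
    7: "business_economics",
    8: "law_government",
    9: "social_sciences",
    10: "history_geography",
    11: "philosophy_ethics",
    12: "education_pedagogy",
    13: "language_writing",
    14: "arts_humanities",
    15: "environmental_science_energy",
    16: "personal_finance_practical_life",
}

def decode(logits, threshold=0.5):
    """Turn 17 raw logits into predicted field names.

    Each label gets its own sigmoid, so zero, one, or several
    labels can fire for the same passage.
    """
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [ID2LABEL[i] for i, p in enumerate(probs) if p >= threshold]
```

For example, a strongly positive logit only at index 0 decodes to `["mathematics_statistics"]`.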

Training Data

  • Source: HuggingFaceFW/fineweb-edu (CC-MAIN-2021-04 shard) plus ~50K rows from HuggingFaceFW/fineweb (10BT sample)
  • Labels were generated by gpt-5-nano via the OpenAI Batch API (~$80 in batch credits)
  • Data was split 80% train / 10% val / 10% test (random seed 42)
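
The exact splitting code is not shown in the card; a minimal sketch consistent with the description (deterministic shuffle with seed 42, then an 80/10/10 cut) would be:

```python
import random

def split_80_10_10(rows, seed=42):
    """Shuffle deterministically, then split into train/val/test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # remainder, ~10%
    return train, val, test
```

Fixing the seed makes the split reproducible: rerunning the function on the same rows yields identical train/val/test sets.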

Training Configuration

| Hyperparameter | Value |
|---|---|
| Epochs | 3 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max token length | 512 |
| Optimizer | AdamW |
| Scheduler | Linear with warmup |
| AMP | bf16 (on CUDA) |
| Gradient clipping | max norm 1.0 |

The checkpoint from the epoch with the best validation micro-F1 (epoch 2) was kept as the final model.
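
With a warmup ratio of 0.1, the linear schedule ramps the learning rate from 0 to 2e-5 over the first 10% of steps, then decays it linearly to 0 (the same shape as `get_linear_schedule_with_warmup` in transformers). A self-contained sketch of the per-step learning rate:

```python
def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then linear decay to zero."""
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Warmup phase: ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: ramp from base_lr down to 0 at the final step.
    return base_lr * (total_steps - step) / max(1, total_steps - warmup_steps)
```

For example, with 1000 total steps the rate peaks at 2e-5 at step 100 and reaches 0 at step 1000.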

Test Set Performance

| Metric | Score |
|---|---|
| Micro F1 | 0.8545 |
| Macro F1 | 0.8264 |
| Precision (micro) | 0.8799 |
| Recall (micro) | 0.8304 |
| Loss | 0.1222 |
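
Micro-averaged metrics pool true/false positives and false negatives across all 17 labels before computing precision and recall, while macro F1 averages per-label F1 scores (so it weights rare subjects more heavily). A small sketch, which also shows the reported numbers are self-consistent, since micro F1 is the harmonic mean of micro precision and recall:

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def micro_prf(tp, fp, fn):
    """Micro-averaged precision/recall/F1 from counts pooled over all labels."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1(precision, recall)

# Sanity check against the table above:
# f1(0.8799, 0.8304) ≈ 0.8545
```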