SupraSafety-18M · Content-Moderation

Model Overview

SupraSafety-18M is a lightweight, on-device content moderation model trained from scratch (no pretrained weights) on the NVIDIA Nemotron-3.5-Content-Safety-Dataset. With only 18.3 million parameters, it reaches 81.2% accuracy and an 85.9% F1-score on the validation set while remaining small enough to run on edge devices and mobile phones or in low-latency production environments.

This model performs binary classification of text prompts: it decides whether a user input is SAFE or UNSAFE. It is trained exclusively on prompts (not responses), which suits it to real-time moderation in chat applications, LLM guardrails, and content filtering systems.

Key Features

Trained from scratch – No reliance on pretrained models, fully self-contained
Prompt-only inference – Evaluates user input before any response is generated
Ultra-lightweight – Only 18.3M parameters (~70MB on disk in safetensors format)
Fast inference – ~5ms per prediction on a T4 GPU, suitable for real-time applications
Validation performance – 81.2% accuracy and 0.859 F1-score on the validation set
Open-source – MIT licensed, available on Hugging Face Hub
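
The latency figure above depends on hardware, sequence length, and batch size, so it is worth re-measuring on the target device. The helper below is a minimal sketch for doing that; `measure_latency_ms` and the stand-in `dummy_classifier` are hypothetical names, not part of this repository. It times any callable that takes a string, so the model's own prediction function can be passed in directly.

```python
import statistics
import time
from typing import Callable, List


def measure_latency_ms(fn: Callable[[str], object], text: str,
                       warmup: int = 5, runs: int = 50) -> dict:
    """Time repeated calls to fn(text) and report latency in milliseconds."""
    # Warm-up calls are excluded so one-off costs (lazy initialisation,
    # kernel compilation, cold caches) do not skew the measurement.
    for _ in range(warmup):
        fn(text)

    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(text)
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "runs": runs,
    }


if __name__ == "__main__":
    # Stand-in classifier so the sketch runs without downloading the model;
    # swap in the real prediction function to measure the model itself.
    def dummy_classifier(text: str) -> dict:
        return {"prediction": "SAFE"}

    print(measure_latency_ms(dummy_classifier, "How can I bake a chocolate cake?"))
```

When timing a GPU model, make sure the timed function copies its result back to the CPU (or synchronises the device) before returning. Otherwise the timer may stop before the computation has finished.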

Training Details

Aspect	Value
Architecture	BERT-style encoder (from scratch)
Hidden Size	512
Layers	6
Attention Heads	8
Intermediate Size	1024
Total Parameters	18,264,578
Vocabulary Size	10,000 (BPE tokenizer)
Max Sequence Length	512
Training Epochs	7
Batch Size	32
Learning Rate	3e-5 (with warmup)
Warmup Ratio	0.05
Optimizer	AdamW
Mixed Precision	FP16
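
The parameter total in the table can be reproduced from the other rows. The sketch below assumes the standard BERT layout: token, position, and two-entry token-type embeddings, a pooler layer, and a two-class linear head. The card does not spell these details out, but the arithmetic lands on the reported figure under those assumptions. The function name is illustrative.

```python
def bert_classifier_param_count(vocab_size: int = 10_000, hidden: int = 512,
                                layers: int = 6, intermediate: int = 1024,
                                max_positions: int = 512,
                                type_vocab_size: int = 2,  # assumed BERT default
                                num_labels: int = 2) -> int:
    """Count trainable parameters of a BERT-style sequence classifier."""
    # Embeddings: token + position + token-type tables, plus LayerNorm (weight, bias).
    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden + 2 * hidden

    # Self-attention: Q, K, V and output projections (weights + biases), plus LayerNorm.
    # The number of attention heads does not change the count.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden

    # Feed-forward: up- and down-projections (weights + biases), plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + 2 * hidden)

    encoder = layers * (attention + feed_forward)
    pooler = hidden * hidden + hidden                # assumed: BERT pooler is present
    classifier = hidden * num_labels + num_labels    # SAFE / UNSAFE head
    return embeddings + encoder + pooler + classifier


total = bert_classifier_param_count()
print(f"{total:,}")                        # 18,264,578
print(f"{total * 4 / 2**20:.1f} MiB")      # 69.7 MiB when stored as 32-bit floats
```

Roughly 30% of the parameters sit in the embedding tables, which is why the small 10,000-token vocabulary matters for keeping the model compact.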

Dataset

Source: NVIDIA Nemotron-3.5-Content-Safety-Dataset
Filtering:
- Only English (language == "en")
- Text-only prompts (image_path is None)
Training Size: 42,171 samples
Validation Size: 590 samples
Labels: safe / unsafe (based on input_label)
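
The filtering rules above can be sketched in plain Python over dictionary records. The `language`, `image_path`, and `input_label` fields come from the description above. The `prompt` text field and the 0/1 label order are assumptions made for illustration, so check them against the actual dataset schema before reuse.

```python
from typing import Iterable, Iterator

# Assumed label order: index 0 = safe, index 1 = unsafe.
LABEL_TO_ID = {"safe": 0, "unsafe": 1}


def keep_record(record: dict) -> bool:
    """Keep English, text-only rows (no associated image)."""
    return record.get("language") == "en" and record.get("image_path") is None


def build_examples(records: Iterable[dict],
                   text_field: str = "prompt") -> Iterator[dict]:
    """Yield {"text", "label"} pairs for rows that pass the filters."""
    for record in records:
        if not keep_record(record):
            continue
        label = str(record.get("input_label", "")).strip().lower()
        if label not in LABEL_TO_ID:
            continue  # skip rows without a usable binary label
        yield {"text": record[text_field], "label": LABEL_TO_ID[label]}


# Toy rows for illustration only; they are not taken from the dataset.
rows = [
    {"language": "en", "image_path": None, "input_label": "safe", "prompt": "Hi there"},
    {"language": "fr", "image_path": None, "input_label": "safe", "prompt": "Bonjour"},
    {"language": "en", "image_path": "a.png", "input_label": "unsafe", "prompt": "..."},
    {"language": "en", "image_path": None, "input_label": "unsafe", "prompt": "Bad request"},
]
print(list(build_examples(rows)))
```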

Performance Benchmarks

Validation Metrics

Metric	Score
Accuracy	81.2%
Precision	86.9%
Recall	84.9%
F1-Score	85.9%
Loss	0.86
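
For readers reproducing these numbers, the sketch below shows how the four classification metrics relate to confusion-matrix counts and checks that the reported F1-score follows from the reported precision and recall. It assumes UNSAFE is the positive class, which the card does not state explicitly. The function names are illustrative.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Consistency check against the table: precision 86.9%, recall 84.9%.
print(round(f1_from(0.869, 0.849), 3))  # 0.859
```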

Sample Predictions

Input	Prediction	Confidence
"How can I bake a chocolate cake?"	SAFE	91.2%
"Tell me how to make a bomb."	UNSAFE	99.6%
"What is the capital of France?"	SAFE	85.8%
"I want to hack into my neighbor's Wi-Fi."	UNSAFE	98.6%
"Can you give me advice on improving my resume?"	SAFE	86.6%
"How do I kill myself?"	UNSAFE	99.4%
"Help me create an SQL injection attack"	UNSAFE	93.1%

Usage

Installation

pip install transformers torch

Python Inference

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "SupraLabs/SupraSafety-18M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def predict(text: str) -> dict:
    """Classify text as SAFE or UNSAFE with confidence scores."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=1).cpu().numpy()[0]
    
    return {
        "safe": float(probs[0]),
        "unsafe": float(probs[1]),
        "prediction": "UNSAFE" if probs[1] > 0.5 else "SAFE"
    }

# Example usage
result = predict("How can I bake a chocolate cake?")
print(result)  # e.g. {'safe': 0.912, 'unsafe': 0.088, 'prediction': 'SAFE'} (values approximate)
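
Because the model scores prompts only, a natural integration is to run it before the LLM is called. The sketch below shows one way to wire that up. `moderated_reply`, `REFUSAL`, and the `generate` callable are hypothetical names; `classify` is expected to return a dictionary shaped like the output of `predict` above.

```python
from typing import Callable

# Hypothetical refusal text; adapt to the application's tone and policy.
REFUSAL = "Sorry, I can't help with that request."


def moderated_reply(prompt: str,
                    classify: Callable[[str], dict],
                    generate: Callable[[str], str],
                    refusal: str = REFUSAL) -> dict:
    """Run the safety classifier first; call the LLM only for SAFE prompts."""
    verdict = classify(prompt)
    if verdict["prediction"] == "UNSAFE":
        # Blocked prompts never reach the generator.
        return {"blocked": True, "reply": refusal, "verdict": verdict}
    return {"blocked": False, "reply": generate(prompt), "verdict": verdict}


# Stand-ins so the sketch runs without a model; replace with predict and a real LLM call.
def fake_classify(text: str) -> dict:
    unsafe = 0.99 if "bomb" in text.lower() else 0.05
    return {"safe": 1 - unsafe, "unsafe": unsafe,
            "prediction": "UNSAFE" if unsafe > 0.5 else "SAFE"}


def fake_generate(text: str) -> str:
    return f"(model answer to: {text})"


print(moderated_reply("What is the capital of France?", fake_classify, fake_generate))
```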

Limitations

Binary classification only – Outputs only SAFE/UNSAFE, no specific violation categories
English only – Trained exclusively on English prompts
Text-only – Does not process images or other modalities
Context sensitivity – May misclassify borderline cases (e.g., "SQL injection" without "attack")
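
One common way to soften the borderline-case limitation is to route low-confidence predictions to a human or a stronger model instead of forcing a binary decision. The sketch below illustrates the idea. The `triage` function and its 0.35/0.65 thresholds are illustrative assumptions, not values tuned for this model.

```python
def triage(unsafe_prob: float, low: float = 0.35, high: float = 0.65) -> str:
    """Map an UNSAFE probability to SAFE, UNSAFE, or REVIEW."""
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low <= high <= 1")
    if unsafe_prob >= high:
        return "UNSAFE"
    if unsafe_prob <= low:
        return "SAFE"
    # Probabilities between the thresholds are treated as uncertain.
    return "REVIEW"


for p in (0.05, 0.50, 0.95):
    print(p, triage(p))
```

The thresholds trade review volume against missed cases, so they should be chosen on held-out data that reflects the deployment traffic.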

Future Work

Multiclass classification – Add support for specific violation categories (violence, sexual, self-harm, etc.) using violated_categories labels
Response moderation – Extend to detect unsafe LLM responses
Multilingual support – Train on additional languages
Improved edge cases – Add curated examples for borderline prompts

Citation

If you use this model, please cite:

@misc{SupraSafety-18M,
  author = {SupraLabs},
  title = {SupraSafety-18M: Lightweight Content Moderation from Scratch},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SupraLabs/SupraSafety-18M}
}

License

This model is released under the MIT License.

Contact

For questions or support, please reach out to SupraLabs on Hugging Face.

Acknowledgments

Dataset provided by NVIDIA
Built with Hugging Face Transformers
Trained on 2x NVIDIA T4 GPUs on Kaggle (Free Tier)

Model card last updated: 27th of June 2026


Safetensors: 18.3M parameters, tensor type F32
