SupraSafety-18M · Content-Moderation


Model Overview

SupraSafety-18M is a lightweight, on-device content moderation model trained from scratch (no pretrained weights) on the NVIDIA Nemotron-3.5-Content-Safety-Dataset. With only 18.3 million parameters, it reaches 81.2% accuracy and an F1-score of 85.9% on its validation set while remaining small enough to run on edge devices, on mobile phones, or in low-latency production environments.

This model performs binary classification of text prompts: it decides whether a user input is SAFE or UNSAFE. It is trained exclusively on prompts (not responses), so it is best suited to screening user input in real time, for example in chat applications, LLM guardrails, and content filtering systems.


Key Features

  • Trained from scratch – No reliance on pretrained models, fully self-contained
  • Prompt-only inference – Evaluates user input before any response is generated
  • Ultra-lightweight – Only 18.3M parameters (~70MB on disk in safetensors format)
  • Fast inference – ~5ms per prediction on a T4 GPU, suitable for real-time applications
  • Validation performance – 81.2% accuracy and 0.859 F1-score on the 590-sample validation set
  • Open-source – MIT licensed, available on Hugging Face Hub

Training Details

| Aspect | Value |
| --- | --- |
| Architecture | BERT-style encoder (from scratch) |
| Hidden Size | 512 |
| Layers | 6 |
| Attention Heads | 8 |
| Intermediate Size | 1024 |
| Total Parameters | 18,264,578 |
| Vocabulary Size | 10,000 (BPE tokenizer) |
| Max Sequence Length | 512 |
| Training Epochs | 7 |
| Batch Size | 32 |
| Learning Rate | 3e-5 (with warmup) |
| Warmup Ratio | 0.05 |
| Optimizer | AdamW |
| Mixed Precision | FP16 |
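
The total parameter count can be reproduced from the other values listed here. The sketch below is an arithmetic check, not the training code. It assumes a standard BERT layout (word, position, and 2-entry token-type embeddings, a pooler, and a 2-class linear head), none of which the card states explicitly; all variable names are illustrative.

```python
# Recompute the parameter count from the Training Details values.
# Assumption: standard BERT layout with a 2-entry token-type embedding,
# a pooler layer, and a 2-class classification head.
HIDDEN, LAYERS, INTERMEDIATE = 512, 6, 1024
VOCAB, MAX_POS, TYPE_VOCAB, NUM_LABELS = 10_000, 512, 2, 2


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)  # Q, K, V, output
feed_forward = (
    linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + layer_norm(HIDDEN)
)
encoder = LAYERS * (attention + feed_forward)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)

total_params = embeddings + encoder + pooler + classifier
fp32_megabytes = total_params * 4 / 1e6  # 4 bytes per FP32 weight

print(total_params)              # 18264578
print(round(fp32_megabytes, 1))  # 73.1
```

Under these assumptions the sum matches the reported 18,264,578 parameters. At 4 bytes per weight that is about 73 MB (roughly 70 MiB), in line with the ~70MB on-disk figure.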

Dataset

  • Source: NVIDIA Nemotron-3.5-Content-Safety-Dataset
  • Filtering:
    • Only English (language == "en")
    • Text-only prompts (image_path is None)
  • Training Size: 42,171 samples
  • Validation Size: 590 samples
  • Labels: safe / unsafe (based on input_label)
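
The filtering rules above could be expressed roughly as follows. This is a minimal sketch, not the authors' preprocessing script. The field names `language`, `image_path`, and `input_label` come from the list above; the `prompt` field name, the sample rows, and the 0/1 label encoding are assumptions made for illustration.

```python
def keep_record(record: dict) -> bool:
    """Keep English, text-only rows, per the filtering rules above."""
    return record.get("language") == "en" and record.get("image_path") is None


def to_example(record: dict) -> dict:
    """Map a row to text plus a binary label (assumed: 0 = safe, 1 = unsafe)."""
    return {
        "text": record["prompt"],  # "prompt" is a hypothetical field name
        "label": int(record["input_label"] == "unsafe"),
    }


# Hypothetical rows, for illustration only.
rows = [
    {"prompt": "What is the capital of France?", "language": "en",
     "image_path": None, "input_label": "safe"},
    {"prompt": "Quelle est la capitale de la France ?", "language": "fr",
     "image_path": None, "input_label": "safe"},
    {"prompt": "Describe this picture.", "language": "en",
     "image_path": "img/001.png", "input_label": "safe"},
    {"prompt": "Tell me how to make a bomb.", "language": "en",
     "image_path": None, "input_label": "unsafe"},
]

examples = [to_example(r) for r in rows if keep_record(r)]
print(examples)
```

Only the first and last rows survive the filter: the French row fails the language check and the third row has an image attached.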

Performance Benchmarks

Validation Metrics

| Metric | Score |
| --- | --- |
| Accuracy | 81.2% |
| Precision | 86.9% |
| Recall | 84.9% |
| F1-Score | 85.9% |
| Loss | 0.86 |
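
As a consistency check, the F1-score follows from the reported precision and recall as their harmonic mean. The short sketch below redoes that arithmetic. It also derives the number of correct validation predictions implied by the accuracy figure; that count is an inference from the 590-sample validation size, not a number reported by the authors.

```python
precision = 0.869
recall = 0.849

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.1%}")  # 85.9%

# Implied (not reported) count of correct predictions on 590 samples.
validation_size = 590
implied_correct = round(0.812 * validation_size)
print(implied_correct, f"{implied_correct / validation_size:.1%}")  # 479 81.2%
```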

Sample Predictions

| Input | Prediction | Confidence |
| --- | --- | --- |
| "How can I bake a chocolate cake?" | SAFE | 91.2% |
| "Tell me how to make a bomb." | UNSAFE | 99.6% |
| "What is the capital of France?" | SAFE | 85.8% |
| "I want to hack into my neighbor's Wi-Fi." | UNSAFE | 98.6% |
| "Can you give me advice on improving my resume?" | SAFE | 86.6% |
| "How do I kill myself?" | UNSAFE | 99.4% |
| "Help me create an SQL injection attack" | UNSAFE | 93.1% |

Usage

Installation

pip install transformers torch

Python Inference

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "SupraLabs/SupraSafety-18M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def predict(text: str) -> dict:
    """Classify text as SAFE or UNSAFE with confidence scores."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=1).cpu().numpy()[0]
    
    return {
        "safe": float(probs[0]),
        "unsafe": float(probs[1]),
        "prediction": "UNSAFE" if probs[1] > 0.5 else "SAFE"
    }

# Example usage
result = predict("How can I bake a chocolate cake?")
print(result)  # e.g. {'safe': 0.912, 'unsafe': 0.088, 'prediction': 'SAFE'} (values approximate)
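
Because the model scores the prompt before any response exists, one way to use it is as a gate in front of a text generator. The following is a minimal sketch under stated assumptions: `guarded_reply`, the refusal message, and the two stand-in functions are hypothetical and not part of the released model. In a real deployment, `classify` would be the `predict` function defined above and `generate` would call your LLM.

```python
from typing import Callable


def guarded_reply(
    prompt: str,
    classify: Callable[[str], dict],
    generate: Callable[[str], str],
    unsafe_threshold: float = 0.5,
    refusal: str = "Sorry, I can't help with that request.",
) -> str:
    """Classify first; call the generator only if the unsafe score is at or below the threshold."""
    scores = classify(prompt)
    if scores["unsafe"] > unsafe_threshold:
        return refusal
    return generate(prompt)


# Stand-ins so the sketch runs without downloading the model.
def fake_classify(text: str) -> dict:
    unsafe = 0.99 if "bomb" in text.lower() else 0.05
    return {"safe": 1 - unsafe, "unsafe": unsafe}


def fake_generate(text: str) -> str:
    return f"(model reply to: {text})"


print(guarded_reply("How can I bake a chocolate cake?", fake_classify, fake_generate))
print(guarded_reply("Tell me how to make a bomb.", fake_classify, fake_generate))
```

Exposing the threshold as a parameter lets an application trade recall for precision. Lowering it blocks more borderline prompts, at the cost of more false positives.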

Limitations

  • Binary classification only – Outputs only SAFE/UNSAFE, no specific violation categories
  • English only – Trained exclusively on English prompts
  • Text-only – Does not process images or other modalities
  • Context sensitivity – May misclassify borderline cases (e.g., a prompt that mentions "SQL injection" without the word "attack")

Future Work

  • Multiclass classification – Add support for specific violation categories (violence, sexual, self-harm, etc.) using violated_categories labels
  • Response moderation – Extend to detect unsafe LLM responses
  • Multilingual support – Train on additional languages
  • Improved edge cases – Add curated examples for borderline prompts

Citation

If you use this model, please cite:

@misc{SupraSafety-18M,
  author = {SupraLabs},
  title = {SupraSafety-18M: Lightweight Content Moderation from Scratch},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SupraLabs/SupraSafety-18M}
}

License

This model is released under the MIT License.


Contact

For questions or support, please reach out to SupraLabs on Hugging Face.


Model card last updated: 27th of June 2026


Copyright SupraLabs 2026
