πŸ›‘οΈ Hybrid Chat + Email Spam Classifier (Encoder-Only, Multi-Turn)

This repository provides a lightweight, encoder-only multi-turn classifier designed to detect spam and unwanted content across emails and chat conversations.

It supports short and long messages, as well as multi-turn conversational inputs (metadata + message).

It was trained on a mixed dataset of emails, support chats, and messaging threads.

The model is a fast, encoder-only classifier, distilled from a 14B teacher model on a dataset of 20M records.


✨ Features

  • Encoder-only architecture → produces label scores directly (no text generation)
  • Multi-turn support → handles conversation history and context windows
  • Hybrid input domain → optimized for both chat messages & email bodies
  • High-throughput → suitable for millions of messages/day (see the batched scoring sketch below)
  • Ideal for security filters (spam, scams, phishing, self-promotional content)
  • Open-source and deployable anywhere (CPU or GPU)
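
For reference, here is a minimal batched-scoring sketch in Python. It assumes the checkpoint is published on the Hugging Face Hub as baptistejamin/xlm-roberta-large-spam_v4 and exposes a standard sequence-classification head; the batch size, max length, and device handling are illustrative choices, not part of the released configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "baptistejamin/xlm-roberta-large-spam_v4"  # repository id (assumed published on the Hub)
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(device).eval()

def classify_batch(texts, batch_size=64):
    """Score a list of messages and return (top_label, confidence) for each."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        enc = tokenizer(batch, padding=True, truncation=True, max_length=512,
                        return_tensors="pt").to(device)
        with torch.no_grad():
            probs = torch.softmax(model(**enc).logits, dim=-1)
        for row in probs:
            idx = int(row.argmax())
            results.append((model.config.id2label[idx], float(row[idx])))
    return results
```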

🔧 Model Architecture

  • Type: Encoder-only (XLM-RoBERTa Large)

  • Input format:

[CONTEXT 1] [CONTEXT 2] ... [USER MESSAGE]
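
As a rough sketch of how a conversation could be flattened into this layout (the separator between turns and the number of retained context turns are assumptions; adapt them to your own preprocessing):

```python
def build_input(context_turns, user_message, max_turns=4):
    """Join the most recent context turns and the new message,
    following the [CONTEXT ...] [USER MESSAGE] layout above."""
    turns = context_turns[-max_turns:] + [user_message]
    return " ".join(turns)

text = build_input(
    ["Hi, is this item still available?", "Yes, it is. When can you pick it up?"],
    "Congratulations! Claim your free gift card here: http://example.com/win",
)
```

The resulting string is what gets tokenized and scored by the classifier.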

Labels include:

  • spam
  • regular (ham)
  • marketing
  • gibberish
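
For illustration, a single message can be scored across these labels with the transformers text-classification pipeline; the repository id is assumed to be available on the Hub, and the label names are read from the checkpoint's config rather than hard-coded.

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="baptistejamin/xlm-roberta-large-spam_v4",  # repository id (assumed published on the Hub)
)

message = "URGENT: your account is suspended, verify your details at http://example.com/login"

# top_k=None returns a score for every label (spam, regular, marketing, gibberish), highest first.
print(classifier(message, top_k=None))
```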

📊 Benchmark

  • F1 Spam: 0.90
  • F1 Regular: 0.95
  • F1 Marketing: 0.87
  • F1 Gibberish: 0.94

While this model is not perfect, it is excellent at catching spam quickly and performs far better than traditional Bayesian filters.

Model weights: 0.6B parameters, F32 tensors, Safetensors format.