Silero VAD v5 - MLX

MLX-compatible weights for Silero VAD v5, converted from the official JIT model.

Model

Silero VAD v5 is a lightweight (~309K-parameter) voice activity detection model that processes 512-sample chunks (32 ms at 16 kHz) with sub-millisecond latency. For each chunk it outputs a speech probability between 0 and 1, with LSTM state carried across chunks for streaming operation.
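The chunking arithmetic above can be sketched as follows (a minimal illustration; `chunk_audio` is a hypothetical helper, not part of this package):

```python
# Streaming sketch: the model consumes fixed 512-sample chunks,
# each spanning 512 / 16000 = 32 ms of audio at 16 kHz.
SAMPLE_RATE = 16_000
CHUNK = 512

def chunk_audio(samples):
    """Split audio into full 512-sample chunks, dropping any short tail."""
    return [samples[i:i + CHUNK] for i in range(0, len(samples) - CHUNK + 1, CHUNK)]

chunks = chunk_audio([0.0] * SAMPLE_RATE)  # 1 second of silence -> 31 full chunks
```

In a real streaming loop, each chunk would be fed to the model in order so the LSTM state carries over between calls.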

Architecture: STFT → 4× Conv1d+ReLU encoder → LSTM(128) → Conv1d decoder → sigmoid

Usage (Swift / MLX)

import SpeechVAD

// Load model
let vad = try await SileroVADModel.fromPretrained()

// Streaming: process 512-sample chunks
let prob = vad.processChunk(samples)  // → 0.0...1.0

// Batch: detect speech segments in complete audio
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for seg in segments {
    print("Speech: \(seg.startTime)s - \(seg.endTime)s")
}

Part of qwen3-asr-swift.

Conversion

python3 scripts/convert_silero_vad.py --upload

Fetches the official Silero VAD v5 JIT model via torch.hub, transposes the Conv1d weights to MLX's channels-last format, sums the paired LSTM biases (bias_ih + bias_hh), and saves the result as safetensors.
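The two weight transforms can be sketched like this (illustrative only; the real script operates on the actual JIT state dict, and the shapes here are dummies):

```python
import numpy as np

# 1) Conv1d weights: PyTorch stores them as [out_ch, in_ch, kernel];
#    MLX's Conv1d expects channels-last [out_ch, kernel, in_ch].
w_torch = np.zeros((128, 129, 3))         # dummy conv weight
w_mlx = np.transpose(w_torch, (0, 2, 1))  # -> shape (128, 3, 129)

# 2) LSTM biases: PyTorch keeps separate bias_ih and bias_hh vectors,
#    but the cell only ever uses their sum, so the converter folds
#    them into a single bias vector.
bias_ih = np.ones(512)
bias_hh = np.full(512, 2.0)
bias = bias_ih + bias_hh                  # single [512] bias
```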

Weight Mapping

JIT Key                                 MLX Key             Shape
_model.stft.forward_basis_buffer        stft.weight         [258, 256, 1]
_model.encoder.{i}.reparam_conv.weight  encoder.{i}.weight  varies
_model.encoder.{i}.reparam_conv.bias    encoder.{i}.bias    varies
_model.decoder.rnn.weight_ih            lstm.Wx             [512, 128]
_model.decoder.rnn.weight_hh            lstm.Wh             [512, 128]
_model.decoder.rnn.bias_ih + bias_hh    lstm.bias           [512]
_model.decoder.decoder.2.weight         decoder.weight      [1, 1, 128]
_model.decoder.decoder.2.bias           decoder.bias        [1]
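The mapping above can be expressed as a rename table (a sketch; the actual converter may build it differently, e.g. with pattern matching over the state dict):

```python
# Rename map mirroring the table above. The fixed keys are listed
# directly; the four encoder conv blocks are filled in by index.
WEIGHT_MAP = {
    "_model.stft.forward_basis_buffer": "stft.weight",
    "_model.decoder.rnn.weight_ih": "lstm.Wx",
    "_model.decoder.rnn.weight_hh": "lstm.Wh",
    "_model.decoder.decoder.2.weight": "decoder.weight",
    "_model.decoder.decoder.2.bias": "decoder.bias",
}
for i in range(4):  # four Conv1d+ReLU encoder blocks
    WEIGHT_MAP[f"_model.encoder.{i}.reparam_conv.weight"] = f"encoder.{i}.weight"
    WEIGHT_MAP[f"_model.encoder.{i}.reparam_conv.bias"] = f"encoder.{i}.bias"
```

(The summed LSTM bias is omitted here because it is produced by adding two source tensors rather than renaming one.)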

License

The original Silero VAD model is released under the MIT License.
