Silero VAD v5 - MLX

MLX-compatible weights for Silero VAD v5, converted from the official JIT model.

Model

Silero VAD v5 is a lightweight (~309K-parameter) voice activity detection model that processes 512-sample chunks (32 ms at 16 kHz) with sub-millisecond latency. For each chunk it outputs a speech probability between 0 and 1, with LSTM state carried across chunks for streaming operation.
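The chunking arithmetic above can be sketched as follows (a minimal illustration; `chunk_audio` is a hypothetical helper, not part of this package):

```python
# Streaming sketch: the model consumes fixed 512-sample chunks,
# each spanning 512 / 16000 = 32 ms of audio at 16 kHz.
SAMPLE_RATE = 16_000
CHUNK = 512

def chunk_audio(samples):
    """Split audio into full 512-sample chunks, dropping any short tail."""
    return [samples[i:i + CHUNK] for i in range(0, len(samples) - CHUNK + 1, CHUNK)]

chunks = chunk_audio([0.0] * SAMPLE_RATE)  # 1 second of silence -> 31 full chunks
```

In a real streaming loop, each chunk would be fed to the model in order so the LSTM state carries over between calls.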

Architecture: STFT → 4× Conv1d+ReLU encoder → LSTM(128) → Conv1d decoder → sigmoid

Usage (Swift / MLX)

import SpeechVAD

// Load model
let vad = try await SileroVADModel.fromPretrained()

// Streaming: process 512-sample chunks
let prob = vad.processChunk(samples)  // → 0.0...1.0

// Batch: detect speech segments in complete audio
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for seg in segments {
    print("Speech: \(seg.startTime)s - \(seg.endTime)s")
}

Part of qwen3-asr-swift.

Conversion

python3 scripts/convert_silero_vad.py --upload

Fetches the official Silero VAD v5 JIT model via torch.hub, transposes the Conv1d weights to MLX's channels-last format, sums the paired LSTM biases (bias_ih + bias_hh), and saves the result as safetensors.
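The two weight transforms can be sketched like this (illustrative only; the real script operates on the actual JIT state dict, and the shapes here are dummies):

```python
import numpy as np

# 1) Conv1d weights: PyTorch stores them as [out_ch, in_ch, kernel];
#    MLX's Conv1d expects channels-last [out_ch, kernel, in_ch].
w_torch = np.zeros((128, 129, 3))         # dummy conv weight
w_mlx = np.transpose(w_torch, (0, 2, 1))  # -> shape (128, 3, 129)

# 2) LSTM biases: PyTorch keeps separate bias_ih and bias_hh vectors,
#    but the cell only ever uses their sum, so the converter folds
#    them into a single bias vector.
bias_ih = np.ones(512)
bias_hh = np.full(512, 2.0)
bias = bias_ih + bias_hh                  # single [512] bias
```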

Weight Mapping

JIT Key                                 MLX Key             Shape
_model.stft.forward_basis_buffer        stft.weight         [258, 256, 1]
_model.encoder.{i}.reparam_conv.weight  encoder.{i}.weight  varies
_model.encoder.{i}.reparam_conv.bias    encoder.{i}.bias    varies
_model.decoder.rnn.weight_ih            lstm.Wx             [512, 128]
_model.decoder.rnn.weight_hh            lstm.Wh             [512, 128]
_model.decoder.rnn.bias_ih + bias_hh    lstm.bias           [512]
_model.decoder.decoder.2.weight         decoder.weight      [1, 1, 128]
_model.decoder.decoder.2.bias           decoder.bias        [1]
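The mapping above can be expressed as a rename table (a sketch; the actual converter may build it differently, e.g. with pattern matching over the state dict):

```python
# Rename map mirroring the table above. The fixed keys are listed
# directly; the four encoder conv blocks are filled in by index.
WEIGHT_MAP = {
    "_model.stft.forward_basis_buffer": "stft.weight",
    "_model.decoder.rnn.weight_ih": "lstm.Wx",
    "_model.decoder.rnn.weight_hh": "lstm.Wh",
    "_model.decoder.decoder.2.weight": "decoder.weight",
    "_model.decoder.decoder.2.bias": "decoder.bias",
}
for i in range(4):  # four Conv1d+ReLU encoder blocks
    WEIGHT_MAP[f"_model.encoder.{i}.reparam_conv.weight"] = f"encoder.{i}.weight"
    WEIGHT_MAP[f"_model.encoder.{i}.reparam_conv.bias"] = f"encoder.{i}.bias"
```

(The summed LSTM bias is omitted here because it is produced by adding two source tensors rather than renaming one.)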

License

The original Silero VAD model is released under the MIT License.
