[Figure: overview of the four-step training pipeline. Step 1: Self-Supervised DINO Training. Step 2: Pseudo-Labels Creation. Step 3: WavLM MHFA Fine-Tuning. Step 4: Large Margin Fine-Tuning. One part of the pipeline is marked "Iterative (×2)".
Labeled components: DINO with a student branch and a teacher branch (each an Encoder and a Projector, linked by EMA), fed 4 short and 2 long augmented frames and trained with the DINO loss plus redundancy elimination and diversity regularization; pseudo-labels produced from speaker embeddings by k-means (50,000) and AHC (7,500); pre-trained WavLM (CNN Encoder, Transformer Layers 1 to L) taking original and augmented frames, followed by MHFA with a key flow and a value flow (Linear, Attentive Pooling, Concat, Linear), yielding speaker embeddings trained with AAM Softmax; Dynamic Loss Gate and Label Correction, which separate reliable from unreliable labels.]