NuExtract3-FP8 and NuExtract3-W4A16: MTP speculative decoding broken — missing model_mtp.safetensors + 0% acceptance rate

by samimh23 - opened Jun 9


Summary

MTP speculative decoding does not work in NuExtract3-FP8
or NuExtract3-W4A16. We traced the failure to three
separate issues.

Bug 1 — model_mtp.safetensors missing from FP8 model ❌

# BF16 model files:
ls numind/NuExtract3/*.safetensors
→ model.safetensors
→ model_mtp.safetensors  ✅

# FP8 model files:
ls numind/NuExtract3-FP8/*.safetensors
→ model-00001-of-00002.safetensors
→ model-00002-of-00002.safetensors
→ model_mtp.safetensors  ❌ MISSING!

The MTP head file was not included in the FP8 quantization release.
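
A minimal sketch for checking a locally downloaded model directory for the MTP head file. The file name comes from the BF16 listing above; the helper names are hypothetical and the directory path is whatever your local snapshot uses.

```python
from pathlib import Path

# File name taken from the BF16 listing above.
MTP_WEIGHTS = "model_mtp.safetensors"


def has_mtp_weights(model_dir: str) -> bool:
    """Return True if the MTP head file exists in a local model directory."""
    return (Path(model_dir) / MTP_WEIGHTS).is_file()


def list_safetensors(model_dir: str) -> list[str]:
    """Return the sorted names of all .safetensors files in the directory."""
    return sorted(p.name for p in Path(model_dir).glob("*.safetensors"))
```

Running `has_mtp_weights` on both the BF16 and the FP8 snapshot directories should reproduce the difference shown in the listings.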

Bug 2 — MTP config fields missing from config.json ❌

# BF16 config.json → text_config:
"mtp_num_hidden_layers": 1        ✅
"mtp_use_dedicated_embeddings": False  ✅

# FP8 config.json → text_config:
"mtp_num_hidden_layers": None     ❌ MISSING
"mtp_use_dedicated_embeddings": None  ❌ MISSING

vLLM uses these fields to detect and enable MTP.
Without them, MTP is silently disabled.

Same issue exists in NuExtract3-W4A16.
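
A small sketch for spotting the missing fields in a local config.json. The two field names and the text_config key are the ones reported above; the helper names are hypothetical. A field counts as missing when it is absent or null, so an explicit `false` is treated as present.

```python
import json
from pathlib import Path

# Field names as reported in the BF16 config above.
MTP_FIELDS = ("mtp_num_hidden_layers", "mtp_use_dedicated_embeddings")


def missing_mtp_fields(config: dict) -> list[str]:
    """Return the MTP fields that are absent or null in config['text_config']."""
    text_config = config.get("text_config") or {}
    return [name for name in MTP_FIELDS if text_config.get(name) is None]


def missing_mtp_fields_in_dir(model_dir: str) -> list[str]:
    """Load config.json from a local model directory and check it."""
    with open(Path(model_dir) / "config.json", encoding="utf-8") as f:
        return missing_mtp_fields(json.load(f))
```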

Bug 3 — 0% MTP acceptance even after manual fix ❌

We manually copied model_mtp.safetensors from BF16
and added the missing config fields. vLLM now detects MTP:

"Detected MTP model. Sharing target model embedding weights"

But acceptance rate remains 0.000% across all positions:

SpecDecoding metrics:
  Total drafted:  200,000+ tokens
  Total accepted: 0 tokens
  Per-position:   0.000, 0.000, 0.000, 0.000
  Acceptance rate: 0.0%
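
For reference, a sketch of the arithmetic behind these numbers. It assumes a draft length of 4 (inferred from the four per-position values) and uses 200,000 as an illustrative drafted count, since the log only reports "200,000+". The function names are hypothetical, and the tokens-per-step formula is the usual speculative decoding accounting, not something read out of vLLM.

```python
def acceptance_rate(accepted: int, drafted: int) -> float:
    """Fraction of drafted tokens that the target model accepted."""
    if drafted <= 0:
        raise ValueError("drafted must be positive")
    return accepted / drafted


def mean_tokens_per_step(accepted: int, drafted: int, draft_len: int) -> float:
    """Average tokens emitted per verification step.

    Each step yields one token from the target model plus however many
    draft tokens were accepted in that step.
    """
    num_steps = drafted / draft_len
    return 1 + accepted / num_steps


# Illustrative values: 200_000 drafted, 0 accepted, draft length 4 (assumed).
print(acceptance_rate(0, 200_000))          # 0.0
print(mean_tokens_per_step(0, 200_000, 4))  # 1.0
```

At 0% acceptance each step still produces only one token, so the draft pass adds cost without saving any target-model steps.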

Suspected root cause: The MTP head was trained on BF16
hidden states. FP8/W4A16 quantized weights produce
different hidden state distributions, so the MTP
predictions are always rejected.

Throughput Impact

Model                                      MTP Acceptance   Throughput
NuExtract3 BF16 + vLLM dynamic FP8 + MTP   88.1%            53,216 req/day
NuExtract3-FP8 + MTP (after manual fix)    0.0%             ~16,000 req/day
NuExtract3-W4A16 + MTP                     0.0%             ~12,506 req/day

BF16 + vLLM dynamic quantization delivers about 3.3x the
throughput of the official FP8 model (53,216 vs. ~16,000
req/day) because MTP works correctly with BF16 hidden
states.
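
The ratios can be recomputed from the table as follows. The dictionary keys are shorthand labels, and the two quantized figures are the approximate values reported above.

```python
# Throughput in requests per day, copied from the table above.
throughput = {
    "BF16 + dynamic FP8 + MTP": 53_216,
    "NuExtract3-FP8 + MTP (manual fix)": 16_000,  # reported as ~16,000
    "NuExtract3-W4A16 + MTP": 12_506,             # reported as ~12,506
}

baseline = throughput["BF16 + dynamic FP8 + MTP"]

# How many times higher the baseline throughput is than each configuration.
ratios = {name: round(baseline / value, 2) for name, value in throughput.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

This gives roughly 3.33x against the FP8 release and roughly 4.26x against W4A16.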

Suggested Fixes

1. Include model_mtp.safetensors in FP8 and W4A16 releases
2. Add MTP config fields to text_config in config.json
3. Recalibrate MTP head after quantization: fine-tune
   the MTP head on the quantized model's hidden states so
   it learns the FP8/W4A16 activation distribution
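
Until the releases are updated, the first two fixes can be applied to a local copy, as we did under Bug 3. The sketch below assumes both models are already downloaded to local directories; the function name is hypothetical, and the field values are copied from the BF16 config rather than hard-coded. Per Bug 3, this restores MTP detection but does not fix the acceptance rate.

```python
import json
import shutil
from pathlib import Path

MTP_WEIGHTS = "model_mtp.safetensors"
MTP_FIELDS = ("mtp_num_hidden_layers", "mtp_use_dedicated_embeddings")


def apply_manual_mtp_fix(bf16_dir: str, quant_dir: str) -> None:
    """Copy the MTP head and MTP config fields from the BF16 model directory."""
    src, dst = Path(bf16_dir), Path(quant_dir)

    # Fix 1: copy the MTP head weights next to the quantized shards.
    shutil.copy2(src / MTP_WEIGHTS, dst / MTP_WEIGHTS)

    # Fix 2: copy the MTP fields from the BF16 text_config.
    src_cfg = json.loads((src / "config.json").read_text(encoding="utf-8"))
    dst_cfg_path = dst / "config.json"
    dst_cfg = json.loads(dst_cfg_path.read_text(encoding="utf-8"))
    text_config = dst_cfg.setdefault("text_config", {})
    for name in MTP_FIELDS:
        text_config[name] = src_cfg["text_config"][name]
    dst_cfg_path.write_text(json.dumps(dst_cfg, indent=2) + "\n", encoding="utf-8")
```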

Environment

GPU:    NVIDIA L4 (SM89, 24GB)
vLLM:   0.22.1
Models: numind/NuExtract3-FP8, numind/NuExtract3-W4A16

Also related to vLLM PR #45038 (FP8 KV crash fix)
which was filed from our vLLM issue #44879.

Happy to provide additional test data! 🚀

SorenDreano

NuMind org Jun 10

Good catch, it should be fixed now.
