NuExtract3-FP8 and NuExtract3-W4A16: MTP speculative decoding broken — missing model_mtp.safetensors + 0% acceptance rate

#1
by samimh23 - opened

Summary

MTP speculative decoding does not work with
NuExtract3-FP8 or NuExtract3-W4A16. We investigated
and found 3 separate issues.


Bug 1 — model_mtp.safetensors missing from FP8 model ❌

# BF16 model files:
ls numind/NuExtract3/*.safetensors
→ model.safetensors
→ model_mtp.safetensors  ✅

# FP8 model files:
ls numind/NuExtract3-FP8/*.safetensors
→ model-00001-of-00002.safetensors
→ model-00002-of-00002.safetensors
→ model_mtp.safetensors  ❌ MISSING!

The MTP head file was not included in the FP8 quantization release.
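
A minimal sketch of how this check can be scripted, assuming the file lists shown above. `has_mtp_weights` is a hypothetical helper name, not part of any library; it only inspects a list of file names.

```python
# Hypothetical helper: checks a list of repository file names for the MTP weights file.
MTP_WEIGHTS_FILE = "model_mtp.safetensors"


def has_mtp_weights(filenames):
    """Return True if the MTP head weights file is among the given file names."""
    # Compare on the base name so entries with a directory prefix also match.
    return any(name.rsplit("/", 1)[-1] == MTP_WEIGHTS_FILE for name in filenames)


# File lists copied from the listings above.
bf16_files = ["model.safetensors", "model_mtp.safetensors"]
fp8_files = [
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
]

print(has_mtp_weights(bf16_files))  # True
print(has_mtp_weights(fp8_files))   # False
```

The same function should work on a live listing, e.g. the result of `huggingface_hub.list_repo_files("numind/NuExtract3-FP8")`, if that package is installed.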


Bug 2 — MTP config fields missing from config.json ❌

# BF16 config.json → text_config:
"mtp_num_hidden_layers": 1
"mtp_use_dedicated_embeddings": False

# FP8 config.json → text_config:
"mtp_num_hidden_layers": None     ❌ MISSING
"mtp_use_dedicated_embeddings": None  ❌ MISSING

vLLM uses these fields to detect and enable MTP.
Without them, MTP is silently disabled.

Same issue exists in NuExtract3-W4A16.
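
A rough sketch of a check for the two fields, assuming config.json has been loaded into a dict with a `text_config` section as described above. `missing_mtp_fields` is a hypothetical helper, and the sample configs are reduced to the fields named in this report.

```python
import json

# Field names taken from this report.
MTP_FIELDS = ("mtp_num_hidden_layers", "mtp_use_dedicated_embeddings")


def missing_mtp_fields(config):
    """Return the MTP field names that are absent (or null) in text_config."""
    text_config = config.get("text_config") or {}
    # A value of False is a valid setting, so only None counts as missing.
    return [field for field in MTP_FIELDS if text_config.get(field) is None]


# Reduced sample configs (assumption: real files contain many more keys).
bf16_config = json.loads(
    '{"text_config": {"mtp_num_hidden_layers": 1,'
    ' "mtp_use_dedicated_embeddings": false}}'
)
fp8_config = json.loads('{"text_config": {}}')

print(missing_mtp_fields(bf16_config))  # []
print(missing_mtp_fields(fp8_config))
# ['mtp_num_hidden_layers', 'mtp_use_dedicated_embeddings']
```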


Bug 3 — 0% MTP acceptance even after manual fix ❌

We manually copied model_mtp.safetensors from BF16
and added the missing config fields. vLLM now detects MTP:

"Detected MTP model. Sharing target model embedding weights"

But acceptance rate remains 0.000% across all positions:

SpecDecoding metrics:
  Total drafted:  200,000+ tokens
  Total accepted: 0 tokens
  Per-position:   0.000, 0.000, 0.000, 0.000
  Acceptance rate: 0.0%

Root cause: The MTP head was trained on BF16 hidden
states. FP8/W4A16 quantized weights produce different
hidden state distributions → MTP predictions are always
wrong.


Throughput Impact

Model                                       MTP Acceptance   Throughput
NuExtract3 BF16 + vLLM dynamic FP8 + MTP    88.1%            53,216 req/day
NuExtract3-FP8 + MTP (after manual fix)     0.0%             ~16,000 req/day
NuExtract3-W4A16 + MTP                      0.0%             ~12,506 req/day

BF16 + vLLM dynamic quantization delivers about 3.3x the
throughput of the official FP8 model (53,216 vs. ~16,000
req/day) because MTP works correctly with BF16 hidden states.
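
For reference, the ratio can be reproduced from the reported throughput figures. This is plain arithmetic on the numbers in this post (the FP8 and W4A16 values are approximate); `speedup` is a hypothetical helper.

```python
def speedup(baseline_req_per_day, other_req_per_day):
    """Ratio of baseline throughput to another configuration's throughput."""
    return baseline_req_per_day / other_req_per_day


# Throughput figures reported above, in requests per day.
bf16_dynamic_fp8 = 53216
official_fp8 = 16000   # approximate
official_w4a16 = 12506  # approximate

print(round(speedup(bf16_dynamic_fp8, official_fp8), 1))    # 3.3
print(round(speedup(bf16_dynamic_fp8, official_w4a16), 1))  # 4.3
```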


Suggested Fixes

  1. Include model_mtp.safetensors in FP8 and W4A16 releases
  2. Add MTP config fields to text_config in config.json
  3. Recalibrate MTP head after quantization — fine-tune
    the MTP head on the quantized model's hidden states so
    it learns the FP8/W4A16 activation distribution

Environment

GPU:    NVIDIA L4 (SM89, 24GB)
vLLM:   0.22.1
Models: numind/NuExtract3-FP8, numind/NuExtract3-W4A16

Also related to vLLM PR #45038 (FP8 KV crash fix)
which was filed from our vLLM issue #44879.

Happy to provide additional test data! 🚀

NuMind org

Good catch, it should be fixed now.
