GGUF
draft
speculative-decoding

Hello, I want to know: will using the draft model reduce the main model's generation quality?

#2
by lingyezhixing - opened

As I understand it, whether a drafted token is accepted depends on how similar the two models' probability distributions are, with the threshold generally set to 0.8~0.95. Won't this lead to a decline in model performance?

It won't make the main model choose different tokens, but it can end up giving worse performance in terms of tokens/s if the draft model doesn't predict the main model's output very well. It also requires you to use top_k = 1 or temperature = 0 to work properly.
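
To illustrate why the tokens don't change under greedy sampling, here is a minimal sketch in Python. It is not the actual llama.cpp code: `main`, `draft`, `toy_main`, `toy_draft`, and `n_draft` are all made-up names, and the "models" are toy functions that map a token prefix to a single greedy next token. A real implementation verifies all drafted positions in one batched pass of the main model, which is where the speedup comes from; the loop below only shows the accept/reject logic.

```python
from typing import Callable, List

# Hypothetical model interface: token prefix -> greedy (argmax) next token.
Model = Callable[[List[int]], int]


def greedy_decode(main: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding with the main model only."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(main(tokens))
    return tokens


def speculative_decode(main: Model, draft: Model, prompt: List[int],
                       n_new: int, n_draft: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the main model verifies."""
    tokens = list(prompt)
    target_len = len(prompt) + n_new
    while len(tokens) < target_len:
        # 1) The draft model guesses the next n_draft tokens.
        proposal: List[int] = []
        for _ in range(n_draft):
            proposal.append(draft(tokens + proposal))

        # 2) The main model checks each guess. Only the main model's own
        #    token is ever kept, so a wrong guess costs time, not quality.
        accepted: List[int] = []
        for guess in proposal:
            expected = main(tokens + accepted)
            accepted.append(expected)
            if guess != expected:
                break  # first mismatch: discard the remaining guesses
        else:
            # Every guess matched, so the main model adds one extra token.
            accepted.append(main(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:target_len]


# Toy stand-ins for real models (purely illustrative).
def toy_main(tokens: List[int]) -> int:
    return (sum(tokens) * 31 + len(tokens)) % 50


def toy_draft(tokens: List[int]) -> int:
    # Agrees with toy_main except at every third position.
    guess = toy_main(tokens)
    return (guess + 1) % 50 if len(tokens) % 3 == 0 else guess
```

Under these assumptions, `speculative_decode(toy_main, toy_draft, prompt, n)` returns the same list as `greedy_decode(toy_main, prompt, n)` for any draft model, even one that is always wrong. A bad draft only means more rejected guesses and therefore fewer tokens per main-model pass.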

I did read in another thread that this particular draft doesn't work that well for the GLM-4.5-Air model, likely because Air only has ~10B active parameters and the potential gain in tokens/s relies on the draft model being much smaller than the main model. The full GLM-4.5 model has 30B+ active parameters, so the draft is likely to work much better for it.
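
As a rough back-of-the-envelope sketch of why the size ratio matters, the snippet below uses the standard simplified estimate for speculative decoding. It assumes each drafted token is accepted independently with probability `accept_rate`, and that one draft-model step costs `cost_ratio` times one main-model step. Both parameters and the function name are hypothetical, and any values you plug in are illustrative, not measurements of GLM-4.5 or GLM-4.5-Air.

```python
def expected_speedup(accept_rate: float, n_draft: int, cost_ratio: float) -> float:
    """Estimated tokens/s gain over decoding with the main model alone.

    accept_rate: assumed probability that a drafted token is accepted (0..1)
    n_draft:     number of tokens drafted per verification step
    cost_ratio:  assumed cost of one draft step relative to one main step
    """
    if accept_rate >= 1.0:
        # Every guess is accepted, plus the main model's extra token.
        tokens_per_step = float(n_draft + 1)
    else:
        # Expected number of tokens produced per main-model verification pass.
        tokens_per_step = (1 - accept_rate ** (n_draft + 1)) / (1 - accept_rate)

    # Cost of one step: n_draft draft passes plus one main-model pass,
    # measured in units of a single main-model pass.
    cost_per_step = n_draft * cost_ratio + 1
    return tokens_per_step / cost_per_step


# Same assumed acceptance rate, two assumed draft/main cost ratios.
for ratio in (0.05, 0.30):
    print(f"cost_ratio={ratio:.2f} -> estimated speedup "
          f"{expected_speedup(0.7, 4, ratio):.2f}x")
```

With the acceptance rate held fixed, a larger `cost_ratio` (a draft that is not much cheaper than the main model) pushes the estimate toward 1x or below, which matches the explanation above for why the same draft may help the full model more than Air.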
