Generates nonsense if running latest vLLM with FlashInfer 0.4
#7
by stev236 - opened
Nice and super fast model. These hybrid models (Jamba, Granite 4 H, and Qwen3 Next) are clearly the future.
Unfortunately, the latest version of vLLM generates nonsense with this model when using the FlashInfer backend (0.4 and up).
Switching the backend to FLASH_ATTN solves the problem, but unfortunately that backend doesn't support KV-cache FP8 quantization.
See https://github.com/vllm-project/vllm/issues/26936
Anybody else noticed that?
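In case it helps, here is a minimal sketch of the workaround via the offline Python API (assumptions: the model path is a placeholder for this checkpoint, and the commented-out `kv_cache_dtype` line is the FP8 option that FLASH_ATTN doesn't support):

```python
import os

# Force the FlashAttention backend instead of FlashInfer, which currently
# produces garbage output with this model.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/this-model",  # placeholder for the checkpoint this thread is about
    # kv_cache_dtype="fp8",      # the FP8 KV cache that FLASH_ATTN doesn't support
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```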
Yeah, I have the same problem! It worked rather nicely before the latest vLLM container update from NVIDIA (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.11-py3). I'm using a DGX Spark.