Generates nonsense when running latest vLLM with FlashInfer 0.4

#7
by stev236 - opened

Nice and super fast model. These hybrid models (Jamba, Granite 4 H, and Qwen3 Next) are clearly the future.

Unfortunately, the latest version of vLLM generates nonsense with this model when using the FlashInfer backend (0.4 and up).

Switching the backend to FLASH_ATTN solves the problem, but unfortunately that backend doesn't support FP8 KV-cache quantization.
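For reference, a minimal sketch of the workaround, assuming a standard `vllm serve` launch (the model name below is a placeholder, not the actual repo):

```shell
# Force the FlashAttention backend instead of FlashInfer.
# Placeholder model path -- substitute the actual checkpoint.
VLLM_ATTENTION_BACKEND=FLASH_ATTN \
    vllm serve some-org/qwen3-next-model

# Note: with FLASH_ATTN you lose FP8 KV-cache quantization,
# so any --kv-cache-dtype fp8 flag has to be dropped.
```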

See https://github.com/vllm-project/vllm/issues/26936

Anybody else noticed that?

Yeah, I have the same problem! It worked rather nicely before the latest vLLM update from NVIDIA (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.11-py3). I'm using the DGX Spark.