Generates nonsense when running latest vLLM with FlashInfer 0.4

#7
by stev236 - opened

Nice and super fast model. These hybrid models (Jamba, Granite 4 H, and Qwen3 Next) are clearly the future.

Unfortunately, the latest version of vLLM generates nonsense with this model when using the FlashInfer backend (0.4 and up).

Switching the backend to FLASH_ATTN solves the problem, but unfortunately that backend doesn't support FP8 KV-cache quantization.
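For reference, a minimal sketch of the workaround, assuming a standard `vllm serve` launch (the model name below is a placeholder, not the actual repo):

```shell
# Force the FlashAttention backend instead of FlashInfer.
# Placeholder model path -- substitute the actual checkpoint.
VLLM_ATTENTION_BACKEND=FLASH_ATTN \
    vllm serve some-org/qwen3-next-model

# Note: with FLASH_ATTN you lose FP8 KV-cache quantization,
# so any --kv-cache-dtype fp8 flag has to be dropped.
```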

See https://github.com/vllm-project/vllm/issues/26936

Anybody else noticed that?

Yeah, I have the same problem! It worked rather nicely before the latest vLLM update from NVIDIA (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.11-py3). I'm using the DGX Spark.