This model was quantized on an RTX 5090 (SM_120). The parameter shapes/sizes are likely incompatible with other Blackwell chips on SM_100 (e.g. the B200), but I don't have one to verify.

TensorRT cannot load this model: it reports roughly 1,300 NaN parameters, which I believe is related to the vLLM-specific checkpoint format.
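If you want to scan a checkpoint for NaN parameters yourself, here is a minimal sketch. It assumes the shards have already been loaded into a dict of array-likes (e.g. via the `safetensors` library); `count_nan_params` is a hypothetical helper for illustration, not part of vLLM or TensorRT:

```python
import numpy as np

def count_nan_params(state_dict):
    """Return {tensor_name: nan_count} for every tensor containing NaNs.

    state_dict maps names to array-likes, e.g. loaded from a .safetensors
    shard. Packed U8 tensors cannot hold NaNs, so any offenders will be
    floating-point tensors (quantization scales, norms, embeddings).
    """
    bad = {}
    for name, tensor in state_dict.items():
        arr = np.asarray(tensor, dtype=np.float64)
        nan_count = int(np.isnan(arr).sum())
        if nan_count:
            bad[name] = nan_count
    return bad

# Example with a toy state dict:
toy = {
    "layer.scale": np.array([0.5, float("nan"), 0.25]),
    "layer.weight_packed": np.array([12, 255], dtype=np.uint8),
}
print(count_nan_params(toy))  # {'layer.scale': 1}
```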

There is an open issue for the 5090 with the CUTLASS flash-attention kernel, but once that is resolved the configuration below should work.

Run it with vLLM via Docker Compose like this:

```yaml
version: '3.8'

services:
  vllm-nvfp4:
    image: vllm/vllm-openai:nightly
    container_name: vllm-server-nvfp4
    ports:
      - "8005:8000"
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - VLLM_USE_TRTLLM_ATTENTION=1  # Massive prompt processing speedup
      - HF_TOKEN=${HF_TOKEN}  # Set your Hugging Face token in .env file
    volumes:
      - ./models:/models
      - ./cache:/root/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: [
      "--dtype", "bfloat16",
      "--enable-auto-tool-choice",  # Required for tool calling to work
      "--gpu-memory-utilization", "0.94",  # Higher utilization for better performance
      "--host", "0.0.0.0",
      "--kv-cache-dtype", "fp8",  # FP8 KV cache for memory optimization
      "--max-model-len", "40960",  # 40K context
      "--max-num-batched-tokens", "16384",  # Higher batched token processing
      "--max-num-seqs", "100",  # Should be enough to saturate a 5090
      "--model", "Bellesteck/Qwen3-30B-A3B-NVFP4-vLLM",  # Coding-optimized 30B compressed-tensors model
      "--port", "8000",
      "--quantization", "compressed-tensors",  # Model uses compressed-tensors quantization
      "--served-model-name", "gpt-5",  # Override model name for API compatibility
      "--tool-call-parser", "qwen3_coder"  # Tool call parser for function calling
    ]
    restart: unless-stopped
    shm_size: '2gb'
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
```
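Once the container is up, the server speaks the OpenAI-compatible API on host port 8005 under the served name `gpt-5`. A minimal sketch of building and sending a chat request follows; the URL and model name come from the compose file above, and the actual `requests.post` call is commented out so the snippet runs without a live server:

```python
import json

BASE_URL = "http://localhost:8005/v1"  # host port 8005 maps to container port 8000

def chat_request(prompt, model="gpt-5", max_tokens=256):
    """Build an OpenAI-compatible /chat/completions request body."""
    return {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = chat_request("Write a Python function that reverses a string.")
print(json.dumps(body, indent=2))

# To actually send it (requires the server to be running):
# import requests
# resp = requests.post(f"{BASE_URL}/chat/completions", json=body, timeout=120)
# print(resp.json()["choices"][0]["message"]["content"])
```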
Model size: 17B params (Safetensors; tensor types: F32, BF16, F8_E4M3, U8)