Qwen3-VL-32B-Instruct-EXL3-4.0bpw

ExLlamaV3 (EXL3) quantization of Qwen/Qwen3-VL-32B-Instruct, a vision-language model for multimodal tasks.

Quantization Details

  Parameter            Value
  Bits per Weight      4.0 bpw
  Head Bits            6 bpw
  Calibration Rows     128
  Calibration Context  4096 tokens
  Format               ExLlamaV3 (EXL3)
  Size                 ~19 GB
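
As a rough sanity check on the size figure, the arithmetic below reconstructs it from the bit rates in the table. The ~32B body and ~0.8B head parameter counts are illustrative assumptions, not values taken from this card.

# Back-of-envelope size check (parameter counts are assumptions, not from this card)
params_body = 32e9   # transformer body, 32B-class model
params_head = 0.8e9  # output head (~152k vocab x ~5k hidden), rough guess

body_gb = params_body * 4.0 / 8 / 1e9   # 4.0 bpw body
head_gb = params_head * 6.0 / 8 / 1e9   # 6 bpw head
print(f"body ~{body_gb:.1f} GB + head ~{head_gb:.1f} GB = ~{body_gb + head_gb:.1f} GB")
# ~16.6 GB; higher-precision embeddings, the vision tower and tensor metadata
# account for the remainder of the reported ~19 GB.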

Model Capabilities

  • Vision Understanding: Process images at various resolutions
  • Video Analysis: Frame-by-frame understanding
  • Context Window: Up to 128K tokens
  • Instruction Following: Fine-tuned for chat and task completion
  • Multilingual: Strong performance across languages

Hardware Requirements

  GPU         VRAM    Notes
  RTX 4090    24 GB   Good fit, comfortable with images
  RTX 3090    24 GB   Works well
  A100 40GB   40 GB   Plenty of headroom
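
To see why 24 GB is a comfortable fit at moderate context lengths, here is a minimal VRAM budget sketch. The layer and head counts are assumptions for illustration (check the model's config.json), and the Q4 cache is approximated at 0.5 bytes per element.

# Approximate VRAM budget (architecture values below are assumptions)
weights_gb   = 19.0     # quantized weights, from the table above
n_layers     = 64       # assumed
n_kv_heads   = 8        # assumed (GQA)
head_dim     = 128      # assumed
seq_len      = 16384    # matches max_seq_len in the TabbyAPI config below
bytes_per_el = 0.5      # Q4 KV cache, ignoring quantization scales

kv_gb = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el / 1e9
print(f"KV cache at {seq_len} tokens: ~{kv_gb:.1f} GB")      # ~1.1 GB
print(f"weights + KV cache: ~{weights_gb + kv_gb:.1f} GB")   # ~20 GB
# The remaining headroom on a 24 GB card covers activations and image embeddings.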

Use Cases

  • Live Assistant: Real-time screen understanding
  • Document Processing: Extract and analyze document content
  • Image Description: Detailed visual descriptions
  • Visual Coding: Understand code in screenshots
  • Chart/Graph Analysis: Interpret data visualizations

Usage with TabbyAPI

# config.yml
model:
  model_dir: models
  model_name: Qwen3-VL-32B-Instruct-EXL3-4.0bpw
  max_seq_len: 16384   # context length to allocate at load time
  cache_mode: Q4       # quantized KV cache to reduce VRAM usage

network:
  host: 0.0.0.0
  port: 5000
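
Once TabbyAPI is running with this config, requests can be sent through its OpenAI-compatible endpoint. The sketch below assumes the server is reachable on localhost:5000, that the API key placeholder matches whatever you configured, and that the /v1/chat/completions route accepts image_url content parts for vision models; example.png is a placeholder path.

# Minimal client sketch against TabbyAPI's OpenAI-compatible API (assumptions noted above)
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="your-api-key")

# Encode a local image as a data URL
with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen3-VL-32B-Instruct-EXL3-4.0bpw",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)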

Recommended Settings

  • Temperature: 0.7
  • Top-P: 0.8
  • Top-K: 20
  • Repetition Penalty: 1.05
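
These values can be applied per request. The sketch below builds on the client above, on the assumption that TabbyAPI reads top_k and repetition_penalty from extra fields in the request body (temperature and top_p are standard OpenAI parameters).

# Applying the recommended sampling settings (reuses `client` from the previous sketch)
response = client.chat.completions.create(
    model="Qwen3-VL-32B-Instruct-EXL3-4.0bpw",
    messages=[{"role": "user", "content": "Give a one-sentence summary of EXL3 quantization."}],
    temperature=0.7,
    top_p=0.8,
    extra_body={"top_k": 20, "repetition_penalty": 1.05},  # assumed pass-through fields
    max_tokens=128,
)
print(response.choices[0].message.content)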

Comparison with Thinking Variant

  Model              Best For
  This (Instruct)    Fast responses, direct answers, general tasks
  Thinking variant   Complex reasoning, step-by-step analysis

Original Model

This is a quantization of Qwen/Qwen3-VL-32B-Instruct. All credit for the base model goes to the Qwen team at Alibaba.

License

Apache 2.0 (inherited from base model)
