Qwen3-VL-32B-Instruct-EXL3-4.0bpw

ExLlamaV3 (EXL3) quantization of Qwen/Qwen3-VL-32B-Instruct, a vision-language model for multimodal tasks.

Quantization Details

  Parameter            Value
  Bits per Weight      4.0 bpw
  Head Bits            6 bpw
  Calibration Rows     128
  Calibration Context  4096 tokens
  Format               ExLlamaV3 (EXL3)
  Size                 ~19 GB
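
As a rough sanity check on the size figure, the arithmetic below reconstructs it from the bit rates in the table. The ~32B body and ~0.8B head parameter counts are illustrative assumptions, not values taken from this card.

# Back-of-envelope size check (parameter counts are assumptions, not from this card)
params_body = 32e9   # transformer body, 32B-class model
params_head = 0.8e9  # output head (~152k vocab x ~5k hidden), rough guess

body_gb = params_body * 4.0 / 8 / 1e9   # 4.0 bpw body
head_gb = params_head * 6.0 / 8 / 1e9   # 6 bpw head
print(f"body ~{body_gb:.1f} GB + head ~{head_gb:.1f} GB = ~{body_gb + head_gb:.1f} GB")
# ~16.6 GB; higher-precision embeddings, the vision tower and tensor metadata
# account for the remainder of the reported ~19 GB.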

Model Capabilities

  • Vision Understanding: Process images at various resolutions
  • Video Analysis: Frame-by-frame understanding
  • Context Window: Up to 128K tokens
  • Instruction Following: Fine-tuned for chat and task completion
  • Multilingual: Strong performance across languages

Hardware Requirements

  GPU         VRAM    Notes
  RTX 4090    24 GB   Good fit, comfortable with images
  RTX 3090    24 GB   Works well
  A100 40GB   40 GB   Plenty of headroom
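
To see why 24 GB is a comfortable fit at moderate context lengths, here is a minimal VRAM budget sketch. The layer and head counts are assumptions for illustration (check the model's config.json), and the Q4 cache is approximated at 0.5 bytes per element.

# Approximate VRAM budget (architecture values below are assumptions)
weights_gb   = 19.0     # quantized weights, from the table above
n_layers     = 64       # assumed
n_kv_heads   = 8        # assumed (GQA)
head_dim     = 128      # assumed
seq_len      = 16384    # matches max_seq_len in the TabbyAPI config below
bytes_per_el = 0.5      # Q4 KV cache, ignoring quantization scales

kv_gb = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el / 1e9
print(f"KV cache at {seq_len} tokens: ~{kv_gb:.1f} GB")      # ~1.1 GB
print(f"weights + KV cache: ~{weights_gb + kv_gb:.1f} GB")   # ~20 GB
# The remaining headroom on a 24 GB card covers activations and image embeddings.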

Use Cases

  • Live Assistant: Real-time screen understanding
  • Document Processing: Extract and analyze document content
  • Image Description: Detailed visual descriptions
  • Visual Coding: Understand code in screenshots
  • Chart/Graph Analysis: Interpret data visualizations

Usage with TabbyAPI

# config.yml
model:
  model_dir: models
  model_name: Qwen3-VL-32B-Instruct-EXL3-4.0bpw
  max_seq_len: 16384   # context length to allocate at load time
  cache_mode: Q4       # quantized KV cache to reduce VRAM usage

network:
  host: 0.0.0.0
  port: 5000
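
Once TabbyAPI is running with this config, requests can be sent through its OpenAI-compatible endpoint. The sketch below assumes the server is reachable on localhost:5000, that the API key placeholder matches whatever you configured, and that the /v1/chat/completions route accepts image_url content parts for vision models; example.png is a placeholder path.

# Minimal client sketch against TabbyAPI's OpenAI-compatible API (assumptions noted above)
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="your-api-key")

# Encode a local image as a data URL
with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen3-VL-32B-Instruct-EXL3-4.0bpw",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)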

Recommended Settings

  • Temperature: 0.7
  • Top-P: 0.8
  • Top-K: 20
  • Repetition Penalty: 1.05
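
These values can be applied per request. The sketch below builds on the client above, on the assumption that TabbyAPI reads top_k and repetition_penalty from extra fields in the request body (temperature and top_p are standard OpenAI parameters).

# Applying the recommended sampling settings (reuses `client` from the previous sketch)
response = client.chat.completions.create(
    model="Qwen3-VL-32B-Instruct-EXL3-4.0bpw",
    messages=[{"role": "user", "content": "Give a one-sentence summary of EXL3 quantization."}],
    temperature=0.7,
    top_p=0.8,
    extra_body={"top_k": 20, "repetition_penalty": 1.05},  # assumed pass-through fields
    max_tokens=128,
)
print(response.choices[0].message.content)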

Comparison with Thinking Variant

  Model              Best For
  This (Instruct)    Fast responses, direct answers, general tasks
  Thinking variant   Complex reasoning, step-by-step analysis

Original Model

This is a quantization of Qwen/Qwen3-VL-32B-Instruct. All credit for the base model goes to the Qwen team at Alibaba.

License

Apache 2.0 (inherited from base model)
