	Qwen3-4B-2507	Qwen3-1.7B	Qwen3.5-2B	Qwen3.5-0.8B
Non-Thinking Mode
MMLU-Pro	69.6	40.2	55.3	29.7
MMLU-Redux	84.2	64.4	69.2	48.5
C-Eval	80.2	61.0	65.2	46.4
SuperGPQA	42.8	21.0	30.4	16.9
IFEval	83.4	68.2	61.2	52.1
MMMLU	64.9	46.7	56.9	34.1
Knowledge & STEM (Thinking)
MMLU-Pro	74.0	56.5	66.5	42.3
MMLU-Redux	86.1	73.9	79.6	59.5
C-Eval	82.2	68.1	73.2	50.5
SuperGPQA	47.8	31.2	37.5	21.3
GPQA	65.8	40.1	51.6	11.9
Instruction Following (Thinking)
IFEval	87.4	72.5	78.6	44.0
IFBench	50.4	26.7	41.3	21.0
MultiChallenge	41.7	27.2	33.7	18.9
Long Context (Thinking)
AA-LCR	32.0	6.7	25.6	4.7
LongBench v2	42.8	26.5	38.7	26.1
Reasoning (Thinking)
HMMT Feb 25	57.5	10.2	22.9	--
HMMT Nov 25	69.6	8.9	19.6	--
General Agent (Thinking)
BFCL-V4	39.9	--	43.6	25.3
TAU2-Bench	43.2	--	48.8	11.6
Multilingualism (Thinking)
MMMLU	70.8	57.0	63.1	44.3
MMLU-ProX	62.4	49.4	52.3	34.6
NOVA-63	47.1	40.3	46.4	42.4
INCLUDE	64.4	51.8	55.4	40.6
Global PIQA	73.5	63.1	69.3	59.4
PolyMATH	46.2	25.2	26.1	8.2
WMT24++	58.9	39.3	45.8	27.2
MAXIFE	72.1	50.7	60.6	39.2

* TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.
* MMLU-ProX: we report the averaged accuracy on 29 languages.
* WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the averaged scores on 55 languages using XCOMET-XXL.
* MAXIFE: we report the accuracy on English + multilingual original prompts (23 settings in total).
* Experimental settings: top_p=0.95, top_k=20, presence_penalty=1.5, and temperature=1.0 were used.
* Empty cells (--) indicate scores not yet available or not applicable.
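
When reading the table above programmatically, the `--` placeholder needs to be handled before converting cells to numbers. The following is a minimal sketch, assuming the rows are tab-separated as shown; the helper name `parse_row` is hypothetical and not part of any released tooling.

```python
from typing import List, Optional, Tuple


def parse_row(line: str) -> Tuple[str, List[Optional[float]]]:
    """Split one tab-separated benchmark row into (benchmark name, scores)."""
    name, *cells = line.rstrip("\n").split("\t")
    # "--" marks a score that is not yet available or not applicable.
    scores = [None if cell.strip() == "--" else float(cell) for cell in cells]
    return name, scores


name, scores = parse_row("BFCL-V4\t39.9\t--\t43.6\t25.3")
print(name, scores)
```

Mapping `--` to `None` rather than `0.0` keeps missing results from being mistaken for a zero score when averaging or plotting.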

	Qwen3-VL-4B	Qwen3-VL-2B	Qwen3.5-2B	Qwen3.5-0.8B
STEM and Puzzle
MMMU	70.8	61.4	64.2/64.2	49.0/47.4
MMMU-Pro	57.0	42.5	50.3/47.7	31.2/31.4
MathVista (mini)	79.5	73.6	76.7/73.9	62.2/58.6
DynaMath	74.4	66.7	73.6/69.6	49.9/46.5
ZEROBench	0.0	0.0	1.0/0.0	0.0/0.0
ZEROBench_sub	18.9	13.2	17.1/18.6	12.9/11.4
VlmsAreBlind	68.6	50.0	75.8/74.3	59.4/57.3
General VQA
RealWorldQA	73.2	69.5	74.5/71.2	63.4/61.6
MMStar	73.2	68.1	71.7/68.0	58.3/55.9
MMBench_EN-DEV-v1.1	86.7	81.9	83.3/81.3	69.9/68.0
SimpleVQA	48.8	43.6	38.5/39.5	31.3/30.4
HallusionBench	64.1	54.9	58.0/51.3	53.1/46.7
Text Recognition and Document Understanding
MMLongBench-Doc	44.4	33.8	45.4/38.8	33.6/28.1
AI2D_TEST	84.9	80.4	83.3/81.5	69.9/68.7
CC-OCR	73.8	68.3	72.9/75.8	63.2/66.7
OmniDocBench1.5	80.0	65.9	79.8/80.9	61.0/70.6
CharXiv(RQ)	50.3	37.1	58.8/52.6	41.3/38.2
OCRBench	80.8	79.2	84.5/85.4	74.5/79.1
Spatial Intelligence
RefCOCO(avg)	88.2	84.8	84.8/84.3	79.3/77.8
CountBench	89.4	84.1	91.4/86.8	77.0/68.6
ODInW13	39.4	36.0	35.9/40.5	31.6/33.2
ERQA	47.3	41.8	43.8/33.0	34.5/23.8
EmbSpatialBench	80.7	75.9	77.9/66.4	68.6/54.6
RefSpatialBench	45.3	28.9	32.9/30.0	23.5/21.7
Hypersim	11.9	11.2	12.4/12.4	11.9/11.0
SUNRGBD	28.0	28.6	28.7/25.6	26.1/23.3
nuScenes	4.9	4.0	6.9/8.5	5.7/7.0
Video Understanding
VideoMME (w/ sub.)	76.0	67.9	75.6/--	63.8/--
VideoMME (w/o sub.)	68.9	62.1	69.0/--	57.7/--
VideoMMMU	69.4	54.1	62.1/--	44.3/--
MLVU	75.7	69.2	76.2/--	65.6/--
MVBench	69.3	64.5	64.9/--	55.8/--
LVBench	53.5	47.6	57.1/--	45.1/--
MMVU	58.6	48.9	48.6/--	34.3/--
Visual Agent
ScreenSpot Pro	59.5	48.5	--/54.5	--/46.5
Medical VQA
SLAKE	65.9	61.1	74.4/67.5	62.6/59.5
PMC-VQA	48.4	42.4	48.8/54.0	40.4/45.5
MedXpertQA-MM	26.3	13.0	26.9/19.1	17.1/25.3

* Scores of Qwen3.5 models are reported as Thinking / Non-thinking.
* MathVision: our model’s score is evaluated using a fixed prompt, e.g., “Please reason step by step, and put your final answer within \boxed{}.” For other models, we report the higher score between runs with and without the \boxed{} formatting.
* Experimental settings: For the Video benchmarks, we used top_p=0.95, top_k=20, presence_penalty=1.5, and temperature=1.0. All other benchmarks adopted the same sampling configuration but with temperature=0.6 under the thinking mode. Under the non-thinking mode, the sampling parameters were set to top_p=0.8, top_k=20, presence_penalty=1.5, and temperature=0.7.
* Empty cells (--) indicate scores not yet available or not applicable.
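
The sampling settings described in the notes above can be summarized as a small lookup. This is only a sketch: the function name `sampling_params` and its arguments are hypothetical, and applying the non-thinking settings to video benchmarks is an assumption, since no non-thinking video scores are reported.

```python
def sampling_params(thinking: bool, video: bool = False) -> dict:
    """Return the sampling settings stated in the notes for one evaluation run."""
    if not thinking:
        # Non-thinking mode. Assumption: the same values would apply to video
        # benchmarks, which the table does not report in this mode.
        return {
            "top_p": 0.8,
            "top_k": 20,
            "presence_penalty": 1.5,
            "temperature": 0.7,
        }
    # Thinking mode: video benchmarks use temperature 1.0, all others 0.6.
    return {
        "top_p": 0.95,
        "top_k": 20,
        "presence_penalty": 1.5,
        "temperature": 1.0 if video else 0.6,
    }


print(sampling_params(thinking=True, video=True))
print(sampling_params(thinking=True))
print(sampling_params(thinking=False))
```

The returned dictionary uses the parameter names from the notes; how they are passed to an inference engine depends on the serving stack and is left out here.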