# Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning
## Overview
Current multimodal reasoning models face a critical dilemma: they often "overthink" simple tasks (wasting computation) and lose general capability when optimized for reasoning.
We introduce Metis-HOME (Hybrid Optimized Mixture-of-Experts), a novel framework that enables a "Hybrid Thinking" paradigm. By structuring the original dense model (Qwen2.5-VL-7B) into two distinct expert branches controlled by a lightweight router (a Thinking Expert for complex reasoning and a Non-Thinking Expert for rapid inference), Metis-HOME effectively resolves the reasoning-vs-generalization trade-off.
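The routing can be pictured with a minimal PyTorch sketch; all names below are illustrative assumptions, and the released implementation's router granularity and expert internals may differ:

```python
import torch
import torch.nn as nn

class HybridThinkingLayer(nn.Module):
    """Two-expert hybrid layer: a Thinking Expert for deliberate reasoning
    and a Non-Thinking Expert for rapid inference, selected per query by a
    lightweight router. Expert internals mirror a standard SiLU MLP block."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.thinking_expert = nn.Sequential(      # "System 2": slow, deliberative
            nn.Linear(hidden_size, intermediate_size),
            nn.SiLU(),
            nn.Linear(intermediate_size, hidden_size),
        )
        self.non_thinking_expert = nn.Sequential(  # "System 1": fast, intuitive
            nn.Linear(hidden_size, intermediate_size),
            nn.SiLU(),
            nn.Linear(intermediate_size, hidden_size),
        )
        # Lightweight trainable router: two logits (think / no-think) per query.
        self.router = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Route once per query from the pooled prompt representation.
        pooled = hidden_states.mean(dim=1)           # (batch, hidden)
        choice = self.router(pooled).argmax(dim=-1)  # 0 = think, 1 = fast
        # Both branches are evaluated here for clarity; a real implementation
        # would dispatch each query only to its selected expert.
        return torch.where(
            choice.view(-1, 1, 1) == 0,
            self.thinking_expert(hidden_states),
            self.non_thinking_expert(hidden_states),
        )
```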
## Highlights
- **Hybrid Thinking Paradigm:** Explicitly decouples "System 1" (fast, intuitive) and "System 2" (slow, deliberative) reasoning within a unified multimodal MoE architecture.
- **Router Mechanism:** A lightweight, trainable router dynamically allocates queries based on complexity, avoiding computational waste on simple tasks such as OCR or captioning.
- **Performance:**
  - +6.9% improvement on reasoning benchmarks (MathVista, etc.) compared to the baseline.
  - ~1% gain on general benchmarks, reversing the degradation trend observed in other reasoning-specialized models.
- **Efficient Training:** A multi-stage strategy combining Reinforcement Learning (RL) for reasoning enhancement and Mixed Supervised Fine-Tuning (SFT) for expert specialization (sketched below).
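As a rough illustration of the Mixed SFT stage, the sketch below pairs chain-of-thought targets with thinking-expert routing labels and direct answers with non-thinking labels; the `<think>` tag and label convention are our assumptions, not a confirmed data format:

```python
from typing import Optional

# Illustrative construction of one mixed-SFT sample. The <think> tag and the
# 0/1 routing labels are our assumptions, not a confirmed data format.
def build_sample(query: str, answer: str, rationale: Optional[str] = None) -> dict:
    if rationale is not None:
        # Thinking-expert sample: chain of thought precedes the final answer.
        target = f"<think>{rationale}</think>{answer}"
        route_label = 0  # supervise the router toward the Thinking Expert
    else:
        # Non-thinking sample: answer directly (e.g., OCR, captioning queries).
        target = answer
        route_label = 1  # supervise the router toward the Non-Thinking Expert
    return {"prompt": query, "target": target, "route_label": route_label}
```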
## Results
### Thinking Ratio
As shown in the following figure, the thinking ratio analysis of Metis-HOME reveals adaptive routing behavior:
- High ratios (78%β98%) on reasoning-heavy benchmarks (WeMath, MathVision, etc.), indicating effective use of the thinking expert for multi-step inference.
- Low ratios (2%β5%) on general benchmarks (MMBench, OCRBench), showing preference for the non-thinking expert.
This aligns with our design: deliberate reasoning for complex tasks, fast inference for simple ones, optimizing computational efficiency.
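The thinking ratio itself is simply the fraction of queries the router sends to the Thinking Expert; a hypothetical helper for computing it:

```python
# Hypothetical helper: fraction of queries a benchmark routes to the thinking
# expert (0 = thinking, 1 = non-thinking, matching the sketches above).
def thinking_ratio(routing_decisions: list) -> float:
    return sum(1 for d in routing_decisions if d == 0) / len(routing_decisions)

# e.g., ~0.78-0.98 expected on math benchmarks, ~0.02-0.05 on OCRBench.
```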
### Benchmarks
| Model | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Reasoning Avg. | General Avg. |
|---|---|---|---|---|---|---|---|---|
| *Proprietary Models* | | | | | | | | |
| Gemini-2.0-Pro | 71.3 | 48.1 | 67.3 | 43.3 | 56.5 | 53.2 | 56.6 | 73.3 |
| Gemini-2.0-Flash | 70.4 | 43.6 | 47.8 | 42.1 | 47.4 | 52.3 | 50.6 | 72.6 |
| Claude 3.7 Sonnet | 66.8 | 41.9 | 46.7 | 39.7 | 49.3 | 58.2 | 50.4 | 70.1 |
| ChatGPT-4o | 60.0 | 31.2 | 40.6 | 34.5 | 45.8 | 52.8 | 44.2 | 72.0 |
| *Open-source Models* | | | | | | | | |
| LLaVA-OneVision-72B | 67.1 | 25.3 | 27.2 | 15.6 | 32.0 | 40.9 | 34.7 | 68.0 |
| Kimi-VL-A3B-Instruct | 66.0 | 21.8 | 34.1 | 18.0 | 32.3 | 42.7 | 35.8 | 69.1 |
| InternVL3-8B | 70.5 | 30.0 | 38.5 | 25.7 | 39.5 | 44.5 | 41.4 | 73.6 |
| VL-Rethinker-7B | 75.5 | 29.3 | 47.2 | 25.4 | 37.8 | 47.0 | 43.7 | 68.3 |
| Metis-RISE-7B | 75.8 | 28.7 | 51.0 | 27.7 | 45.2 | 49.7 | 46.4 | 68.4 |
| Baseline | 67.4 | 26.2 | 41.1 | 20.2 | 34.5 | 45.6 | 39.2 | 70.3 |
| Baseline+RL | 72.8 | 28.7 | 46.8 | 26.2 | 43.3 | 46.5 | 44.0 | 67.2 |
| Metis-HOME | 76.0 | 29.5 | 47.7 | 26.4 | 45.6 | 51.5 | 46.1 | 71.2 |
## Usage Example
You can run the demo inference script in the examples folder:

```bash
python examples/demo_inference.py
```
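Alternatively, a minimal programmatic sketch (the model path is a placeholder, and loading through the generic `transformers` Auto classes with `trust_remote_code` is an assumption; the demo script above remains the authoritative entry point):

```python
# Minimal inference sketch. The model path is a placeholder and the generic
# Auto classes with trust_remote_code are an assumption.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "path/to/Metis-HOME"  # placeholder: local path or hub id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

image = Image.open("example.jpg")  # placeholder image
prompt = "Solve the geometry problem shown in the image."
# Depending on the processor, the prompt may first need
# processor.apply_chat_template to insert image tokens.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```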
## Acknowledgement
We sincerely thank LLaMA-Factory and MM-EUREKA for providing the reference training frameworks.
## Citation

```bibtex
@article{lan2025metis,
  title={Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning},
  author={Lan, Xiaohan and Liu, Fanfan and Qiu, Haibo and Yang, Siqi and Ruan, Delian and Shi, Peng and Ma, Lin},
  journal={arXiv preprint arXiv:2510.20519},
  year={2025}
}
```