TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts
Abstract
A novel framework injects semantic intent into Mixture-of-Experts routing for image generation and editing, resolving task interference through hierarchical task annotation and predictive alignment regularization.
Unified image generation and editing models suffer from severe task interference in dense diffusion transformers architectures, where a shared parameter space must compromise between conflicting objectives (e.g., local editing v.s. subject-driven generation). While the sparse Mixture-of-Experts (MoE) paradigm is a promising solution, its gating networks remain task-agnostic, operating based on local features, unaware of global task intent. This task-agnostic nature prevents meaningful specialization and fails to resolve the underlying task interference. In this paper, we propose a novel framework to inject semantic intent into MoE routing. We introduce a Hierarchical Task Semantic Annotation scheme to create structured task descriptors (e.g., scope, type, preservation). We then design Predictive Alignment Regularization to align internal routing decisions with the task's high-level semantics. This regularization evolves the gating network from a task-agnostic executor to a dispatch center. Our model effectively mitigates task interference, outperforming dense baselines in fidelity and quality, and our analysis shows that experts naturally develop clear and semantically correlated specializations.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation (2025)
- 3SGen: Unified Subject, Style, and Structure-Driven Image Generation with Adaptive Task-specific Memory (2025)
- Unified Personalized Understanding, Generating and Editing (2026)
- Loom: Diffusion-Transformer for Interleaved Generation (2025)
- Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach. (2025)
- MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation (2025)
- WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper