Xoron-Dev is a unified, multimodal AI model designed to understand and generate text, images, video, and audio within a single architecture. It leverages a Mixture of Experts (MoE) backbone with DeepSeek-style shared expert isolation and integrates SOTA encoders (SigLIP-2 with TiTok + Dual-Stream Attention) and generators (MoE-DiT with Flow Matching) for comprehensive any-to-any capabilities.
## Model Highlights
- **Architecture:** Mixture of Experts (8 experts + 1 shared, top-2 routing) with Ring Attention and Aux-Lossless routing.
- **Continuous-Scale Training:** An adaptive strategy samples any scale within the listed ranges: images 128-384px, videos 128-320px, frames 8-24.
- **Vision Encoder:** SigLIP-2 (384px native) with TiTok-style 1D tokenization (256 compressed tokens), Dual-Stream Attention (2 layers), and 2D-RoPE for images; 3D-RoPE + VidTokTokenizer (full 3D VAE with 4x8x8 compression) + Temporal MoE (4 experts) for video (8-24 frames).
- **Image Generation:** MoE-DiT (Diffusion Transformer with 4 MoE experts) using Flow Matching, 2D-RoPE, and Symmetric Dual-Stream Attention (SD3/Flux-style). Multi-scale output: 192-384px, 50 inference steps.
- **Video Generation:** 3D Causal Transformers (4 layers) with Flow Matching, 3D-RoPE for (x, y, t) positions, and Temporal Expert Routing (4 experts). Multi-scale: 8-24 frames @ 128-320px.
- **Audio (Speech-to-Speech):** Conformer encoder with RMLA and a Raw Waveform Tokenizer for ASR; direct waveform decoder (no vocoder needed!) with MAS for TTS; Zero-Shot Speaker Cloning with In-Context Audio Prompting. Talk to it, and it talks back!
- **Agentic:** Trained for tool calling, file operations, and code execution with uncertainty estimation.
- **Context:** Efficient 128K context using Ring Attention (4096 chunk size).
- **Fine-tuning:** LoRA variants including rsLoRA, DoRA, and LoRA+ (r=32, α=64, 4x B-matrix learning rate); see the optimizer sketch below this list.
- **Performance:** Flash Attention support with FP16-native numerical stability.
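
The LoRA+ entry above gives the B matrices a 4x higher learning rate than the A matrices. A minimal sketch, assuming a PyTorch model whose adapters expose parameters named `lora_A`/`lora_B` (those names and the base learning rate are illustrative, not Xoron-Dev's actual training code):

```python
import torch

def build_lora_plus_optimizer(model, base_lr=1e-4, b_lr_ratio=4.0):
    """LoRA+ style optimizer groups: B matrices get a boosted learning rate.
    Assumes adapter parameters are named `lora_A...` / `lora_B...` (hypothetical)."""
    a_params, b_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "lora_B" in name:
            b_params.append(param)   # B matrices: 4x learning rate (LoRA+)
        elif "lora_A" in name:
            a_params.append(param)   # A matrices: base learning rate
    return torch.optim.AdamW([
        {"params": a_params, "lr": base_lr},
        {"params": b_params, "lr": base_lr * b_lr_ratio},
    ])
```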
## Architecture Deep Dive

### LLM Backbone (MoE)
| Component | Specification |
|---|---|
| Hidden Size | 1024 |
| Layers | 12 |
| Attention Heads | 16 |
| MoE Experts | 8 + 1 shared (DeepSeek-style isolation) |
| Experts per Token | 2 (top-2 routing) |
| MoE Layer Frequency | Every 2 layers |
| Routing | Aux-Lossless MoE routing |
| Context Length | 128K positions |
| Attention | Ring Attention (4096 chunk) + Flash Attention |
| Tokenizer | Qwen2.5 (151,643 vocab) |
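
A minimal sketch of the routing scheme in the table: every token passes through one always-on shared expert plus its top-2 of the 8 routed experts (DeepSeek-style shared expert isolation). The expert FFN shape, class name, and gating details are assumptions for illustration, not the model's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    """Top-2 routing over 8 experts plus a shared expert (illustrative sketch)."""
    def __init__(self, hidden=1024, n_experts=8, top_k=2, ffn_mult=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        make_ffn = lambda: nn.Sequential(
            nn.Linear(hidden, hidden * ffn_mult), nn.GELU(),
            nn.Linear(hidden * ffn_mult, hidden))
        self.experts = nn.ModuleList(make_ffn() for _ in range(n_experts))
        self.shared_expert = make_ffn()   # sees every token, bypasses the router

    def forward(self, x):                              # x: (tokens, hidden)
        gates = F.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)  # top-2 routing
        weights = weights / weights.sum(dim=-1, keepdim=True)
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[:, k] == e                  # tokens routed to expert e
                if mask.any():
                    routed[mask] += weights[mask, k, None] * expert(x[mask])
        return self.shared_expert(x) + routed
```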
### Vision Encoder (SigLIP-2 + SOTA Extensions)

| Feature | Description |
|---|---|
| Base Model | google/siglip-so400m-patch14-384 |
| Input Resolution | 384×384 |
| TiTok Tokenization | 1D tokenization with 256 compressed tokens |
| Dual-Stream Attention | 2 symmetric dual-stream layers |
| Position Encoding | 2D-RoPE |
| Output Tokens | 64 tokens per image |
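
A rough sketch of the TiTok-style compression path the table describes: a fixed set of learnable 1D latent tokens cross-attends to the SigLIP-2 patch grid, and the result is reduced to the 64 tokens handed to the LLM. Only the token counts (256 compressed, 64 output) come from the table; the width, pooling, and module names below are assumptions:

```python
import torch
import torch.nn as nn

class TiTokStyleCompressor(nn.Module):
    """Illustrative 1D tokenizer: patch grid -> 256 latent tokens -> 64 output tokens."""
    def __init__(self, dim=1152, n_latents=256, n_out=64, n_heads=16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.pool = nn.Linear(n_latents, n_out)          # 256 -> 64 tokens

    def forward(self, patch_tokens):                      # (B, n_patches, dim)
        q = self.latents.unsqueeze(0).expand(patch_tokens.shape[0], -1, -1)
        z, _ = self.cross_attn(q, patch_tokens, patch_tokens)   # (B, 256, dim)
        return self.pool(z.transpose(1, 2)).transpose(1, 2)     # (B, 64, dim)
```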
### Video Encoder (3D Causal Transformers + VidTok)

| Feature | Description |
|---|---|
| Frame Range | 8-24 frames (continuous-scale) |
| Resolution Range | 128-320px (continuous-scale) |
| Position Encoding | 3D-RoPE for (x, y, t) coordinates |
| VidTokTokenizer | Full 3D VAE (Microsoft VidTok architecture) |
| Compression | 4x temporal, 8x8 spatial (4x8x8 total) |
| Architecture | 2D+1D efficient design with AlphaBlender |
| Quantization | Continuous (KL) or Discrete (FSQ) |
| Attention | 3D Causal Self-Attention |
| Expert Routing | Temporal MoE (4 experts, temporally aware) |
| Encoder Layers | 4 layers |
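
The 4x8x8 compression in the table translates directly into latent grid sizes; a small arithmetic sketch (padding and the causal handling of the first frame are ignored here):

```python
def vidtok_latent_shape(frames, height, width, t_stride=4, s_stride=8):
    """Latent grid under 4x temporal and 8x8 spatial compression (sketch only)."""
    return (frames // t_stride, height // s_stride, width // s_stride)

# A 16-frame 256x256 clip compresses to a (4, 32, 32) grid = 4096 latents.
print(vidtok_latent_shape(16, 256, 256))
```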
### Image Generation (MoE-DiT + Flow Matching)

| Feature | Description |
|---|---|
| Architecture | MoE-DiT (Diffusion Transformer with MoE) |
| Scheduler | Flow Matching (not DDPM) |
| Output Resolution | 192-384px (continuous-scale, step=32) |
| Position Encoding | 2D-RoPE |
| Attention | Symmetric Dual-Stream Attention (SD3/Flux-style) |
| MoE Experts | 4 experts in DiT blocks |
| Inference Steps | 50 steps |
| Guidance Scale | 7.5 (CFG) |
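
A minimal Euler sampler for the Flow Matching setup above, using the listed 50 steps and CFG scale of 7.5. The `model(x, t, cond)` velocity-prediction signature and the t: 0 to 1 noise-to-data convention are assumptions, not the project's actual API:

```python
import torch

@torch.no_grad()
def flow_matching_sample(model, cond, uncond, shape, steps=50, cfg_scale=7.5):
    """Sketch: integrate the predicted velocity field from noise to image."""
    x = torch.randn(shape)                        # pure noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        v_cond = model(x, t, cond)                # conditional velocity
        v_uncond = model(x, t, uncond)            # unconditional velocity
        v = v_uncond + cfg_scale * (v_cond - v_uncond)   # classifier-free guidance
        x = x + v * dt                            # Euler step toward the data
    return x
```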
### Video Generation (3D Causal + Flow Matching)

| Feature | Description |
|---|---|
| Output Resolution | 128-320px (continuous-scale, step=32) |
| Output Frames | 8-24 frames (continuous-scale, step=4) |
| Scheduler | Flow Matching |
| Position Encoding | 3D-RoPE for (x, y, t) |
| Attention | Factorized Spatial-Temporal (3D Causal) |
| Expert Routing | Temporal MoE (4 experts) |
| Guidance Scale | 7.5 (CFG) |
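
The 3D-RoPE row above gives every video latent an (x, y, t) position. A small sketch of how those indices could be laid out over the latent grid (the per-axis rotary frequencies are omitted; this only builds the position ids):

```python
import torch

def rope_3d_positions(frames, height, width):
    """(x, y, t) position ids for a T x H x W latent grid (illustrative)."""
    t = torch.arange(frames)
    y = torch.arange(height)
    x = torch.arange(width)
    tt, yy, xx = torch.meshgrid(t, y, x, indexing="ij")
    return torch.stack([xx, yy, tt], dim=-1).reshape(-1, 3)   # (T*H*W, 3)

# An 8-frame 16x16 latent grid yields 2048 tokens, each with (x, y, t) ids.
print(rope_3d_positions(8, 16, 16).shape)   # torch.Size([2048, 3])
```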
### Continuous-Scale Training Configuration

| Type | Range | Base | Step |
|---|---|---|---|
| Image | 128-384px | 256px | 32px |
| Video | 128-320px | 192px | 32px |
| Frames | 8-24 | 16 | 4 |
Continuous-scale training is enabled by default with an adaptive strategy that dynamically adjusts scale ranges based on OOM history for optimal memory usage.
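
A toy sketch of what the adaptive sampler could look like: scales are drawn from a step-aligned grid, and the ceiling backs off after OOM events. The specific backoff rule (one step per recorded OOM, never below the base scale) is an assumption for illustration:

```python
import random

def sample_scale(lo=128, hi=384, base=256, step=32, oom_count=0):
    """Pick a training resolution from [lo, hi] on a `step` grid (sketch).
    The ceiling is lowered by one step per recorded OOM (assumed rule)."""
    hi = max(base, hi - oom_count * step)     # back off the ceiling after OOMs
    return random.choice(range(lo, hi + 1, step))

print(sample_scale())              # e.g. 224
print(sample_scale(oom_count=3))   # ceiling reduced from 384 to 288
```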
### Audio (Speech-to-Speech with RMLA + MAS + Zero-Shot Cloning)

| Feature | Description |
|---|---|
| Sample Rate | 16kHz |
| Encoder (ASR) | Raw Waveform Tokenizer → Conformer blocks with RMLA |
| Waveform Decoder | BigVGAN-style with Snake activation + MRF (no external vocoder!) |
| KV Compression | LoRA-style KV compression (rank 256) |
| Decoder Alignment | MAS (Monotonic Alignment Search) for text-to-audio alignment |
| Voice Cloning | Zero-Shot Speaker Cloning with speaker embedding (256-dim) |
| In-Context Prompting | Enabled for voice cloning from reference audio |
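
A rough sketch of the zero-shot cloning path implied by the table: a 256-dim speaker embedding is pooled from reference-audio features and prepended to the TTS decoder input as an in-context prompt (the full pipeline would also condition on the reference audio tokens themselves). All module names and widths other than the 256-dim embedding are assumptions:

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Illustrative speaker conditioning: pool a 256-dim speaker vector and
    prepend it to the TTS input sequence as an in-context prompt token."""
    def __init__(self, feat_dim=512, spk_dim=256, model_dim=1024):
        super().__init__()
        self.spk_proj = nn.Linear(feat_dim, spk_dim)    # -> 256-dim speaker embedding
        self.to_model = nn.Linear(spk_dim, model_dim)

    def forward(self, ref_features, text_embeds):
        # ref_features: (B, T_ref, feat_dim); text_embeds: (B, T_text, model_dim)
        spk = self.spk_proj(ref_features.mean(dim=1))        # (B, 256)
        spk_token = self.to_model(spk).unsqueeze(1)          # (B, 1, model_dim)
        return torch.cat([spk_token, text_embeds], dim=1)    # prompt-conditioned input
```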
### Waveform Decoder (SOTA BigVGAN-style)

Direct audio output without external vocoder:

| Feature | Description |
|---|---|
| Architecture | BigVGAN/HiFi-GAN style with transposed convolutions |
| Upsampling | 256x total (rates: 8, 8, 2, 2) from features to 16kHz audio |
| Streaming | stream_decode() for low-latency real-time output |
| Output Range | [-1, 1] normalized waveform via tanh |
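
A sketch of the upsampling stack the table implies: transposed convolutions with rates (8, 8, 2, 2) multiply to 256x, so each feature frame becomes 256 samples of 16 kHz audio, squashed to [-1, 1] by a final tanh. Channel widths and kernel sizes are illustrative, and BigVGAN's Snake activations and MRF blocks are omitted for brevity:

```python
import torch.nn as nn

class WaveformUpsampler(nn.Module):
    """Illustrative BigVGAN/HiFi-GAN style upsampler: features -> raw waveform."""
    def __init__(self, in_ch=512, rates=(8, 8, 2, 2)):
        super().__init__()
        layers, ch = [], in_ch
        for r in rates:
            layers += [nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * r,
                                          stride=r, padding=r // 2),
                       nn.LeakyReLU(0.1)]
            ch //= 2
        layers += [nn.Conv1d(ch, 1, kernel_size=7, padding=3), nn.Tanh()]  # [-1, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, features):        # (B, in_ch, T_frames)
        return self.net(features)       # (B, 1, T_frames * 256) at 16 kHz
```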
## Training Data
Xoron-Dev is trained on a massive, curated mix of open-source Hugging Face datasets and specialized synthetic data generated to enhance agentic capabilities and reduce hallucinations.
### Open Source Datasets
We utilize over 50 high-quality datasets from Hugging Face, categorized by modality:
- **Text & Code:** Includes Code-Feedback, HumanEvalPack, OpenOrca, and AgentInstruct for robust coding and reasoning capabilities.
- **Tool Use:** Datasets like Function-Calling-ChatML, Synth-APIGen, and Tool-Calls-MultiTurn enable precise tool invocation across single and multi-turn interactions.
- **Vision (Image/Video):** Visual understanding is grounded in ScienceQA, Video-MME, and VideoInstruct-100K.
- **Generation:** Text-to-Image/Video capabilities are fine-tuned on Stable-Diffusion-Prompts, Sora-Likert-Scoring datasets by Rapidata, and WebVid-10M.
- **Audio:** Speech tasks are powered by LibriSpeech, LibriTTS-R, and HiFi-TTS.
### Synthetic Data Pipeline

To bridge the gap between general knowledge and actionable agentic behavior, we generate extensive synthetic datasets locally using our custom synth engine. These datasets focus on complex behaviors often missing from public corpora:
| Category | Description |
|---|---|
| Anti-Hallucination | Training the model to say "I don't know" (Synth-IDK), verify facts (Synth-FactCheck), provide citations (Synth-Citation), express uncertainty (Synth-Uncertainty), and ground responses (Synth-GroundedResponse). |
| System Administration | Simulated environments for Docker setup, SSH configuration, database management, and package installation (Synth-AptInstall). |
| Code Execution | Traces of code execution including shell errors, timeouts, and multi-step debugging workflows to teach the model how to recover from errors. |
| Git Operations | Simulated version control tasks including committing, handling diffs, resolving merge conflicts, and repository context understanding. |
| Chain-of-Thought | Explicit Synth-CoT data to encourage internal reasoning before generating final answers. |
| File Operations | Document handling, FIM (Fill-in-Middle), and edit operations for precise file manipulation. |