Any-to-Any
Transformers
Safetensors
English
xoron
multimodal
Mixture of Experts
text-to-image
image editing
image to video
text-to-video
video editing
text-to-speech
speech-to-text
speech-to-speech
image-to-text
video-to-text
agentic
tool-use
flow-matching
3d-rope
titok
vidtok
dual-stream-attention
zero-shot-voice-cloning
bigvgan
snake-activation
multi-receptive-field-fusion
custom_code
Update README.md
</div>

# Xoron-Dev

**Xoron-Dev** is a unified, multimodal AI model designed to understand and generate text, images, video, and audio within a single architecture. It leverages a **Mixture of Experts (MoE)** backbone with DeepSeek-style shared expert isolation and integrates SOTA encoders (SigLIP-2 with TiTok + Dual-Stream Attention) and generators (MoE-DiT with Flow Matching) for comprehensive any-to-any capabilities.

## Model Highlights

* **Architecture:** Mixture of Experts (8 experts + 1 shared, top-2 routing) with Ring Attention and Aux-Lossless routing.
* **Continuous-Scale Training:** An adaptive strategy samples any scale within each range: images (128-384px), videos (128-320px), frames (8-24).
* **Vision Encoder:** SigLIP-2 (384px native) with **TiTok-style 1D tokenization** (256 compressed tokens), **Dual-Stream Attention** (2 layers), and **2D-RoPE** for images; **3D-RoPE** + **VidTokTokenizer** (full 3D VAE with 4x8x8 compression) + **Temporal MoE** (4 experts) for video (8-24 frames).
* **Image Generation:** **MoE-DiT** (Diffusion Transformer with 4 MoE experts) using **Flow Matching**, **2D-RoPE**, and **Symmetric Dual-Stream Attention** (SD3/Flux-style). Multi-scale output: 192-384px, 50 inference steps.
* **Video Generation:** **3D Causal Transformers** (4 layers) with **Flow Matching**, **3D-RoPE** for (x, y, t) positions, and **Temporal Expert Routing** (4 experts). Multi-scale: 8-24 frames @ 128-320px.
* **Audio (Speech-to-Speech):** **Conformer encoder with RMLA** and **Raw Waveform Tokenizer** for ASR; **direct waveform decoder** (no vocoder needed) with **MAS** for TTS; **Zero-Shot Speaker Cloning** with in-context audio prompting. Talk to it, and it talks back.
* **Agentic:** Trained for tool calling, file operations, and code execution with uncertainty estimation.
* **Context:** Efficient 128K context using Ring Attention (4096 chunk size).
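The shared-expert routing described in the first bullet can be sketched as follows. This is a minimal illustration of top-2 routing with an always-active shared expert (DeepSeek-style shared expert isolation); the `experts`, `shared_expert`, and `router_w` shapes are illustrative stand-ins, not Xoron-Dev's actual modules:

```python
import numpy as np

def moe_forward(x, experts, shared_expert, router_w, k=2):
    """Top-k MoE with a shared expert that processes every token.
    Illustrative sketch only; shapes and modules are assumptions."""
    logits = x @ router_w                        # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)        # softmax over experts
    topk = np.argsort(probs, axis=-1)[:, -k:]    # top-k expert ids per token
    out = shared_expert(x)                       # shared expert sees all tokens
    for t in range(x.shape[0]):
        w = probs[t, topk[t]]
        w = w / w.sum()                          # renormalize the top-k weights
        for weight, e in zip(w, topk[t]):
            out[t] += weight * experts[e](x[t])  # routed experts, weighted sum
    return out
```

Only the k routed experts run per token, so compute stays near-constant as the expert count grows, while the shared expert captures common knowledge.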
| Position Encoding | 2D-RoPE |
| Output Tokens | 64 tokens per image |

### Video Encoder (3D Causal Transformers + VidTok)

| Feature | Description |
|---------|-------------|
| Frame Range | 8-24 frames (continuous-scale) |
| Resolution Range | 128-320px (continuous-scale) |
| Position Encoding | **3D-RoPE** for (x, y, t) coordinates |
| VidTokTokenizer | Full 3D VAE (Microsoft VidTok architecture) |
| Compression | 4x temporal, 8x8 spatial (4x8x8 total) |
| Architecture | 2D+1D efficient design with AlphaBlender |
| Quantization | Continuous (KL) or Discrete (FSQ) |
| Attention | 3D Causal Self-Attention |
| Expert Routing | **Temporal MoE** (4 experts, temporally-aware) |
| Encoder Layers | 4 layers |
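3D-RoPE generalizes rotary position embeddings by giving each of the (x, y, t) axes its own slice of the head dimension, so attention scores depend on relative offsets in all three coordinates. A minimal sketch, assuming an equal three-way split of the head dimension (Xoron-Dev's exact layout is not documented here):

```python
import numpy as np

def rope_3d(q, x, y, t, base=10000.0):
    """Rotate equal thirds of the head dim by the x, y, and t coordinates.
    The equal split is an assumption for illustration."""
    d = q.shape[-1]
    assert d % 6 == 0, "need an even number of dims per axis"
    per_axis = d // 3
    out = np.empty_like(q)
    for g, pos in enumerate((x, y, t)):
        sl = slice(g * per_axis, (g + 1) * per_axis)
        half = per_axis // 2
        freqs = base ** (-np.arange(half) / half)   # per-pair frequencies
        angle = pos * freqs
        q1, q2 = q[..., sl][..., :half], q[..., sl][..., half:]
        out[..., sl] = np.concatenate(
            [q1 * np.cos(angle) - q2 * np.sin(angle),
             q1 * np.sin(angle) + q2 * np.cos(angle)], axis=-1)
    return out
```

Because each slice is a pure rotation, vector norms are preserved and relative positions fall out of the query-key dot product, exactly as in 1D RoPE.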
### Image Generation (MoE-DiT + Flow Matching)

| Feature | Description |
|---------|-------------|
| Architecture | **MoE-DiT** (Diffusion Transformer with MoE) |
| Scheduler | **Flow Matching** (not DDPM) |
| Output Resolution | 192-384px (continuous-scale, step=32) |
| Position Encoding | 2D-RoPE |
| Attention | **Symmetric Dual-Stream Attention** (SD3/Flux-style) |
| MoE Experts | 4 experts in DiT blocks |
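Flow matching replaces a DDPM noise schedule with a learned velocity field that is integrated as an ODE from noise to data at inference time. A minimal Euler sampler, with `v_field` standing in for the MoE-DiT forward pass (50 steps matches the inference-step count quoted above):

```python
import numpy as np

def flow_matching_sample(v_field, x0, steps=50):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (sample)
    with fixed-step Euler. `v_field` is an illustrative stand-in."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * v_field(x, t)   # x_{t+dt} = x_t + dt * v(x_t, t)
    return x
```

On the straight-line paths flow matching trains on, the target velocity is simply `x1 - x0`, which is why even a first-order integrator recovers clean samples in few steps.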
### Video Generation (3D Causal + Flow Matching)

| Feature | Description |
|---------|-------------|
| Output Resolution | 128-320px (continuous-scale, step=32) |
| Output Frames | 8-24 frames (continuous-scale, step=4) |
| Scheduler | **Flow Matching** |
| Position Encoding | **3D-RoPE** for (x, y, t) |
| Attention | Factorized Spatial-Temporal (3D Causal) |
| Expert Routing | **Temporal MoE** (4 experts) |
| Guidance Scale | 7.5 (CFG) |
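The guidance scale of 7.5 refers to standard classifier-free guidance (CFG): the model is run with and without conditioning, and the prediction is extrapolated from the unconditional toward the conditional result:

```python
def cfg_velocity(v_cond, v_uncond, scale=7.5):
    """Classifier-free guidance: amplify the conditional direction.
    scale=1.0 recovers the plain conditional prediction."""
    return v_uncond + scale * (v_cond - v_uncond)
```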
### Continuous-Scale Training Configuration

| Type | Range | Base | Step |
|------|-------|------|------|
| **Image** | 128-384px | 256px | 32px |
| **Video** | 128-320px | 192px | 32px |
| **Frames** | 8-24 | 16 | 4 |

Continuous-scale training is **enabled by default** with the **adaptive** strategy, which dynamically adjusts scale ranges based on OOM history for optimal memory usage.
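A hypothetical sketch of how sampling from the table above might look; `sample_scale` and its `max_ok` cap (mimicking the OOM-driven adaptive strategy) are illustrative helpers, not Xoron-Dev's actual trainer code:

```python
import random

def sample_scale(lo, hi, step, max_ok=None):
    """Pick a training scale on the step grid inside [lo, hi].
    `max_ok` caps the range, e.g. after an out-of-memory event."""
    if max_ok is not None:
        hi = min(hi, max_ok)              # shrink the range based on OOM history
    return random.choice(list(range(lo, hi + 1, step)))

# e.g. per the table: image resolution 128-384 (step 32), frames 8-24 (step 4)
res = sample_scale(128, 384, 32)
frames = sample_scale(8, 24, 4)
```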

### Audio (Speech-to-Speech with RMLA + MAS + Zero-Shot Cloning)

| Feature | Description |
|---------|-------------|
| **Code Execution** | Traces of code execution including `Shell` errors, timeouts, and multi-step debugging workflows to teach the model how to recover from errors. |
| **Git Operations** | Simulated version-control tasks including committing, handling diffs, resolving merge conflicts, and repository-context understanding. |
| **Chain-of-Thought** | Explicit `Synth-CoT` data to encourage internal reasoning before generating final answers. |
| **File Operations** | Document handling, FIM (Fill-in-Middle), and edit operations for precise file manipulation. |
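FIM training rearranges a document so the model learns to predict a missing middle span given its prefix and suffix. A sketch of the common prefix-suffix-middle (PSM) layout; the `<|fim_*|>` sentinel strings are placeholders, since Xoron-Dev's actual special tokens are not documented here:

```python
def make_fim_example(code, span):
    """Build one PSM-format FIM training string. `span` is the
    (start, end) character range of the middle to be filled in.
    Sentinel strings are illustrative placeholders."""
    start, end = span
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}"
```

At inference, the model is given everything up to `<|fim_middle|>` and generates the middle, which enables in-place edits rather than append-only completion.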