Backup-bdg committed on
Commit 9234f2c · verified · 1 Parent(s): 091039b

Update README.md

Files changed (1)
  1. README.md +24 -18
README.md CHANGED
@@ -21,6 +21,7 @@ tags:
- flow-matching
- 3d-rope
- titok
+ - vidtok
- dual-stream-attention
- zero-shot-voice-cloning
- bigvgan
@@ -107,15 +108,16 @@ datasets:

</div>

+ # ![Xoron-Dev Logo](assets/IMG_2925.PNG)
**Xoron-Dev** is a unified, multimodal AI model designed to understand and generate text, images, video, and audio within a single architecture. It leverages a **Mixture of Experts (MoE)** backbone with DeepSeek-style shared expert isolation and integrates SOTA encoders (SigLIP-2 with TiTok + Dual-Stream Attention) and generators (MoE-DiT with Flow Matching) for comprehensive any-to-any capabilities.

## 🌟 Model Highlights

* **Architecture:** Mixture of Experts (8 Experts + 1 Shared, top-2 routing) with Ring Attention and Aux-Lossless routing.
- * **Multi-Scale Training (NEW):** Random scale selection per batch - images (128-512px), videos (128-384px), frames (8-32 including 20).
- * **Vision Encoder:** SigLIP-2 (384px native) with **TiTok-style 1D tokenization** (256 compressed tokens), **Dual-Stream Attention** (2 layers), and **2D-RoPE** for images; **3D-RoPE** + **Temporal MoE** (4 experts) for video (8-32 frames).
- * **Image Generation:** **MoE-DiT** (Diffusion Transformer with 4 MoE experts) using **Flow Matching**, **2D-RoPE**, and **Symmetric Dual-Stream Attention** (SD3/Flux-style). Multi-scale output: 256-512px, 50 inference steps.
- * **Video Generation:** **3D Causal Transformers** (4 layers) with **Flow Matching**, **3D-RoPE** for (x,y,t) positions, and **Temporal Expert Routing** (4 experts). Multi-scale: 8-32 frames @ 128-384px.
+ * **Continuous-Scale Training:** Adaptive strategy that samples any scale in the range - images (128-384px), videos (128-320px), frames (8-24).
+ * **Vision Encoder:** SigLIP-2 (384px native) with **TiTok-style 1D tokenization** (256 compressed tokens), **Dual-Stream Attention** (2 layers), and **2D-RoPE** for images; **3D-RoPE** + **VidTokTokenizer** (full 3D VAE with 4x8x8 compression) + **Temporal MoE** (4 experts) for video (8-24 frames).
+ * **Image Generation:** **MoE-DiT** (Diffusion Transformer with 4 MoE experts) using **Flow Matching**, **2D-RoPE**, and **Symmetric Dual-Stream Attention** (SD3/Flux-style). Multi-scale output: 192-384px, 50 inference steps.
+ * **Video Generation:** **3D Causal Transformers** (4 layers) with **Flow Matching**, **3D-RoPE** for (x,y,t) positions, and **Temporal Expert Routing** (4 experts). Multi-scale: 8-24 frames @ 128-320px.
* **Audio (Speech-to-Speech):** **Conformer encoder with RMLA** and **Raw Waveform Tokenizer** for ASR; **Direct waveform decoder** (no vocoder needed!) with **MAS** for TTS; **Zero-Shot Speaker Cloning** with In-Context Audio Prompting. Talk to it, and it talks back!
* **Agentic:** Trained for tool calling, file operations, and code execution with uncertainty estimation.
* **Context:** Efficient 128K context using Ring Attention (4096 chunk size).
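
The highlights describe a DeepSeek-style MoE: 8 routed experts with top-2 selection plus 1 always-on shared expert. A minimal sketch of that routing pattern follows; module names and dimensions are illustrative, not this repo's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    # 8 routed experts with top-2 routing, plus 1 shared expert applied to
    # every token (DeepSeek-style shared-expert isolation). Sizes are placeholders.
    def __init__(self, d_model=1024, n_experts=8, top_k=2):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                    nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
        self.shared = ffn()
        self.top_k = top_k

    def forward(self, x):                        # x: (num_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the top-2
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[:, k] == e            # tokens whose slot k picked expert e
                if mask.any():
                    routed[mask] += weights[mask, k, None] * expert(x[mask])
        return self.shared(x) + routed           # shared expert sees all tokens
```

With the figures quoted above, the Ring Attention would split a full 128K context into 128K / 4096 = 32 chunks passed around the device ring.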
@@ -151,12 +153,16 @@ datasets:
| Position Encoding | 2D-RoPE |
| Output Tokens | 64 tokens per image |

- ### 🎬 Video Encoder (3D Causal Transformers)
+ ### 🎬 Video Encoder (3D Causal Transformers + VidTok)
| Feature | Description |
|---------|-------------|
- | Frame Scales | 8, 12, 16, 24, 32 frames (multi-scale) |
- | Resolution Scales | 128, 192, 256, 320, 384px (multi-scale) |
+ | Frame Range | 8-24 frames (continuous-scale) |
+ | Resolution Range | 128-320px (continuous-scale) |
| Position Encoding | **3D-RoPE** for (x, y, t) coordinates |
+ | VidTokTokenizer | Full 3D VAE (Microsoft VidTok architecture) |
+ | Compression | 4x temporal, 8x8 spatial (4x8x8 total) |
+ | Architecture | 2D+1D efficient design with AlphaBlender |
+ | Quantization | Continuous (KL) or Discrete (FSQ) |
| Attention | 3D Causal Self-Attention |
| Expert Routing | **Temporal MoE** (4 experts, temporally-aware) |
| Encoder Layers | 4 layers |
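
The new VidTok rows specify 4x temporal and 8x8 spatial compression. A quick sanity check of the resulting latent grid, assuming plain integer-division strides (the actual tokenizer treats the first frame causally and may pad, so this is an approximation):

```python
# Latent-grid arithmetic for the 4x (temporal) by 8x8 (spatial) strides
# listed in the table above.
def latent_shape(frames, height, width, t_stride=4, s_stride=8):
    return (frames // t_stride, height // s_stride, width // s_stride)

for frames, res in [(8, 128), (16, 256), (24, 320)]:
    t, h, w = latent_shape(frames, res, res)
    print(f"{frames} frames @ {res}px -> latents {t}x{h}x{w} = {t * h * w}")
# e.g. 16 frames @ 256px -> latents 4x32x32 = 4096 positions
```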
@@ -166,7 +172,7 @@ datasets:
|---------|-------------|
| Architecture | **MoE-DiT** (Diffusion Transformer with MoE) |
| Scheduler | **Flow Matching** (not DDPM) |
- | Output Resolution | 256-512px (multi-scale: 256, 320, 384, 448, 512) |
+ | Output Resolution | 192-384px (continuous-scale, step=32) |
| Position Encoding | 2D-RoPE |
| Attention | **Symmetric Dual-Stream Attention** (SD3/Flux-style) |
| MoE Experts | 4 experts in DiT blocks |
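
The table lists Flow Matching rather than DDPM: the DiT learns to predict the velocity of a straight path from noise to data, and sampling integrates that velocity field (e.g., over the 50 inference steps noted above). A minimal sketch of the training objective, with `model` standing in for the MoE-DiT:

```python
import torch

def flow_matching_loss(model, x1, cond):
    # x1: clean latents (B, ...); cond: conditioning. This is a rectified-flow
    # style sketch, not the repo's actual training loop.
    x0 = torch.randn_like(x1)                    # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))     # broadcast t over latent dims
    xt = (1 - t_) * x0 + t_ * x1                 # straight-line interpolant
    v_target = x1 - x0                           # constant velocity to match
    v_pred = model(xt, t, cond)                  # DiT predicts velocity at (xt, t)
    return torch.mean((v_pred - v_target) ** 2)
```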
@@ -176,22 +182,22 @@ datasets:
### 📹 Video Generation (3D Causal + Flow Matching)
| Feature | Description |
|---------|-------------|
- | Output Resolution | 128-384px (multi-scale: 128, 192, 256, 320, 384) |
- | Output Frames | 8-32 frames (multi-scale: 8, 12, 16, 20, 24, 32) |
+ | Output Resolution | 128-320px (continuous-scale, step=32) |
+ | Output Frames | 8-24 frames (continuous-scale, step=4) |
| Scheduler | **Flow Matching** |
| Position Encoding | **3D-RoPE** for (x, y, t) |
| Attention | Factorized Spatial-Temporal (3D Causal) |
| Expert Routing | **Temporal MoE** (4 experts) |
| Guidance Scale | 7.5 (CFG) |

- ### 📐 Multi-Scale Training Configuration
- | Type | Scales | Probabilities |
- |------|--------|---------------|
- | **Image** | 128, 192, 256, 320, 384, 448, 512px | 5%, 10%, 30%, 25%, 15%, 10%, 5% |
- | **Video** | 128, 192, 256, 320, 384px | 10%, 20%, 35%, 25%, 10% |
- | **Frames** | 8, 12, 16, 20, 24, 32 | 10%, 15%, 30%, 20%, 15%, 10% |
+ ### 📐 Continuous-Scale Training Configuration
+ | Type | Range | Base | Step |
+ |------|-------|------|------|
+ | **Image** | 128-384px | 256px | 32px |
+ | **Video** | 128-320px | 192px | 32px |
+ | **Frames** | 8-24 | 16 | 4 |

- Multi-scale training is **enabled by default** with **random** strategy - each batch samples a different scale for variety.
+ Continuous-scale training is **enabled by default** with the **adaptive** strategy - scale ranges adjust dynamically based on OOM history for optimal memory usage.

### 🎤 Audio (Speech-to-Speech with RMLA + MAS + Zero-Shot Cloning)
| Feature | Description |
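
The diff does not spell out the adaptive continuous-scale mechanics; a plausible reading, sketched below with hypothetical names, is to sample any step-aligned size inside the range and back the ceiling off after an out-of-memory event:

```python
import random

class ContinuousScaleSampler:
    # Hypothetical sketch of the "adaptive" strategy described above: draw any
    # step-aligned size in [lo, hi], and lower hi by one step whenever a batch OOMs.
    def __init__(self, lo, hi, step):
        self.lo, self.hi, self.step = lo, hi, step

    def sample(self):
        choices = (self.hi - self.lo) // self.step + 1
        return self.lo + self.step * random.randrange(choices)

    def report_oom(self):
        self.hi = max(self.lo, self.hi - self.step)

image_res = ContinuousScaleSampler(128, 384, 32)   # image range from the table
frames = ContinuousScaleSampler(8, 24, 4)          # frame range from the table
print(image_res.sample(), frames.sample())
```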
@@ -240,4 +246,4 @@ To bridge the gap between general knowledge and actionable agentic behavior, we
| **Code Execution** | Traces of code execution including `Shell` errors, timeouts, and multi-step debugging workflows to teach the model how to recover from errors. |
| **Git Operations** | Simulated version control tasks including committing, handling diffs, resolving merge conflicts, and repository context understanding. |
| **Chain-of-Thought** | Explicit `Synth-CoT` data to encourage internal reasoning before generating final answers. |
- | **File Operations** | Document handling, FIM (Fill-in-Middle), and edit operations for precise file manipulation. |
+ | **File Operations** | Document handling, FIM (Fill-in-Middle), and edit operations for precise file manipulation. |
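
For the FIM entry above: Fill-in-the-Middle training rearranges a document around sentinel tokens so the model learns to generate the missing span. The sentinel strings below follow the common open-source convention and are not confirmed tokens from this model's tokenizer:

```python
# Generic Fill-in-the-Middle formatting; sentinel names are the usual
# convention, NOT verified against this model's vocabulary.
def to_fim(prefix: str, middle: str, suffix: str) -> str:
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

print(to_fim("def add(a, b):\n    return ", "a + b", "\n\nprint(add(1, 2))"))
```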
 