Article • NEO-unify: Building Native Multimodal Unified Models End to End • 11 days ago • 94
Article • Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective • Jan 27 • 67
Parallel Sentences Datasets Collection • Each dataset in this collection has "english" and "non_english" columns. They can be used to make embedding models multilingual. • 14 items • Updated Dec 10, 2025 • 21
RedPajama: an Open Dataset for Training Large Language Models Paper • 2411.12372 • Published Nov 19, 2024 • 57