KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta
Abstract
Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges: model architecture diversity, kernel primitive diversity, and heterogeneity across hardware generations and architectures. This paper presents KernelEvolve, an agentic kernel coding framework, to tackle heterogeneity at scale for DLRM. KernelEvolve takes kernel specifications as input and automates the process of kernel generation and optimization for recommendation models across heterogeneous hardware architectures. It does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware-specific languages, spanning the full hardware-software optimization stack. The kernel optimization process is formulated as a graph-based search with a selection policy, universal operators, a fitness function, and a termination rule, and it dynamically adapts to the runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta's AI accelerators. We validate KernelEvolve on the publicly available KernelBench suite, achieving a 100% pass rate on all 250 problems across three difficulty levels, and on 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces kernel development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and heterogeneous AI systems at scale. Beyond performance efficiency improvements, KernelEvolve significantly lowers the programmability barrier for new AI hardware by enabling automated kernel generation for in-house AI accelerators.
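To make the search formulation concrete, the following is a minimal Python sketch of the loop the abstract describes: a graph of candidate kernels grown by a selection policy, a universal operator, a fitness function, and a termination rule. It is an illustration under assumptions, not KernelEvolve's actual interface; every identifier here (SearchNode, select_node, apply_operator, fitness, should_terminate) is hypothetical.

```python
# Hypothetical sketch of the graph-based kernel search from the abstract.
# None of these names are KernelEvolve's real API.
from dataclasses import dataclass, field


@dataclass
class SearchNode:
    source: str                      # candidate kernel source (e.g. Triton)
    score: float = 0.0               # fitness: correctness-gated speedup
    children: list = field(default_factory=list)


def fitness(source: str) -> float:
    """Compile, run, and check the candidate; return its speedup over the
    baseline, or 0.0 if it is incorrect. Stubbed out here; the real system
    would execute on the target hardware."""
    return 0.0


def search(spec: str, select_node, apply_operator, should_terminate) -> str:
    """Grow a graph of candidate kernels from the input specification.
    Each iteration: pick a node via the selection policy, expand it with a
    universal operator (e.g. an LLM rewrite whose prompt is synthesized from
    retrieved context), score the child with the fitness function, and stop
    when the termination rule fires."""
    root = SearchNode(source=spec)
    graph = [root]
    best = root
    while not should_terminate(graph, best):
        parent = select_node(graph)                  # selection policy
        child = SearchNode(apply_operator(parent))   # universal operator
        child.score = fitness(child.source)          # fitness function
        parent.children.append(child)
        graph.append(child)
        if child.score > best.score:
            best = child
    return best.source
```

A caller would supply the three policies: a selection rule (e.g., pick the most promising frontier node), an LLM-backed rewrite operator, and a termination rule such as a fixed evaluation budget.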
Community
Excited to share our recent work on KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta’s latest-generation AI accelerators (MTIA v3).
Writing high-performance GPU kernels is a complex challenge that typically demands years of deep expertise and remains a major focus of industry and academic research. It’s truly impressive to see KernelEvolve not only achieve state-of-the-art results on open benchmarks, but also deliver 1.25–17x speedups across Meta production use cases.
This milestone was made possible by outstanding collaboration across Meta—including teams from Monetization Infra and Ranking, FAIR, Compiler, MTIA, Serverless Compute, and more. Thank you to everyone for your dedication and teamwork in making this breakthrough happen!
You can read the full paper here:
👉 https://lnkd.in/gdPb43EZ
This is only ~1% of the journey. There is much more ahead in 2026 as we continue pushing the boundaries.
If your background aligns (agentic systems, LLMs, RL, AI compilers, kernels, inference/training optimization, etc.) and you're interested in joining us on this journey, feel free to DM me. We're hiring. ([email protected])