Native FP8 Mixed Precision Training for Ling 2.0, Open Sourced!
Open source repos:
- Github: https://github.com/inclusionAI/Ling-V2/
- Huggingface: https://huggingface.co/collections/inclusionAI/ling-v2-68bf1dd2fc34c306c1fa6f86
- Modelscope: https://www.modelscope.cn/collections/Ling-V2-01d8988fbf864d
TL;DR
This open-sourced Ling 2.0 natively adopts FP8 precision for training. In our continuous pursuit of maximizing FP8's cost efficiency, we have achieved the following breakthroughs:
- Near-lossless Model Quality: Finer-grained quantization mitigates the impact of outliers on overall quantization accuracy, ensuring the loss convergence trajectory remains consistent with BF16 training. Additionally, fine-grained quantization enables backward propagation to utilize higher-precision FP8 E4M3 format instead of E5M2, further reducing numerical precision loss.
- Enhanced Framework Efficiency: The memory advantages of FP8 expand opportunities for model partitioning and recomputation techniques, thereby boosting overall training throughput. Building upon FP8 tile/block-wise dynamic scaling, we introduced novel optimizations including FP8 optimizer, FP8 on-demand transpose weight, and FP8 padding routing map. Benchmark tests on 8/16/32 × 80G GPU clusters demonstrate that Ling-mini-2.0 achieves:
- 30–60% throughput gain over LLaMA 3.1 8B and Qwen3 8B with MTP enabled
- 90–120% throughput gain with MTP disabled.
In recent years, large model technology has advanced rapidly, with model parameters and training data volume growing exponentially. This has posed unprecedented challenges to computational resources, memory bandwidth, and energy consumption, making the reduction of computational and storage overhead critical for R&D efficiency and inference deployment costs. While low-precision training reduces memory footprint and boosts computational efficiency, the reduced bit width struggles to represent values precisely, leading to sluggish loss convergence and degraded benchmark performance. These are new issues that demand solutions.
The newly open-sourced Ling 2.0 series models natively adopt FP8 hybrid-precision training, and we have concurrently open-sourced a complete FP8 training solution. This marks the industry’s first fully open-source solution supporting FP8 hybrid-precision training for MoE models, delivering out-of-the-box usability.
How to mitigate FP8 precision disadvantages?
Where do precision concerns originate?
FP8 represents floating-point numbers using just 8 bits. Compared with FP32/FP16/BF16, the constrained bitwidth forces trade-offs between exponent and mantissa bits, directly impacting FP8's numerical range and representation fidelity.
As shown below, current community standards gravitate toward two FP8 formats: E4M3 (4 exponent/3 mantissa bits) and E5M2 (5 exponent/2 mantissa bits). While E4M3 gains one extra mantissa bit for improved precision, it sacrifices exponent width—resulting in a narrower dynamic range than E5M2.
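The trade-off between the two formats can be checked with a quick pure-Python calculation from the bit layouts. This sketch assumes the common E4M3 variant that keeps the all-ones exponent for finite values (reserving only mantissa = 111 for NaN), while E5M2 follows IEEE 754 special-value rules:

```python
def fp8_range(exp_bits, man_bits, top_binade_usable):
    """Largest normal and smallest subnormal value of an FP8 format.

    top_binade_usable: whether the all-ones-exponent binade still encodes
    finite numbers (True for E4M3, which only reserves mantissa=111 for NaN;
    False for E5M2, which reserves the whole binade for inf/NaN as in IEEE 754).
    """
    bias = 2 ** (exp_bits - 1) - 1
    if top_binade_usable:
        max_exp = (2 ** exp_bits - 1) - bias            # E4M3: all-ones exponent usable
        max_man = 1 + (2 ** man_bits - 2) / 2 ** man_bits  # mantissa=111 is NaN
    else:
        max_exp = (2 ** exp_bits - 2) - bias            # E5M2: top binade is inf/NaN
        max_man = 1 + (2 ** man_bits - 1) / 2 ** man_bits
    max_normal = 2 ** max_exp * max_man
    min_subnormal = 2 ** (1 - bias) * 2 ** -man_bits
    return max_normal, min_subnormal

print("E4M3:", fp8_range(4, 3, True))   # (448.0, 0.001953125)
print("E5M2:", fp8_range(5, 2, False))  # (57344.0, 1.52587890625e-05)
```

The extra exponent bit buys E5M2 a dynamic range over 100× wider than E4M3's, at the cost of one mantissa bit of precision.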
Compared with BF16 (E8M7), FP8's narrower dynamic range and lower numerical fidelity can lead to sluggish loss convergence ("inaccurate computation") and degraded model performance on benchmarks. To represent tensors such as weights, activations, and gradients during training, we therefore pair each FP8 tensor with a scale tensor: a matrix or scalar that occupies minimal memory yet extends the span of magnitudes the pair can represent.
The process of converting a native BF16 tensor to an (FP8 tensor + scale tensor) pair is called quantization, and its reverse is dequantization. Within this paradigm, existing BF16 Linear layers can be seamlessly converted to FP8 Linear as shown in the figure below.
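As a concrete illustration, the quantize/dequantize pair can be sketched in a few lines of numpy. This simulates the E4M3 cast in software (rounding to 3 mantissa bits, no subnormal flushing) rather than using a real hardware FP8 dtype, and uses per-tensor scaling for simplicity:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def simulate_e4m3(v: np.ndarray) -> np.ndarray:
    """Software stand-in for the hardware E4M3 cast:
    round to 3 mantissa bits and clip to the representable range."""
    m, e = np.frexp(v)               # v = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16) / 16        # keep 4 significant bits (1 implicit + 3 stored)
    return np.clip(np.ldexp(m, e), -E4M3_MAX, E4M3_MAX)

def quantize(x: np.ndarray):
    """BF16/FP32 tensor -> (simulated FP8 tensor, scale), per-tensor scaling."""
    amax = float(np.abs(x).max())
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return simulate_e4m3(x / scale), scale

def dequantize(x_fp8: np.ndarray, scale: float) -> np.ndarray:
    return x_fp8 * scale
```

Round-tripping a tensor through this pair reproduces it up to a small relative error; that residual error is exactly the quantization error whose sources the next section dissects.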
Where does the error come from?
In exploring low-precision training, we attribute FP8 errors to three sources:
- FP8 quantization underflow (non-zero matrix elements quantized to zero)
- FP8 range collapse (distinct pre-quantization values becoming identical after quantization)
- GEMM precision deviation with FP8 inputs
Among these, the error from point 3 is negligible for training convergence, since the GEMM kernel maintains FP32-precision accumulators. The root causes lie chiefly in points 1 and 2.
To safeguard Ling 2.0 training, we thus deployed monitors for FP8 range collapse, underflow, and high-precision recomputation drift, triggering real-time alerts and interventions when precision anomalies arise.
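A monitor of this kind can be sketched as follows. This is a numpy sketch under stated assumptions, not the production monitor: the threshold assumes E4M3, whose smallest positive (subnormal) value is 2^-9, and counts nonzero inputs whose scaled magnitude would round to zero:

```python
import numpy as np

E4M3_MAX = 448.0
E4M3_TINY = 2.0 ** -9   # smallest positive (subnormal) E4M3 value

def fp8_health_metrics(x: np.ndarray, scale: float):
    """Underflow/overflow rates for a tensor about to be cast to FP8 E4M3.

    underflow_rate: fraction of nonzero inputs whose scaled magnitude is small
    enough to round to zero -- the "non-zero elements quantized to zero" case.
    """
    scaled = np.abs(x[x != 0.0]) / scale
    underflow_rate = float(np.mean(scaled < E4M3_TINY / 2))
    overflow_rate = float(np.mean(scaled > E4M3_MAX))
    return underflow_rate, overflow_rate
```

With a per-tensor scale dominated by a single outlier, small entries vanish en masse; a rising underflow rate is therefore a natural trigger for the alerts described above.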
Ling-mini 2.0 Training Solution
Coarse-Grained Quantization (Per-Tensor/Channel-wise Scaling)
Early FP8 training schemes typically employed per-tensor-wise or per-channel-wise scaling. In such methods, the quantization scale proves vulnerable to matrix outliers, thereby aggravating quantization errors and hampering loss convergence. During late stages of large-scale training, this vulnerability often manifests as slowed convergence or even rising loss—issues difficult to quickly resolve through precision boosts. Consequently, these FP8 approaches struggle to safeguard training for ultra-large LLMs.
Fine-Grained Quantization (Per-Tile/Block-wise Scaling)
Following DeepSeek V3, Ling-mini-2.0 also adopts a finer-grained tile/block-wise scaling strategy. By partitioning tensors into tiles/blocks and maintaining a dedicated scale per block, this method largely prevents outliers from distorting global quantization accuracy, paving the way for training ultra-large LLMs entirely in FP8 precision.
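A minimal sketch of the block-wise scale computation, using generic square blocks (DeepSeek-V3-style setups use 1×128 tiles for activations and 128×128 blocks for weights; the exact geometry here is illustrative):

```python
import numpy as np

E4M3_MAX = 448.0

def blockwise_scales(w: np.ndarray, block: int = 128) -> np.ndarray:
    """One scale per (block x block) tile.

    An outlier only inflates the scale of its own tile, leaving every other
    tile of the matrix quantized at full accuracy."""
    r, c = w.shape
    assert r % block == 0 and c % block == 0
    tiles = w.reshape(r // block, block, c // block, block)
    amax = np.abs(tiles).max(axis=(1, 3))   # per-tile max magnitude
    return amax / E4M3_MAX                  # shape: (r // block, c // block)
```

Contrast with per-tensor scaling: a single large entry would set the scale for the whole matrix and push every small entry toward underflow, whereas here it only affects one tile.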
Ling-mini 2.0 Experimental Results
BF16 vs. FP8 Loss Comparison
The figure below shows the loss differential between BF16 and FP8 precision training of the Ling-mini-2.0 architecture under the methodology described above. After stable convergence, FP8 consistently maintains a relative error of about 0.001 versus BF16 throughout training, with the loss gap showing no sign of widening.
FP8 Underflow / Distortion Metrics
According to our FP8 underflow/distortion monitoring, both activations and gradients maintain exceptionally low underflow rates throughout training. This guarantees the reliability of FP8 computation results in both forward propagation and backward propagation (dx).
However, monitoring reveals higher quantization errors in the top-layer gradients (transposed) that drive dw during backpropagation. After multi-perspective analysis and validation, we’ve determined that this error 1) causes no observable harm to model training and 2) will not accumulate across layers since dw resides at a leaf node where quantization errors do not propagate.
Unlocking FP8’s performance benefits
Motivation
With a robust training recipe secured, we shift our focus to FP8's performance advantages. Given that most individual users lack abundant computing resources, we aim to exploit FP8's lower GPU memory footprint while also optimizing CPU overhead, delivering an efficient mixed-precision training solution. This enables users with limited resources (8–32 × 80 GB GPUs) to achieve higher throughput when continuing to train Ling-mini-2.0 than with sub-10B dense models:
- Reduced GPU memory overhead: FP8 halves memory usage versus BF16, enabling flexible tuning of micro batch sizes / tensor parallelism (TP), pipeline parallelism (PP) strategies, and gradient checkpointing. Our fine-grained quantization also ensures safe LLM training—implying weights, activations, gradients, and optimizer states can be "compressed" to FP8 and decompressed with near-lossless fidelity. This provides a viable speed-space tradeoff.
- CPU overhead optimization: Current FP8 implementations in Megatron + Transformer Engine introduce excessive ops—e.g., FP8 quantize/dequantize, FP8 padding/unpadding, and redundant validation checks—all potential bottlenecks to training efficiency.
Superior Framework Execution
FP8 Optimizer
Adam keeps two extra states per parameter (the first- and second-order moments), which in FP32 take twice the memory of the parameters themselves. Once the model outgrows GPU memory, those moments become the next wall. Storing them in FP8 trims 75% of that footprint, turning "out of memory" into room for a larger model or a bigger micro-batch.
Fine-grained tile-wise compression and decompression keep convergence and downstream metrics indistinguishable from the full-precision baseline; the moments are naturally smoothed across steps, so quantization noise is washed out.
_Implementation inspiration:_ 8-bit Optimizers via Block-wise Quantization
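In the spirit of the block-wise 8-bit optimizer referenced above, the moment compress/decompress round trip can be sketched as follows. This is a numpy simulation of the E4M3 cast; the block size and helper names are illustrative, not the actual kernels:

```python
import numpy as np

E4M3_MAX = 448.0

def to_e4m3(v: np.ndarray) -> np.ndarray:
    """Simulated cast to FP8 E4M3: 3 mantissa bits, clipped to +-448."""
    m, e = np.frexp(v)               # v = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16) / 16        # keep 4 significant bits
    return np.clip(np.ldexp(m, e), -E4M3_MAX, E4M3_MAX)

def compress_moment(mom: np.ndarray, block: int = 128):
    """Block-wise compression of an Adam moment tensor (size % block == 0 assumed)."""
    flat = mom.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / E4M3_MAX
    scale[scale == 0.0] = 1.0        # all-zero blocks stay zero
    return to_e4m3(flat / scale), scale

def decompress_moment(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    return (q * scale).reshape(shape)
```

Because each block carries its own scale, the reconstruction error stays bounded by the mantissa width regardless of how the moment magnitudes vary across the tensor.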
FP8 On-Demand Transpose Weight
Because the transpose operation on an FP8 tensor is relatively slow, the original Transformer Engine implementation caches an extra copy of weight.T for the backward pass in order to improve efficiency. This extra cache means that the overall memory consumption of the weight tensor is not reduced.
Against this background, we developed a faster transpose kernel and implemented on‑demand transposition, removing the need to store the additional transposed weight. As a result, the memory occupied by the weight tensor is cut by roughly 50 %.
Committed: blockwise fp8 weight memory optimization: on-demand columnwise fp8 weight creation
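Conceptually, on-demand transposition moves the transposed copy from persistent state into a backward-pass temporary. A framework-agnostic numpy sketch (the class and method names are hypothetical, and numpy stands in for the FP8 GEMM kernels):

```python
import numpy as np

class Fp8LinearSketch:
    """Sketch of on-demand weight transposition.

    Instead of caching a second, transposed FP8 weight copy for the whole
    training step, backward materializes weight.T just in time and frees it
    immediately, so only one persistent weight copy remains."""

    def __init__(self, weight: np.ndarray):   # weight: (out_features, in_features)
        self.weight = weight

    def forward(self, x: np.ndarray) -> np.ndarray:
        self.x = x                            # saved activation for backward
        return x @ self.weight.T              # y = x W^T

    def backward(self, dy: np.ndarray):
        # On-demand columnwise copy; the real implementation uses a fast FP8
        # transpose kernel here instead of caching w_t across the step.
        w_t = np.ascontiguousarray(self.weight.T)
        dx = dy @ w_t.T                       # dX = dY W
        dw = dy.T @ self.x                    # dW = dY^T X
        del w_t                               # temporary freed right after use
        return dx, dw
```

The temporary exists only for the duration of one backward call, which is why steady-state weight memory drops by roughly half.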
FP8 Routing Map Padding
The FP8 GEMM kernel requires matrix shapes to be multiples of 16, which is hard to guarantee for the per-expert token count in Mixture-of-Experts (MoE) models.
The original Megatron framework solves this by inserting an extra padding/unpadding layer, but doing so introduces additional CPU overhead. To eliminate this unnecessary latency, we instead modify the routing map before the routing step so that the resulting matrix shape already conforms to the FP8 GEMM kernel's requirements. Because the edited region corresponds to router probabilities of 0, the computation remains mathematically equivalent, further improving training efficiency.
Merged into Megatron: FP8 padding optimization of MoE models by padding the routing map.
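The routing-map edit can be sketched as follows (a numpy sketch; `ALIGN = 16` matches the FP8 GEMM shape requirement, and the flipped entries correspond to router probabilities of 0, so expert outputs are unchanged):

```python
import numpy as np

ALIGN = 16  # FP8 GEMM requires per-expert token counts to be multiples of 16

def pad_routing_map(routing_map: np.ndarray) -> np.ndarray:
    """Flip extra (token, expert) entries to True so every expert's token
    count is a multiple of ALIGN.

    routing_map: boolean (num_tokens, num_experts) assignment matrix.
    Assumes enough unassigned slots exist per expert column to pad into.
    The matching router probabilities are 0, so the padded tokens contribute
    nothing to the expert outputs."""
    padded = routing_map.copy()
    for e in range(padded.shape[1]):
        count = int(padded[:, e].sum())
        deficit = (-count) % ALIGN
        if deficit:
            free = np.flatnonzero(~padded[:, e])[:deficit]
            padded[free, e] = True
    return padded
```

Doing this once on the routing map replaces the per-layer padding/unpadding ops, which is where the CPU-overhead saving comes from.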
Benchmark Performance
The FP8 techniques above save 14–16 GB of per-GPU VRAM versus the BF16 baseline, leaving enough headroom to raise the micro-batch size and push system throughput. Benchmarks across 8/16/32 × 80 GB GPU clusters show:
- With MTP activated, Ling-mini-2.0 hits 30–60% higher throughput vs. LLaMA 3.1 8B and Qwen3 8B
- With MTP deactivated, throughput spikes by 90–120%
| Model | 8 x 80G GPUs (GBS=128) | 16 x 80G GPUs (GBS=256) | 32 x 80G GPUs (GBS=512) |
|---|---|---|---|
| LLaMA 3.1 8B (baseline) | 81222 | 161319 | 321403 |
| Qwen3 8B | 55775 (-31.33%) | 109799 (-31.94%) | 219943 (-31.57%) |
| Ling-mini-2.0 | 109532 (+34.86%) | 221585 (+37.36%) | 448726 (+39.61%) |
| Ling-mini-2.0 w/o MTP | 128298 (+57.96%) | 307264 (+90.47%) | 611466 (+90.25%) |
In our hunt for a training pipeline that is cheaper yet better than BF16, "bang for buck" boils down to one line: lower loss, or a higher leaderboard score, for the same wall-clock time.
Tiny wins can be scratched out here and there, yet experience shows that squeezing out real speed is a long, grinding road: once observable accuracy is lost, it takes even larger and longer cycles to buy it back. For the Ling family we therefore set a non-negotiable bar: zero degradation first; only after that do we keep chasing FLOPs.
Today, thanks to 3D parallelism, any transformer-based language model (no matter the size) can be spread across GPUs with TP/PP/CP sharding. Low precision adds another dial that trades compute for memory, so under a fixed resource cap we must keep juggling compute efficiency, communication volume and recomputation to push throughput ever higher. That balancing act is a journey without end.