[Paper] DeepSeek FP8 Training Scheme

Overview

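DeepSeek-V3 implements FP8 GEMM operators. Because some operators need to retain enough precision during computation, DeepSeek-V3 keeps the embedding layer, output head, MoE gating, normalization, and attention modules in full or half precision (FP32 or BF16). In other words, only the matrix multiplications of the linear layers run in FP8.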

DeepSeek-V3 is trained with a mixed-precision framework: the Master Weights, Weight Gradients, and Optimizer States are still stored in higher precision.
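To illustrate why a higher-precision master copy matters, here is a minimal sketch (not taken from the paper). NumPy has no BF16 type, so `to_bf16` is a hypothetical helper that emulates BF16 by rounding FP32 values to their top 16 bits; the step size and step count are made-up numbers.

```python
import numpy as np

def to_bf16(x):
    """Emulate BF16: round float32 values to their top 16 bits (round-to-nearest-even)."""
    bits = np.asarray(x, dtype=np.float32).reshape(-1).view(np.uint32)
    rounding = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + rounding) & np.uint32(0xFFFF0000)).view(np.float32)

def run_updates(steps=100, step_size=1e-3):
    w_bf16 = to_bf16(np.array([1.0]))            # weight stored only in (emulated) BF16
    w_fp32 = np.array([1.0], dtype=np.float32)   # FP32 master weight
    delta = np.float32(step_size)
    for _ in range(steps):
        w_bf16 = to_bf16(w_bf16 + delta)         # the small update is rounded away every step
        w_fp32 = w_fp32 + delta                  # the small update accumulates
    return float(w_bf16[0]), float(w_fp32[0])

low, high = run_updates()
print(low, high)
```

Near 1.0 the spacing of BF16 values is 2^-7, so an update of 1e-3 is lost at every step, while the FP32 copy keeps accumulating it.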

Framework

The above shows the FP8 quantized computation of a Linear layer. In the forward pass, $\texttt{Fprop}$:

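The Activation from the previous layer enters in BF16 and is first quantized to FP8.
The stored Master Weight (FP32) is likewise quantized to FP8 before entering the FP8 GEMM operator.
The partial products computed by the FP8 GEMM operator are accumulated in FP32, producing the intermediate result matrix.
The intermediate result matrix is then cast to BF16 and passed on to the rest of the network.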
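The forward steps above can be sketched in NumPy. This is a simulation under stated assumptions, not the actual kernel: `fake_e4m3` is a hypothetical helper that rounds FP32 values to the nearest E4M3-representable value, per-tensor absmax scaling is used for brevity instead of the fine-grained scheme, and the final BF16 cast is left out because NumPy has no BF16 type.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value of the OCP FP8 E4M3 format

def fake_e4m3(x):
    """Round float32 values to the nearest E4M3-representable value (simulation only)."""
    x = np.asarray(x, dtype=np.float32)
    _, exp = np.frexp(x)                       # x = m * 2**exp with 0.5 <= |m| < 1
    exp = np.maximum(exp, -5)                  # below 2**-6 the format is subnormal
    step = np.ldexp(np.float32(1.0), exp - 4)  # spacing: 3 mantissa bits + implicit bit
    return np.clip(np.round(x / step) * step, -E4M3_MAX, E4M3_MAX).astype(np.float32)

def quantize(x):
    """Per-tensor absmax scaling into the E4M3 range; returns (simulated FP8 values, scale)."""
    scale = E4M3_MAX / max(float(np.abs(x).max()), 1e-12)
    return fake_e4m3(x * scale), np.float32(scale)

def fprop(x, w_master):
    x_q, sx = quantize(x)          # activation -> FP8
    w_q, sw = quantize(w_master)   # FP32 master weight -> FP8
    acc = x_q @ w_q                # products of FP8 values, accumulated in FP32
    return acc / (sx * sw)         # dequantize; the real flow then casts to BF16

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 64)).astype(np.float32)
w = rng.standard_normal((64, 16)).astype(np.float32)
y = fprop(x, w)
ref = x @ w
print(np.linalg.norm(y - ref) / np.linalg.norm(ref))   # relative error of the FP8 path
```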

For the weight-gradient pass that drives the parameter update, $\texttt{Wgrad}$:

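Although the Activation from the previous layer flows between operators in BF16, an FP8 copy is saved for the weight-gradient computation.
Suppose an Output Gradient (BF16) arrives from upstream; it is first quantized to FP8.
It is then multiplied with the saved FP8 Activation copy in an FP8 GEMM, again with FP32 accumulation, producing the Weight Gradient matrix (FP32).
The FP32 Weight Gradient is cast to BF16 and used to update the Optimizer State (BF16).
The BF16 Optimizer State is then cast to FP32 and used to update the Master Weight (FP32).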

Next, the gradient is propagated via the chain rule, $\texttt{Dgrad}$:

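As before, the gradient arriving from upstream (BF16) is quantized to FP8.
The Master Weight (FP32) is also quantized to FP8.
An FP8 GEMM is performed, with FP32 accumulation.
The resulting FP32 gradient is cast to BF16 and propagated downstream.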
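For reference, the three GEMMs of a Linear layer can be written as follows. The symbols are my own notation rather than the paper's: $X$ is the input activation, $W$ the weight, $Y$ the output, and $L$ the loss.

$$
\begin{aligned}
\texttt{Fprop}: &\quad Y = X W \\
\texttt{Dgrad}: &\quad \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}\, W^{\top} \\
\texttt{Wgrad}: &\quad \frac{\partial L}{\partial W} = X^{\top}\, \frac{\partial L}{\partial Y}
\end{aligned}
$$

$\texttt{Wgrad}$ needs the input activation $X$, which is why an FP8 copy of it is kept from the forward pass, while $\texttt{Dgrad}$ needs the weight $W$.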

Fine-grained Quantization

Challenges:

Underflow and overflow, caused by the limited dynamic range of FP8.
Outliers in activations, weights, and gradients.

Tile-wise $1\times N_c$ grouping for activations, i.e. one scaling factor per token per $N_c$ channels (activations cannot use block-wise quantization).

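Block-wise $N_c\times N_c$ grouping for weights, i.e. one scaling factor per $N_c$ input channels per $N_c$ output channels. Scaling each small group separately extends the effective dynamic range of FP8. [Also] the per-group scaling factors add dequantization overhead.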
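A minimal sketch of how the two groupings produce scaling factors, assuming absmax scaling into the E4M3 range (maximum 448) and $N_c = 128$. The function names and tensor shapes are hypothetical and chosen only for illustration.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 value

def tile_scales(act, nc):
    """One scale per 1 x nc tile: result has shape (tokens, channels // nc)."""
    t, c = act.shape
    amax = np.abs(act).reshape(t, c // nc, nc).max(axis=-1)
    return E4M3_MAX / np.maximum(amax, 1e-12)

def block_scales(w, nc):
    """One scale per nc x nc block: result has shape (rows // nc, cols // nc)."""
    r, c = w.shape
    amax = np.abs(w).reshape(r // nc, nc, c // nc, nc).max(axis=(1, 3))
    return E4M3_MAX / np.maximum(amax, 1e-12)

rng = np.random.default_rng(0)
act = rng.standard_normal((4, 256)).astype(np.float32)   # 4 tokens, 256 channels
w = rng.standard_normal((256, 384)).astype(np.float32)   # 256 input, 384 output channels

s_act = tile_scales(act, 128)
s_w = block_scales(w, 128)
print(s_act.shape, s_w.shape)
```

Each token gets its own scale for every group of 128 channels, whereas a weight block shares one scale across 128 x 128 elements.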

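Why do activations and weights use different group sizes? When activations are also quantized block-wise ($N_c \times N_c$), Dgrad turns out to be very sensitive to precision, and training diverges on an MoE model. The paper's explanation: “activation gradients are highly imbalanced among tokens, resulting in token-correlated outliers”, which a block spanning many tokens cannot handle well.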
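The effect of a token-correlated outlier can be reproduced with a toy example. This is a simulation under assumptions: `fake_e4m3` is a hypothetical helper that rounds to the nearest E4M3-representable value, and the gradient values are made up so that one token is a million times larger than the rest.

```python
import numpy as np

E4M3_MAX = 448.0

def fake_e4m3(x):
    """Round float32 values to the nearest E4M3-representable value (simulation only)."""
    x = np.asarray(x, dtype=np.float32)
    _, exp = np.frexp(x)
    exp = np.maximum(exp, -5)                  # subnormal range below 2**-6
    step = np.ldexp(np.float32(1.0), exp - 4)
    return np.clip(np.round(x / step) * step, -E4M3_MAX, E4M3_MAX).astype(np.float32)

def roundtrip(x, amax):
    """Quantize with scale 448 / amax, then dequantize; amax broadcasts against x."""
    scale = E4M3_MAX / amax
    return fake_e4m3(x * scale) / scale

grad = np.full((4, 8), 1e-3, dtype=np.float32)   # 4 tokens x 8 channels
grad[0] = 1e3                                    # one token carries outlier gradients

# One scale for the whole block: the outlier token dictates the scale.
block = roundtrip(grad, np.abs(grad).max())
# One scale per token row (a 1 x N_c tile with N_c = 8 here).
tile = roundtrip(grad, np.abs(grad).max(axis=1, keepdims=True))

print(block[1])   # gradients of a normal token after block-wise quantization
print(tile[1])    # the same token after tile-wise quantization
```

With a single block scale, the normal tokens fall below the smallest E4M3 subnormal and are flushed to zero; with per-token tiles they survive.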

Accumulation Precision

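The accumulation precision of FP8 GEMM on NVIDIA H800 GPUs is limited to retaining around 14 bits. In other words, on Hopper Tensor Cores the FP8 GEMM accumulator keeps only about $14$ bits, which causes precision loss. The problem gets worse as the inner dimension $K$ grows.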

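Split accumulation: partial sums are computed on the Tensor Cores at limited precision; once an interval of $N_c$ elements along $K$ has been accumulated, the partial result is moved to the CUDA Cores and added into the result matrix in FP32.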
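The benefit of promoting partial sums can be seen in a small simulation. As an assumption for illustration only, a float16 accumulator stands in for the limited-precision Tensor Core accumulator (float16 keeps 11 significant bits, fewer than the roughly 14 bits mentioned above, so the effect is exaggerated). The function names are hypothetical.

```python
import numpy as np

def accumulate_low(products):
    """Sum with a float16 accumulator (stand-in for the limited-precision accumulator)."""
    acc = np.float16(0.0)
    for p in products:
        acc = np.float16(np.float32(acc) + np.float32(p))
    return acc

def accumulate_promoted(products, nc):
    """Sum nc products at a time in low precision, then add each partial sum in FP32."""
    total = np.float32(0.0)
    for start in range(0, len(products), nc):
        total += np.float32(accumulate_low(products[start:start + nc]))
    return total

# K = 8192 products that are all 0.25, so the exact sum is 2048.
products = np.full(8192, 0.25, dtype=np.float32)
naive = float(accumulate_low(products))
promoted = float(accumulate_promoted(products, nc=128))
print(naive, promoted)
```

The low-precision accumulator stalls once its value spacing exceeds the size of each addend, and a larger $K$ makes this more likely. Keeping each low-precision run short (128 elements here) avoids the stall.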

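Pipelined overlap. The GEMM proceeds in two steps: first, quantize the master weight to FP8 and run the FP8 GEMM on the Tensor Cores; second, move the partial results to the CUDA Cores for FP32 accumulation. The two steps can be pipelined so that their instructions overlap, which maintains high utilization of the Tensor Cores.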

Mantissa over Exponents

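The E4M3 format (4-bit exponent, 3-bit mantissa) is used for all tensors in order to retain more precision. This is made feasible by fine-grained quantization: the elements in a small group effectively share exponent bits through the group's scaling factor.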
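To make the trade-off concrete, the range and spacing of the two FP8 formats can be derived from their bit layouts. This sketch assumes the OCP FP8 definitions (E4M3 reserves only the all-ones bit pattern for NaN, E5M2 follows IEEE-style inf/NaN encoding); `fp8_range` is a hypothetical helper.

```python
def fp8_range(exp_bits, man_bits, ieee_like):
    """Return (max finite value, smallest positive subnormal, relative spacing)."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # The top exponent code is reserved for inf/NaN (E5M2).
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 2 - 2.0 ** -man_bits
    else:
        # Only the all-ones pattern is NaN (E4M3), so the top exponent stays usable.
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 2 - 2.0 ** -(man_bits - 1)
    smallest = 2.0 ** (1 - bias - man_bits)
    return max_frac * 2.0 ** max_exp, smallest, 2.0 ** -man_bits

e4m3 = fp8_range(4, 3, ieee_like=False)
e5m2 = fp8_range(5, 2, ieee_like=True)
print("E4M3:", e4m3)
print("E5M2:", e5m2)
```

E4M3 halves the relative spacing between representable values at the cost of a much narrower range, which the per-group scaling factors compensate for.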

Online Quantization

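Compute the absmax of each $1\times N_c$ activation tile or $N_c\times N_c$ weight block online, derive the scaling factor from it, and quantize to FP8 on the fly.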
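A small sketch of what computing the scale online buys. It is contrasted here with a scale inferred from an absmax history of earlier iterations (delayed scaling), which is brought in only for comparison. All function names and numbers are hypothetical.

```python
import numpy as np

E4M3_MAX = 448.0

def delayed_scale(amax_history):
    """Scale inferred from absmax values recorded in earlier iterations."""
    return E4M3_MAX / max(amax_history)

def online_scale(x):
    """Scale computed from the tensor that is about to be quantized."""
    return E4M3_MAX / float(np.abs(x).max())

x = np.array([0.5, -1.0, 4.0], dtype=np.float32)   # the current step contains a spike
history = [1.0, 1.1, 0.9]                          # absmax seen in earlier steps

scaled_delayed = x * delayed_scale(history)   # the spike lands beyond 448 and would saturate
scaled_online = x * online_scale(x)           # the spike lands exactly on the FP8 limit
print(np.abs(scaled_delayed).max(), np.abs(scaled_online).max())
```

The online scale always matches the data being quantized, at the cost of an extra absmax reduction per tile or block.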