Sage Attention v3
Sage Attention v3 goes a step further than the two previous works, proposing FP4 inference and an INT8 training framework.
Sage Attention v2 and its improvements
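Applies low-precision methods to Flash Attention. The computation pattern is the same as Flash Attention's, so the overall speedup comes mainly from the gain of low-precision computation minus the quantization overhead, while still preserving a reasonable level of accuracy.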
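As a rough illustration of that trade-off, here is a minimal sketch of the quantize, integer-accumulate, dequantize pattern for a single query-key dot product. It assumes symmetric per-row INT8 quantization; the helper names `quantize_int8` and `int8_dot` are hypothetical, and this is not the actual Sage Attention kernel, whose choice of tensors and quantization granularity may differ.

```python
def quantize_int8(row):
    """Symmetric per-row INT8 quantization (assumed scheme): returns (ints, scale)."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(x / scale) for x in row], scale


def int8_dot(q_row, k_row):
    """Approximate dot(q_row, k_row) with an integer multiply-accumulate."""
    q_int, q_scale = quantize_int8(q_row)   # quantization overhead
    k_int, k_scale = quantize_int8(k_row)   # quantization overhead
    acc = sum(a * b for a, b in zip(q_int, k_int))  # the cheap low-precision part
    return acc * q_scale * k_scale          # dequantize back to float


q = [0.5, -1.0, 0.25, 2.0]
k = [1.0, 0.5, -0.75, 0.1]
exact = sum(a * b for a, b in zip(q, k))
approx = int8_dot(q, k)
print(exact, approx)
```

The two `quantize_int8` calls are the overhead; the integer accumulate is where hardware with fast INT8 units would recover the time.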
MHA aims to capture information between tokens from different "aspects" through its "heads", which literally split the embedding dimension into several non-overlapping parts.
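The head split can be sketched in a few lines of plain Python. The function name `split_heads` and the toy values are illustrative assumptions; in a real MHA layer the split is applied to the projected queries, keys, and values rather than to the raw embeddings.

```python
def split_heads(x, num_heads):
    """Split each token vector into num_heads contiguous, non-overlapping slices.

    x: list of token vectors, each of length d_model.
    Returns heads such that heads[h][t] is the slice of token t seen by head h.
    """
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [
        [tok[h * d_head:(h + 1) * d_head] for tok in x]
        for h in range(num_heads)
    ]


tokens = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]  # 2 tokens, d_model = 6
heads = split_heads(tokens, num_heads=3)               # 3 heads, d_head = 2
print(heads)
```

Each head then runs attention on its own slice, and the per-head outputs are concatenated back to length `d_model`.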
Self-attention is built on top of attention and is meant to capture relationships between tokens in the same sequence.
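A minimal sketch of scaled dot-product self-attention follows. To keep it short it assumes identity projections (queries, keys, and values are all the input tokens themselves), which a real layer would replace with learned weight matrices; the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with identity Q/K/V (an assumption)."""
    d = len(x[0])
    out = []
    for q in x:
        # Score this token against every token in the same sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Output is the attention-weighted average of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = self_attention(seq)
print(result)
```

Because queries, keys, and values all come from the same `seq`, every output row mixes information from the other tokens in that sequence.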
The foundation of the whole AI.
Reading CUDA code is such a joy (retch)