Sage Attention v3

Sage Attention v3 goes a step further than the previous two works, proposing FP4 inference and an INT8 training framework.

Attention Study/Low Bit Optimization/Quantization/Low Bit

Sage Attention v2 与 v2++

The second version of Sage Attention and its improvements.

Attention Study/Low Bit Optimization/Quantization/Low Bit

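Sage Attention v1: INT8 PTQ for Attention

Applies low-precision methods to Flash Attention. The computation pattern is the same as Flash Attention's; the overall speedup comes mainly from the faster low-precision computation minus the quantization overhead, while still preserving a reasonable level of accuracy.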

Attention Study/Low Bit Optimization/Quantization/Low Bit
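
To make the Sage Attention v1 summary above concrete, here is a minimal sketch of symmetric INT8 quantization applied to a single query/key dot product. It is not the Sage Attention kernel: the helper names `quantize_int8` and `int8_dot` are made up for illustration, one scale per vector is an assumption, and the sample values are arbitrary. It only shows the quantize, integer-accumulate, dequantize pattern whose cost the speedup has to outweigh.

```python
def quantize_int8(vec):
    """Symmetric quantization: map floats to integers in [-127, 127] with one scale.

    Hypothetical helper; using one scale per vector is an assumption of this sketch.
    """
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vec], scale


def int8_dot(q, k):
    """Approximate dot(q, k) by multiplying quantized integers, then rescaling."""
    q_int, q_scale = quantize_int8(q)
    k_int, k_scale = quantize_int8(k)
    acc = sum(a * b for a, b in zip(q_int, k_int))  # integer accumulation
    return acc * q_scale * k_scale  # dequantize back to a float


q = [0.5, -1.0, 0.25, 2.0]   # arbitrary example values
k = [1.5, 0.5, -0.75, 1.0]
exact = sum(a * b for a, b in zip(q, k))
approx = int8_dot(q, k)
print(exact, approx, abs(exact - approx))
```

The printed difference is the quantization error for this one pair; the extra `quantize_int8` calls are the overhead that the faster integer multiply has to pay back.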

Multi-Head Attention & Multi-Head Self-Attention

MHA aims to capture information between tokens from different "aspects" through its "heads", which literally split the embedding dimension into several non-overlapping parts.

Attention Study/Foundation
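
The head split described in the MHA entry above can be sketched in a few lines. This is an illustration only: `split_heads` is a hypothetical helper, and it slices a raw vector for a single token, whereas real implementations apply the same slicing to the projected Q, K and V tensors.

```python
def split_heads(embedding, num_heads):
    """Split one token's vector into num_heads contiguous, non-overlapping slices."""
    d_model = len(embedding)
    if d_model % num_heads != 0:
        raise ValueError("embedding dimension must be divisible by num_heads")
    head_dim = d_model // num_heads
    return [embedding[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]


token = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # d_model = 8, arbitrary values
heads = split_heads(token, num_heads=4)            # 4 heads of size 2
print(heads)
```

Concatenating the slices in order gives back the original vector, which is what "non-overlapping" means here: every dimension belongs to exactly one head.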

Self Attention

Self-attention is built on top of attention and is meant to capture relationships between tokens in the same sequence.

Attention Study/Foundation

Naive Attention

The foundation of AI as a whole.

Attention Study/Foundation

Flash Mask: Arbitrary Masking on Flash Attention to Fit Different Tasks

Reading CUDA code is such a joy (retch)

Attention Study