Multi-Head Attention & Multi-Head Self-Attention
Multi-head attention (MHA) aims to capture relationships between tokens from several different "aspects" at once. It does so through "heads": the embedding dimension is split into several non-overlapping parts, and each head attends over its own part.
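The head split described above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the helper names `split_heads` and `merge_heads` and the toy shapes are assumptions made for the example, and real implementations usually apply the split to the projected queries, keys, and values rather than to the raw embeddings.

```python
import numpy as np

def split_heads(x, num_heads):
    """Split the embedding dimension into num_heads non-overlapping slices.

    x: array of shape (seq_len, d_model)
    returns: array of shape (num_heads, seq_len, d_head)
    """
    seq_len, d_model = x.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    # (seq_len, d_model) -> (seq_len, num_heads, d_head) -> (num_heads, seq_len, d_head)
    return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

def merge_heads(h):
    """Inverse of split_heads: concatenate the heads back along the embedding axis."""
    num_heads, seq_len, d_head = h.shape
    return h.transpose(1, 0, 2).reshape(seq_len, num_heads * d_head)

# Toy input (hypothetical sizes): 4 tokens, embedding dimension 8, 2 heads.
x = np.arange(4 * 8, dtype=np.float64).reshape(4, 8)
heads = split_heads(x, num_heads=2)
```

Here `heads[0]` holds the first four embedding columns of every token and `heads[1]` holds the last four, so the heads share no features, and `merge_heads(heads)` recovers `x`.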
Self-attention is built on top of attention: the queries, keys, and values all come from the same sequence, so it captures relationships between tokens within that sequence.
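To make this concrete, here is a hedged NumPy sketch of multi-head self-attention using the standard scaled dot-product formulation. The function name, the weight matrices `w_q`, `w_k`, `w_v`, `w_o`, and the toy sizes are all assumptions for illustration; biases, masking, and dropout are left out to keep the sketch short.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model). Returns (output, attention_weights)."""
    seq_len, d_model = x.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    # Self-attention: queries, keys, and values are all projections of the same x.
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(q), split(k), split(v)

    # Each head scores every token against every other token in the sequence.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                                # (num_heads, seq_len, d_head)

    # Concatenate the heads and mix them with the output projection.
    merged = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights

# Toy example with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
y, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

The output `y` has the same shape as `x`, and `attn` holds one `seq_len` by `seq_len` weight matrix per head, with each row summing to 1. Passing a different sequence for the keys and values would turn the same computation into general (cross) attention.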
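Multi-head self-attention is the core building block of the Transformer architecture, which underlies most modern large language models.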