Self Attention

Once Vanilla Attention is understood, Self Attention is not hard to follow. In Self Attention, every token has $3$ vectors: a Query vector, a Key vector, and a Value vector. Each token uses its own Query vector to compute a similarity score against the Key vectors of all tokens, and then uses those scores as weights in a weighted sum over the Value vectors of all tokens.

In mathematical terms, given $n$ tokens $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n \in \mathbb{R}^d$, Self Attention outputs $\mathbf{y}_i$ satisfying

$$
\mathbf{y}_i = f(\mathbf{x}_i, \mathcal{D}) \in \mathbb{R}^d, \qquad \mathcal{D} = \{(\mathbf{x}_1, \mathbf{x}_1), \ldots, (\mathbf{x}_n, \mathbf{x}_n)\}
$$
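To make the description above concrete, here is a minimal single-head sketch in NumPy. It is an illustration under stated assumptions, not the implementation used elsewhere in this note: the projection matrices `W_q`, `W_k`, `W_v`, the scaled dot product as the similarity measure, and the softmax normalization are all choices made for this example, and the helper names are hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self attention over n tokens; X has shape (n, d)."""
    Q = X @ W_q  # one Query vector per token
    K = X @ W_k  # one Key vector per token
    V = X @ W_v  # one Value vector per token
    # Assumption: similarity is the scaled dot product between Query and Key.
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # shape (n, n)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    # Row i of the output is the weighted sum of all Value vectors for token i.
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 4, 8  # arbitrary sizes chosen for the example
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Y, weights = self_attention(X, W_q, W_k, W_v)
print(Y.shape)  # (4, 8)
```

Row `i` of `weights` holds the similarity of token `i`'s Query to every token's Key, so `Y[i]` plays the role of $\mathbf{y}_i$: one output vector in $\mathbb{R}^d$ per input token.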

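import torch
from d2l import torch as d2l  # assumption: the PyTorch flavor of d2l
num_hidden, num_heads = 100, 5
attention = d2l.MultiHeadAttention(num_hidden, num_heads, 0.5)  # 0.5 is the dropout rate

x = torch.ones((2, 4, num_hidden))  # assumed example input: (batch, tokens, features)
# Self attention: queries, keys, and values are all the same tensor x
y = attention(x, x, x, valid_lens=None)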