Self-Attention

Self-attention is built on top of the attention module: the $Q,K,V$ matrices all come from the same input, but each is obtained by multiplying that input by a different transformation matrix.

Suppose we have an input sequence $X\in\mathbb{R}^{n\times d_{\text{model}}}$, i.e., a sequence of $n$ tokens where each token has an embedding of dimension $d_{\text{model}}$.

We first transform $X$ into $Q,K,V$ with three matrices, $W^Q,W^K\in\mathbb{R}^{d_{\text{model}} \times d_k}$ and $W^V\in\mathbb{R}^{d_{\text{model}}\times d_v}$, so that $Q=XW^Q$, $K=XW^K$, and $V=XW^V$. We now have $Q,K\in\mathbb{R}^{n\times d_k}$ and $V\in\mathbb{R}^{n\times d_v}$, meaning that the input sequence $X$ is projected into two “spaces”: a $d_k$-dimensional space for $Q$ and $K$, and a $d_v$-dimensional space for $V$.
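The projection step can be sketched in NumPy as follows. This is only an illustration: the sizes and the random matrices are assumptions made up for the example (in a real model the three matrices are learned), and the names `W_Q`, `W_K`, `W_V` simply mirror $W^Q,W^K,W^V$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed for this sketch, not taken from the text)
n, d_model, d_k, d_v = 4, 8, 3, 5

X = rng.normal(size=(n, d_model))      # n tokens, each a d_model-dim embedding

# Random stand-ins for the three transformation matrices
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

# The same input X is projected three different ways
Q = X @ W_Q                            # shape (n, d_k)
K = X @ W_K                            # shape (n, d_k)
V = X @ W_V                            # shape (n, d_v)

print(Q.shape, K.shape, V.shape)       # (4, 3) (4, 3) (4, 5)
```

Note that $Q$ and $K$ must share the dimension $d_k$ so that $QK^\top$ is defined, while $d_v$ can be chosen independently.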

The rest is the same as in naive attention:

We first compute $QK^\top$, divide it by $\sqrt{d_k}$, and then apply $\text{softmax}()$ to each row. The resulting matrix is a score matrix (attention weights) $P\in\mathbb{R}^{n\times n}$, where the $i$-th row represents the normalized similarity between the $i$-th token and all the tokens in the sequence.
We then compute $PV$ to get a “meaning interpolation” for each row (i.e., each token): each output row is a weighted average of the rows of $V$, with weights taken from the corresponding row of $P$. The resulting matrix $O\in\mathbb{R}^{n\times d_v}$ encodes the “meaning” of each token (i.e., each row).
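Putting the steps together, a minimal single-head sketch in NumPy might look like this. The helper names `softmax_rows` and `self_attention` are hypothetical, the inputs are random placeholders, and the scaling uses $\sqrt{d_k}$ (the key dimension); masking, batching, and multiple heads are deliberately left out.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention sketch. Returns the output O and the weights P."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # all three come from the same input X
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n) scaled similarities
    P = softmax_rows(scores)              # each row is a distribution over tokens
    O = P @ V                             # (n, d_v) weighted average of value rows
    return O, P

# Illustrative sizes and random data (assumptions for this example only)
rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 4, 8, 3, 5
X = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

O, P = self_attention(X, W_Q, W_K, W_V)
print(P.shape, O.shape)                   # (4, 4) (4, 5)
print(np.allclose(P.sum(axis=1), 1.0))    # True: every row of P sums to 1
```

Because every row of `P` is non-negative and sums to 1, each row of `O` is a convex combination of the rows of `V`, which is the “interpolation” described above.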