Self-attention is built on top of the attention module: the $Q, K, V$ matrices all come from the same input, but each is obtained by multiplying that input by a different transformation matrix.

Suppose we have an input sequence $X\in\mathbb{R}^{n\times d_{\text{model}}}$, that is, a sequence of $n$ tokens where each token has an embedding of dimension $d_{\text{model}}$.

We first transform $X$ into $Q, K, V$ with three matrices, $W^Q, W^K\in\mathbb{R}^{d_{\text{model}} \times d_k}$ and $W^V\in\mathbb{R}^{d_{\text{model}}\times d_v}$, so that $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. Now $Q,K\in\mathbb{R}^{n\times d_k}$ and $V\in\mathbb{R}^{n\times d_v}$, meaning that the input sequence $X$ is projected into two “spaces” (the $Q,K$ space and the $V$ space).
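
As a quick shape check, the projection step can be sketched in NumPy. The sizes and the random weights below are illustrative assumptions only (in a real model the three matrices are learned), and the variable names `W_Q`, `W_K`, `W_V` are just stand-ins for $W^Q, W^K, W^V$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed for this sketch, not taken from any model).
n, d_model, d_k, d_v = 4, 8, 3, 5

X = rng.standard_normal((n, d_model))      # n tokens, each of dimension d_model

# Random stand-ins for the learned projection matrices.
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_v))

# All three projections start from the same input X.
Q = X @ W_Q   # shape (n, d_k)
K = X @ W_K   # shape (n, d_k)
V = X @ W_V   # shape (n, d_v)

print(Q.shape, K.shape, V.shape)  # (4, 3) (4, 3) (4, 5)
```

Note that $Q$ and $K$ must share the dimension $d_k$ so that their rows can be compared with a dot product, while $d_v$ is free to differ.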

The rest is the same as naive attention:

  1. We first compute $QK^\top$, divide it by $\sqrt{d_k}$, and apply $\text{softmax}(\cdot)$ row-wise. The result is a score matrix (attention weights) $P\in\mathbb{R}^{n\times n}$, where the $i$-th row represents the similarity between the $i$-th token and all the tokens in the sequence.
  2. We then compute $PV$ to get a “meaning interpolation” for each row (i.e., each token). The resulting matrix $O\in\mathbb{R}^{n\times d_v}$ encodes the “meaning” of each token (i.e., each row).
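
Putting the projection and the two steps together, here is a minimal single-head sketch in NumPy. It assumes no masking, no batching, and random weights in place of learned ones; the helper names `softmax` and `self_attention` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max to avoid overflow
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention (no mask, no batch dimension)."""
    # Q, K, V all come from the same input X.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    # Step 1: scaled scores, then a row-wise softmax -> P has shape (n, n).
    P = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    # Step 2: each output row is a weighted average of the rows of V.
    O = P @ V
    return O, P

# Illustrative sizes and random weights (assumptions for this sketch).
rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 4, 8, 3, 5
X = rng.standard_normal((n, d_model))
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_v))

O, P = self_attention(X, W_Q, W_K, W_V)
print(P.shape, O.shape)  # (4, 4) (4, 5)
```

Because the softmax is applied per row, every row of `P` is non-negative and sums to 1, which is why each row of `O` can be read as an interpolation of the rows of `V`.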