Rotary Position Embedding

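To inject positional information into token embeddings, we want the inner product of a query at position $m$ and a key at position $n$ to depend on the two positions only through their offset $m-n$: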

f_q(\mathbf{x}_m,m)^\top f_k(\mathbf{x}_n,n)=g(\mathbf{x}_m,\mathbf{x}_n,m-n)

The sinusoidal positional encoding of the original Transformer paper ("Attention Is All You Need") relies on a related idea: encode the position index as an angle, so that $f'(m\theta,n\theta)=g'((m-n)\theta)$. Trigonometric functions are a natural way to achieve this.

When the embedding dimension is 2, i.e., $d=2$, we can use the 2D rotation matrix:

\mathbf R_m=\begin{pmatrix}\cos m\theta & -\sin m\theta\\\sin m\theta & \cos m\theta\end{pmatrix}

Since $\mathbf R_m^\top=\mathbf R_{-m}$, rotation matrices satisfy $\mathbf R_m^\top\mathbf R_n=\mathbf R_{n-m}$. Thus, if we let

f_q(\mathbf x_m,m)=\mathbf R_m\mathbf W_q\mathbf x_m, \quad f_k(\mathbf x_n,n)=\mathbf R_n\mathbf W_k\mathbf x_n

then we have

\begin{aligned} f_q^\top f_k&=\Big( \mathbf R_m\mathbf W_q\mathbf x_m \Big)^\top \mathbf R_n\mathbf W_k\mathbf x_n\\ &=(\mathbf W_q\mathbf x_m)^\top \mathbf R_m^\top\mathbf R_n (\mathbf W_k\mathbf x_n)\\ &=(\mathbf W_q\mathbf x_m)^\top \mathbf R_{n-m} (\mathbf W_k\mathbf x_n) \end{aligned}

which depends on the positions only through the offset $m-n$, as required.
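The 2D case can be checked numerically. The following is a minimal sketch in plain Python. The helper names `rotate_2d` and `dot`, the angle `theta = 0.1`, and the vectors `q` and `k` (standing in for $\mathbf W_q\mathbf x_m$ and $\mathbf W_k\mathbf x_n$) are illustrative assumptions, not part of the original derivation.

```python
import math

def rotate_2d(vec, pos, theta):
    """Apply R_pos, i.e. a rotation by the angle pos * theta, to a 2D vector."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (c * x - s * y, s * x + c * y)

def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

theta = 0.1          # arbitrary illustrative angle
q = (0.3, -1.2)      # stands in for W_q x_m (assumed values)
k = (0.7, 0.5)       # stands in for W_k x_n (assumed values)

# Same offset n - m = 3 at two different absolute positions.
score_a = dot(rotate_2d(q, 2, theta), rotate_2d(k, 5, theta))
score_b = dot(rotate_2d(q, 40, theta), rotate_2d(k, 43, theta))

# Leaving q unrotated and rotating k by the offset alone gives the same score,
# because R_m^T R_n is a single rotation by (n - m) * theta.
score_rel = dot(q, rotate_2d(k, 3, theta))

print(score_a, score_b, score_rel)
```

All three scores should agree up to floating-point error, since only the offset between the two positions enters the result.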

Thus, for a general even $d$, we can split the $d$ dimensions into $d/2$ pairs and apply the 2D case to each pair.

Note that each pair $i$ is assigned its own angle $\theta_i$, $i=1,2,\dots,d/2$, while all $d/2$ pairs of the same token share the same position $m$ in $\mathbf R_m$.
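The pairwise construction can be sketched as follows, in plain Python. The text does not specify how the angles $\theta_i$ are chosen; the geometric schedule `base ** (-2 * i / d)` with `base = 10000` used here is an assumption (it is the schedule from the RoFormer paper), and the function name `rope_adjacent` is hypothetical.

```python
import math

def rope_adjacent(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by the angle pos * theta_i."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("embedding dimension must be even")
    out = []
    for i in range(d // 2):
        theta_i = base ** (-2.0 * i / d)   # assumed schedule, one angle per pair
        angle = pos * theta_i              # every pair shares the same position
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend((c * x - s * y, s * x + c * y))
    return out

def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

q = [0.3, -1.2, 0.8, 0.1, -0.5, 0.9]   # illustrative query vector, d = 6
k = [0.7, 0.5, -0.4, 1.1, 0.2, -0.6]   # illustrative key vector, d = 6

# The offset n - m = 4 is the same in both cases, so the scores should match.
score_near = dot(rope_adjacent(q, 1, ), rope_adjacent(k, 5))
score_far = dot(rope_adjacent(q, 100), rope_adjacent(k, 104))
print(score_near, score_far)
```

Because each pair is rotated independently, the full transform is block-diagonal, and the relative-position property of the 2D case carries over pair by pair.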

RoPE

In practice, pairing adjacent elements is not efficient. Instead, implementations usually split the $d$ elements into a first half and a second half, each contiguous, and pair element $i$ of the first half with element $i$ of the second half.
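A minimal sketch of the half-split layout, in plain Python for clarity (real implementations would operate on whole tensors). As above, the function name `rope_half_split` and the angle schedule `base ** (-2 * i / d)` are assumptions rather than something the text specifies.

```python
import math

def rope_half_split(vec, pos, base=10000.0):
    """Pair vec[i] with vec[i + d/2] and rotate each such pair by pos * theta_i."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("embedding dimension must be even")
    half = d // 2
    first, second = vec[:half], vec[half:]
    # Assumed schedule: one angle per pair, shared by both halves.
    angles = [pos * base ** (-2.0 * i / d) for i in range(half)]
    cos = [math.cos(a) for a in angles]
    sin = [math.sin(a) for a in angles]
    # Same 2D rotation as before, written as two contiguous slices.
    out_first = [c * x - s * y for c, s, x, y in zip(cos, sin, first, second)]
    out_second = [s * x + c * y for c, s, x, y in zip(cos, sin, first, second)]
    return out_first + out_second

def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

q = [0.3, -1.2, 0.8, 0.1, -0.5, 0.9]   # illustrative query vector, d = 6
k = [0.7, 0.5, -0.4, 1.1, 0.2, -0.6]   # illustrative key vector, d = 6

# Same offset n - m = 4 at two different absolute positions.
score_near = dot(rope_half_split(q, 1), rope_half_split(k, 5))
score_far = dot(rope_half_split(q, 100), rope_half_split(k, 104))
print(score_near, score_far)
```

The relative-position property is unchanged, since only the choice of which coordinates form a pair differs. The two layouts are related by a fixed permutation of the coordinates, so weights trained with one layout would need that permutation applied before being used with the other.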