To embed positional information into token embeddings, we want the following property:

$$f_q(\mathbf{x}_m,m)^\top f_k(\mathbf{x}_n,n)=g(\mathbf{x}_m,\mathbf{x}_n,m-n)$$

One idea, used by the vanilla positional embedding in the "Attention Is All You Need" paper, is to encode the position index as an angle, i.e. $f'(m\theta,n\theta)=g'((m-n)\theta)$. Trigonometric functions can realize this.

When the word embedding dimension is 2, i.e., $d=2$, we can use the 2D rotation matrix:

$$\mathbf R_m=\begin{pmatrix}\cos m\theta & -\sin m\theta\\\sin m\theta & \cos m\theta\end{pmatrix}$$

Since $\mathbf R_m^\top=\mathbf R_{-m}$, the rotation matrix satisfies $\mathbf R_m^\top\mathbf R_n=\mathbf R_{n-m}$, which depends only on the relative position $m-n$. Thus, if we let

$$f_q(\mathbf x_m,m)=\mathbf R_m\mathbf W_q\mathbf x_m,\quad f_k(\mathbf x_n,n)=\mathbf R_n\mathbf W_k\mathbf x_n$$

Then we have

$$\begin{aligned} f_q^\top f_k&=\Big( \mathbf R_m\mathbf W_q\mathbf x_m \Big)^\top \mathbf R_n\mathbf W_k\mathbf x_n\\ &=(\mathbf W_q\mathbf x_m)^\top \mathbf R_m^\top\mathbf R_n (\mathbf W_k\mathbf x_n)\\ &=(\mathbf W_q\mathbf x_m)^\top \mathbf R_{n-m} (\mathbf W_k\mathbf x_n) \end{aligned}$$
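
The 2D derivation can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names `rotation` and `score`, the random weights, and the angle `theta = 0.3` are illustrative choices, not part of the original derivation. Because transposing a rotation negates its angle, the product $\mathbf R_m^\top\mathbf R_n$ is the rotation by $(n-m)\theta$, which is a function of the offset $m-n$ alone.

```python
import numpy as np

def rotation(pos, theta):
    """2D rotation matrix R_pos for the angle pos * theta."""
    a = pos * theta
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

def score(x_m, x_n, m, n, W_q, W_k, theta):
    """Attention logit f_q(x_m, m)^T f_k(x_n, n) for d = 2."""
    q = rotation(m, theta) @ W_q @ x_m
    k = rotation(n, theta) @ W_k @ x_n
    return q @ k

rng = np.random.default_rng(0)
W_q, W_k = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
x_m, x_n = rng.normal(size=2), rng.normal(size=2)
theta = 0.3  # arbitrary illustrative angle (assumption)

# R_m^T R_n equals the rotation by (n - m) * theta.
composition_ok = np.allclose(rotation(5, theta).T @ rotation(2, theta),
                             rotation(2 - 5, theta))

# Shifting both positions by the same amount leaves the logit unchanged.
shift_invariant = np.isclose(score(x_m, x_n, 5, 2, W_q, W_k, theta),
                             score(x_m, x_n, 105, 102, W_q, W_k, theta))
```

Both flags should evaluate to true: the logit depends on the positions only through their difference.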


Thus, when $d$ is larger and even, we can split the $d$ dimensions into $d/2$ pairs and apply the 2D case to each pair.

Note that each pair must be assigned its own $\theta_i$, $i=1,2,\dots,d/2$. All $d/2$ pairs of the same token, however, share the same position $m$ in their $\mathbf R_m$ matrices.
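
The pairwise construction can be sketched as follows, assuming NumPy. The function name `rope_adjacent` is hypothetical, and the frequency schedule $\theta_i = 10000^{-2i/d}$ (zero-based $i$) is borrowed from the RoFormer paper as an assumption; the text above only requires that each pair has its own $\theta_i$. The vectors `q` and `k` stand in for $\mathbf W_q\mathbf x_m$ and $\mathbf W_k\mathbf x_n$.

```python
import numpy as np

def rope_adjacent(x, pos, thetas):
    """Rotate each adjacent pair (x[2i], x[2i+1]) by the angle pos * thetas[i]."""
    cos, sin = np.cos(pos * thetas), np.sin(pos * thetas)
    even, odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = even * cos - odd * sin
    out[1::2] = even * sin + odd * cos
    return out

d = 8
# Assumed schedule (RoFormer-style): theta_i = 10000^(-2i/d), i = 0, ..., d/2 - 1.
thetas = 10000.0 ** (-2.0 * np.arange(d // 2) / d)

rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)  # stand-ins for W_q x_m and W_k x_n

# Same offset (4) at two different absolute positions.
logit = rope_adjacent(q, 7, thetas) @ rope_adjacent(k, 3, thetas)
shifted = rope_adjacent(q, 57, thetas) @ rope_adjacent(k, 53, thetas)
shift_invariant = np.isclose(logit, shifted)

# Each pair is only rotated, so the vector norm is preserved.
norm_preserved = np.isclose(np.linalg.norm(rope_adjacent(q, 7, thetas)),
                            np.linalg.norm(q))
```

Every pair uses the same position `pos` but a different angle step, which matches the note above: one $m$ per token, one $\theta_i$ per pair.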

RoPE

In practice, it is not efficient to pair up adjacent elements. Instead, we usually split the $d$ elements into a first half and a second half, each of which is contiguous, and pair element $i$ of the first half with element $i$ of the second half.
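
A minimal sketch of the half-split layout, assuming NumPy. The names `rope_half` and `rope_adjacent` are hypothetical, and the $\theta_i$ schedule is again the RoFormer-style assumption rather than something fixed by the text. The sketch also checks that the two layouts differ only by a fixed permutation of the dimensions, which suggests that either one preserves the relative-position property.

```python
import numpy as np

def rope_half(x, pos, thetas):
    """Pair x[i] with x[i + d/2] and rotate each pair by pos * thetas[i]."""
    half = x.shape[0] // 2
    first, second = x[:half], x[half:]  # two contiguous slices
    cos, sin = np.cos(pos * thetas), np.sin(pos * thetas)
    return np.concatenate([first * cos - second * sin,
                           first * sin + second * cos])

def rope_adjacent(x, pos, thetas):
    """Reference layout: rotate adjacent pairs (x[2i], x[2i+1])."""
    cos, sin = np.cos(pos * thetas), np.sin(pos * thetas)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

d = 8
# Assumed schedule (RoFormer-style): theta_i = 10000^(-2i/d), i = 0, ..., d/2 - 1.
thetas = 10000.0 ** (-2.0 * np.arange(d // 2) / d)

rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)

# The logit still depends only on the offset between positions.
shift_invariant = np.isclose(rope_half(q, 7, thetas) @ rope_half(k, 3, thetas),
                             rope_half(q, 57, thetas) @ rope_half(k, 53, thetas))

# Moving even indices to the front maps the adjacent layout onto the half-split one.
perm = np.concatenate([np.arange(0, d, 2), np.arange(1, d, 2)])
layouts_match = np.allclose(rope_adjacent(q, 7, thetas)[perm],
                            rope_half(q[perm], 7, thetas))
```

One consequence of the permutation is that projection weights trained with one layout would need their output dimensions reordered before being used with the other.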