[Paper] Merge Then Compress

Methodology

The algorithm proposed in this paper has three parts: Expert Alignment, Expert Merge, and Expert Compression. Each part is explained in turn below.

Expert Alignment

An explanation from the vector-space perspective (though this seems to be a mistaken understanding)

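The original motivation for using an MoE model is the hope that "each expert learns knowledge from a different domain". That is a rather intuitive way of putting it. Viewed in terms of the vector space in which word embeddings live, "knowledge from different domains" corresponds to different vector spaces: the same token has a different embedding in each space, which is exactly what "learning knowledge from different domains" is meant to achieve. We can think of a vector space as a kind of semantic space.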

The experts in mainstream MoE large models are still essentially FFNs. Suppose expert $E_i$ has weights $W_{in}, W_{out}$; its computation can then be summarized as

\boxed{\boldsymbol{x_{in}}}\underset{W_{in}}{\longrightarrow} \boxed{\text{Word Embedding of \(\boldsymbol{x}\) in Vector Space Specific to \(E_i\)}}\underset{W_{out}}{\longrightarrow}\boxed{\boldsymbol{x_{out}}}

In mathematical terms:

E_i:X_{in}\to X_{out}\\ E_i(\boldsymbol{x})=W_{out}\texttt{act}(W_{in}\boldsymbol{x})
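The formula above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: ReLU stands in for `act` (the text does not say which activation is used), and the names `expert_forward`, `d_model`, and `d_hidden` as well as the tiny dimensions are made up for the example.

```python
import numpy as np

def relu(z):
    """Elementwise activation. ReLU is an assumption; the text only says `act`."""
    return np.maximum(z, 0.0)

def expert_forward(w_in, w_out, x):
    """Compute E_i(x) = W_out act(W_in x) for a single token vector x."""
    hidden = relu(w_in @ x)  # token representation inside the expert, shape (d_hidden,)
    return w_out @ hidden    # projected back to the model dimension, shape (d_model,)

rng = np.random.default_rng(0)
d_model, d_hidden = 4, 8  # illustrative sizes only
w_in = rng.normal(size=(d_hidden, d_model))
w_out = rng.normal(size=(d_model, d_hidden))
x = rng.normal(size=d_model)

y = expert_forward(w_in, w_out, x)
print(y.shape)  # (4,)
```

The intermediate `hidden` vector is what the diagram calls the embedding of the token in the space specific to $E_i$.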

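The diagram above already hints at the problem. When we want to merge experts, we find that their parameters mean completely different things: the same token is mapped into a different semantic space by each expert, so from a semantic standpoint, directly taking a weighted average of the experts' parameters seems untenable. To close this gap, we first need to ensure that every expert maps the same token into the same semantic space. This step is expert alignment.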

Concretely, for each expert we find a permutation matrix $P$ such that the expert's final output is unchanged, but during the computation the token is mapped into the same semantic space:

permutation matrix $P$, which is a square matrix where each row and column has exactly one element of $1$, with all other elements being $0$

W_{out}\texttt{act}(W_{in}\boldsymbol{x})=W_{out}\textcolor{lime}{P^\top}\texttt{act}(\textcolor{lime}{P}W_{in}\boldsymbol{x})

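This identity holds mathematically: $\texttt{act}$ is applied elementwise, so it commutes with the permutation, and $P^\top P = I$. Therefore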

\boxed{\boldsymbol{x_{in}}}\underset{\textcolor{lime}{P}W_{in}}{\longrightarrow} \boxed{\text{Word Embedding of \(\boldsymbol{x}\) in Unified Semantic Space}}\underset{W_{out}\textcolor{lime}{P^\top}}{\longrightarrow}\boxed{\boldsymbol{x_{out}}}
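The permutation identity can be checked numerically. The sketch below is not the paper's alignment procedure (it does not search for a good $P$). It only assumes ReLU as `act`, uses a hand-picked cyclic permutation, and shows two things: permuting the hidden units leaves the output unchanged, while naively averaging an expert with a permuted copy of itself generally does not. All names are hypothetical.

```python
import numpy as np

def relu(z):
    """Elementwise activation. ReLU is an assumption; the text only says `act`."""
    return np.maximum(z, 0.0)

def expert_forward(w_in, w_out, x):
    """Compute W_out act(W_in x) for a single token vector x."""
    return w_out @ relu(w_in @ x)

rng = np.random.default_rng(0)
d_model, d_hidden = 4, 8  # illustrative sizes only
w_in = rng.normal(size=(d_hidden, d_model))
w_out = rng.normal(size=(d_model, d_hidden))
x = rng.normal(size=d_model)

# A fixed, non-identity permutation matrix: cyclically shifted rows of the identity.
perm = np.roll(np.arange(d_hidden), 1)
P = np.eye(d_hidden)[perm]

# Permuting the hidden units: rows of W_in by P, columns of W_out by P^T.
y_original = expert_forward(w_in, w_out, x)
y_permuted = expert_forward(P @ w_in, w_out @ P.T, x)
same_output = np.allclose(y_original, y_permuted)

# Averaging two functionally identical but unaligned experts changes the function.
w_in_avg = 0.5 * (w_in + P @ w_in)
w_out_avg = 0.5 * (w_out + w_out @ P.T)
y_naive = expert_forward(w_in_avg, w_out_avg, x)
naive_matches = np.allclose(y_original, y_naive)

print("permuted expert matches original:", same_output)
print("naive average matches original:", naive_matches)
```

The second check is the reason alignment comes before merging: two experts can compute exactly the same function and still have weights whose element-wise average computes something else, simply because their hidden units are stored in a different order.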