Motivation

Shared Structure between Heads

The authors designed an experiment to test whether heads within a layer share a common structure, motivated by several analyses (Michel et al., 2019; Voita et al., 2019; Yang et al., 2024) showing that heads in MHA often learn redundant patterns.

Design. For a given head $i$ at layer $\ell$, we examine the concatenated matrix (why?)

$$\mathbf{H}_{\ell,i}=[\mathbf{W}^Q|\mathbf{W}^K|\mathbf{W}^V|(\mathbf{W}^O)^\top]\in\mathbb{R}^{d_{\text{model}} \times 4d_{\text{head}}}$$

Then we perform SVD on the source matrix and take its top-$r$ left singular vectors to form an orthonormal basis $\mathbf{U}^{(r)}$. This basis is used to project a target matrix $\mathbf{H}$ onto the subspace and map it back, i.e. $\mathbf{H}'=\mathbf{U}^{(r)}(\mathbf{U}^{(r)})^\top \mathbf{H}$, for three kinds of target:

  1. self: the source matrix $\mathbf{H}_{\ell,i}$ itself
  2. intra: the matrix of another head $j$ from the same layer $\ell$
  3. inter: the matrix of a head from a different layer $k$

The reconstruction quality is measured by

$$1-E^2, \quad E=\frac{\| \mathbf{H}-\mathbf{H}' \|_F}{\| \mathbf{H} \|_F}$$

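The larger this value, the better the source head's basis reconstructs the target matrix, and thus the more similar the two matrices are in terms of their subspaces. Because the map from $\mathbf{H}$ to $\mathbf{H}'$ is an orthogonal projection, $1-E^2$ equals the fraction of the target's squared Frobenius norm that the subspace captures. This provides a theoretical basis for sharing.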
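The self / intra / inter comparison can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the helper names (`concat_head`, `top_r_basis`, `reconstruction_quality`, `random_head`), the dimensions, and the synthetic weights are all assumptions. Heads in the "same layer" are generated to lie mostly in one shared random subspace purely to exercise the metric; nothing here reproduces the paper's results.

```python
import numpy as np

def concat_head(W_Q, W_K, W_V, W_O):
    """Build H = [W^Q | W^K | W^V | (W^O)^T], shape (d_model, 4 * d_head)."""
    return np.concatenate([W_Q, W_K, W_V, W_O.T], axis=1)

def top_r_basis(H_src, r):
    """Top-r left singular vectors of the source matrix, shape (d_model, r)."""
    U, _, _ = np.linalg.svd(H_src, full_matrices=False)
    return U[:, :r]

def reconstruction_quality(U_r, H_tgt):
    """Return 1 - E^2 with E = ||H - H'||_F / ||H||_F and H' = U_r U_r^T H."""
    H_rec = U_r @ (U_r.T @ H_tgt)
    E = np.linalg.norm(H_tgt - H_rec) / np.linalg.norm(H_tgt)
    return 1.0 - E ** 2

def random_head(rng, basis, d_head, noise=0.1):
    """Hypothetical head whose weights lie mostly in span(basis), plus noise."""
    d_model, k = basis.shape
    def w():
        signal = basis @ rng.standard_normal((k, d_head))
        return signal + noise * rng.standard_normal((d_model, d_head))
    # W_O maps d_head -> d_model, so it is stored as (d_head, d_model).
    return w(), w(), w(), w().T

rng = np.random.default_rng(0)
d_model, d_head, r = 64, 8, 16            # assumed toy sizes

basis_l = rng.standard_normal((d_model, 12))   # shared subspace, layer l
basis_k = rng.standard_normal((d_model, 12))   # unrelated subspace, layer k

H_i = concat_head(*random_head(rng, basis_l, d_head))  # source head
H_j = concat_head(*random_head(rng, basis_l, d_head))  # same layer
H_k = concat_head(*random_head(rng, basis_k, d_head))  # different layer

U_r = top_r_basis(H_i, r)
q_self = reconstruction_quality(U_r, H_i)
q_intra = reconstruction_quality(U_r, H_j)
q_inter = reconstruction_quality(U_r, H_k)
print(f"self={q_self:.3f} intra={q_intra:.3f} inter={q_inter:.3f}")
```

Sweeping `r` from 1 up to `4 * d_head` and plotting the three scores would give curves of the kind the experiment describes. With real checkpoints, `random_head` would be replaced by the per-head slices of the model's attention weights.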

The experiment figure shows that

Insight

A basis learned from one or more heads captures a nontrivial portion of the variation of other heads in the same layer, and the captured portion grows as more dimensions are allowed.

Dense Core Bottleneck

TensorLLM uses Tucker decomposition, which factorizes a tensor into a set of compact factor matrices and a core tensor.
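As a reminder of what a Tucker factorization looks like, here is a minimal truncated HOSVD in NumPy. HOSVD is one standard way to compute a Tucker decomposition; it is used here as an assumption for illustration and is not necessarily the procedure TensorLLM follows. The helper names, tensor shape, and ranks are hypothetical.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Multiply tensor T along `mode` by matrix M of shape (new_dim, T.shape[mode])."""
    out = np.tensordot(M, T, axes=(1, mode))   # contracted axis is replaced by axis 0
    return np.moveaxis(out, 0, mode)

def hosvd(T, ranks):
    """Truncated HOSVD: one factor matrix per mode plus a core tensor."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])               # top-r left singular vectors
    core = T
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)   # project each mode onto its basis
    return core, factors

def reconstruct(core, factors):
    """Expand the core back to the original shape."""
    T = core
    for mode, U in enumerate(factors):
        T = mode_product(T, U, mode)
    return T

rng = np.random.default_rng(0)
ranks = (3, 3, 2)
# Build a tensor with exact multilinear rank `ranks`, so truncation is lossless.
G = rng.standard_normal(ranks)
A, B, C = (rng.standard_normal((n, r)) for n, r in zip((6, 5, 4), ranks))
T = reconstruct(G, [A, B, C])

core, factors = hosvd(T, ranks)
T_hat = reconstruct(core, factors)
rel_err = np.linalg.norm(T - T_hat) / np.linalg.norm(T)
print(core.shape, [U.shape for U in factors], rel_err)
```

Note that the core holds `r1 * r2 * r3` entries and is dense in general, so its size grows multiplicatively with the ranks while each factor matrix grows only linearly in its own rank.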

Methodology

Preprocessing

Stage 1

Similar to fine-grained quantization (but applied along different dimensions).

Stage 2

Inference without Reconstruction