LeSTD: Learning-Based Sparse Tensor Decomposition

Motivation

Shared Structure between Heads

The authors design an experiment to test whether a shared structure exists across different heads within a layer, motivated by prior analyses (Michel et al., 2019; Voita et al., 2019; Yang et al., 2024) reporting that heads in MHA often learn redundant patterns.

Design. For a given head $i$ at layer $\ell$, we examine the concatenated matrix (why?)

\mathbf{H}_{\ell,i}=[\mathbf{W}^Q|\mathbf{W}^K|\mathbf{W}^V|(\mathbf{W}^O)^\top]\in\mathbb{R}^{d_{\text{model}} \times 4d_{\text{head}}}
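
As a minimal sketch of how such a matrix could be assembled, the snippet below concatenates one head's four projections along the column axis. The storage shapes are assumptions not stated in the notes ($\mathbf{W}^Q$, $\mathbf{W}^K$, $\mathbf{W}^V$ as $d_{\text{model}} \times d_{\text{head}}$, and $\mathbf{W}^O$ as $d_{\text{head}} \times d_{\text{model}}$), `build_head_matrix` is a hypothetical helper name, and random arrays stand in for real weights.

```python
import numpy as np

def build_head_matrix(w_q, w_k, w_v, w_o):
    """Concatenate one head's projections into H = [W^Q | W^K | W^V | (W^O)^T].

    Assumed shapes (not given in the notes): w_q, w_k, w_v are
    (d_model, d_head); w_o is (d_head, d_model) and is transposed here.
    """
    return np.concatenate([w_q, w_k, w_v, w_o.T], axis=1)

rng = np.random.default_rng(0)
d_model, d_head = 64, 16  # illustrative sizes only
w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
w_o = rng.standard_normal((d_head, d_model))

H = build_head_matrix(w_q, w_k, w_v, w_o)
print(H.shape)  # (d_model, 4 * d_head)
```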

Then, we perform SVD on $\mathbf{H}_{\ell,i}$ and take its top-$r$ left singular vectors as an orthonormal basis $\mathbf{U}^{(r)}$. A target matrix $\mathbf{H}$ is projected onto this basis and then re-projected back, i.e. $\mathbf{H}'=\mathbf{U}^{(r)}(\mathbf{U}^{(r)})^\top \mathbf{H}$, where the target $\mathbf{H}$ is one of:

self: the source matrix $\mathbf{H}_{\ell,i}$ itself
intra: another head $j$ from same layer $\ell$
inter: another head’s matrix from a different layer $k$
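
The projection step for the three settings above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `top_r_basis` and `project` are hypothetical helper names, and random matrices stand in for the head matrices of the source head, a same-layer head, and a different-layer head.

```python
import numpy as np

def top_r_basis(h_source, r):
    """Return the top-r left singular vectors of h_source as columns."""
    u, _, _ = np.linalg.svd(h_source, full_matrices=False)
    return u[:, :r]

def project(basis, h_target):
    """Project h_target onto span(basis) and map back: U U^T H."""
    return basis @ (basis.T @ h_target)

rng = np.random.default_rng(0)
d_model, d_head, r = 64, 16, 8  # illustrative sizes only
h_i = rng.standard_normal((d_model, 4 * d_head))  # source head (layer l, head i)
h_j = rng.standard_normal((d_model, 4 * d_head))  # stand-in for a same-layer head
h_k = rng.standard_normal((d_model, 4 * d_head))  # stand-in for a head in another layer

U = top_r_basis(h_i, r)
h_self = project(U, h_i)    # self
h_intra = project(U, h_j)   # intra
h_inter = project(U, h_k)   # inter
```

Only the basis comes from the source head; the same `U` is reused for every target, which is what makes the comparison across the three settings meaningful.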

The reconstruction quality is measured by

1-E^2, E=\frac{\| \mathbf{H}-\mathbf{H}' \|_F}{\| \mathbf{H} \|_F}

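The larger this value, the better the basis from head $i$ reconstructs the target matrix. Because $\mathbf{H}'$ is an orthogonal projection of $\mathbf{H}$, $1-E^2=\| \mathbf{H}' \|_F^2/\| \mathbf{H} \|_F^2$, i.e. the fraction of the target's squared Frobenius norm that lies in the source head's subspace. A high value therefore indicates that the two matrices span similar subspaces, which provides the basis for sharing.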
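
A minimal sketch of this metric, assuming the basis has orthonormal columns (as it does when taken from an SVD). `reconstruction_quality` is a hypothetical helper name, and the random matrices are placeholders, so the scores only illustrate the mechanics rather than any result from the paper.

```python
import numpy as np

def reconstruction_quality(basis, h_target):
    """Compute 1 - E^2 with E = ||H - H'||_F / ||H||_F and H' = U U^T H."""
    h_rec = basis @ (basis.T @ h_target)
    # np.linalg.norm defaults to the Frobenius norm for 2-D arrays.
    e = np.linalg.norm(h_target - h_rec) / np.linalg.norm(h_target)
    return 1.0 - e ** 2

rng = np.random.default_rng(0)
h_i = rng.standard_normal((64, 64))  # source matrix (placeholder data)
h_j = rng.standard_normal((64, 64))  # another matrix (placeholder data)

u, _, _ = np.linalg.svd(h_i, full_matrices=False)
ranks = (4, 16, 64)
scores_self = [reconstruction_quality(u[:, :r], h_i) for r in ranks]
scores_other = [reconstruction_quality(u[:, :r], h_j) for r in ranks]
print(scores_self, scores_other)
```

With unrelated random matrices, the score for the other matrix stays close to the chance level of roughly $r/d_{\text{model}}$; a score well above that level is what would indicate shared structure.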

The experiment figure shows that

Insight

A basis learned from one or more heads captures a nontrivial portion of the variation of other heads in the same layer, and the captured portion grows as more dimensions are allowed.

Dense Core Bottleneck

TensorLLM uses Tucker decomposition, which factorizes a tensor into a set of compact factor matrices and a core tensor.
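
To make the factor-matrices-plus-core structure concrete, here is a small NumPy sketch of a Tucker decomposition computed with truncated higher-order SVD (HOSVD). HOSVD is one standard way to obtain a Tucker factorization and is used here only for illustration; it is not claimed to be the procedure TensorLLM follows. The helper names, the tensor shape, and the ranks are all assumptions.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` along `mode` by `matrix` of shape (new_dim, old_dim)."""
    moved = np.moveaxis(tensor, mode, -1)   # (..., old_dim)
    out = moved @ matrix.T                  # (..., new_dim)
    return np.moveaxis(out, -1, mode)

def hosvd(tensor, ranks):
    """Truncated HOSVD: one orthonormal factor per mode plus a core tensor."""
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :r])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)  # shrink each mode to its rank
    return core, factors

def reconstruct(core, factors):
    """Expand the core back to the original shape with the factor matrices."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6, 4))          # illustrative 3-way tensor
core, factors = hosvd(X, ranks=(4, 3, 2))
X_hat = reconstruct(core, factors)
print(core.shape, [u.shape for u in factors], X_hat.shape)
```

The core has one entry for every combination of per-mode ranks (here $4 \times 3 \times 2$), and in general all of them are nonzero, so its size grows with the product of the ranks.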

Methodology

Preprocessing

Stage 1

Stage 2

Inference without Reconstruction