Motivation
Shared Structure between Heads
The authors design an experiment to test whether a shared structure exists across different heads within a layer, motivated by several analyses (Michel et al., 2019; Voita et al., 2019; Yang et al., 2024) reporting that heads in MHA often learn redundant patterns.
Design. For a given head at layer , we examine on a concatenated matrix (why?)
Then, we perform SVD and take the top- left singular vectors to form an orthonormal basis , onto which a target matrix is projected and then re-projected. The target is one of:
- self: the source matrix itself
- intra: another head from the same layer
- inter: another head’s matrix from a different layer
The reconstruction quality is measured by
where a larger value means the two matrices are more similar, and thus span more similar subspaces, which provides a theoretical basis for sharing.
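The project-and-re-project test can be sketched in NumPy as below. This is an illustration under assumptions, not the authors' code: the quality score used here (fraction of the target's squared Frobenius norm that survives projection) is an assumed stand-in for the paper's measure, and the matrices, sizes, and names (`top_k_basis`, `captured_energy`, `head_a`, `head_b`, `unrelated`) are hypothetical synthetic placeholders.

```python
import numpy as np


def top_k_basis(source, k):
    """Top-k left singular vectors of `source`, as an orthonormal basis (columns)."""
    u, _, _ = np.linalg.svd(source, full_matrices=False)
    return u[:, :k]


def captured_energy(basis, target):
    """Assumed metric: share of the target's squared Frobenius norm
    kept after projecting onto the basis and re-projecting."""
    reconstruction = basis @ (basis.T @ target)
    return float(np.linalg.norm(reconstruction) ** 2 / np.linalg.norm(target) ** 2)


rng = np.random.default_rng(0)
d, n, k = 64, 48, 8

# Synthetic stand-ins: two "heads" built on a common subspace, one unrelated matrix.
shared = rng.standard_normal((d, k))
head_a = shared @ rng.standard_normal((k, n)) + 0.1 * rng.standard_normal((d, n))
head_b = shared @ rng.standard_normal((k, n)) + 0.1 * rng.standard_normal((d, n))
unrelated = rng.standard_normal((d, n))

basis = top_k_basis(head_a, k)
scores = {
    "self": captured_energy(basis, head_a),      # source matrix itself
    "intra": captured_energy(basis, head_b),     # plays the role of a same-layer head
    "inter": captured_energy(basis, unrelated),  # plays the role of a different-layer head
}
print(scores)
```

On this synthetic data the self and intra scores come out close to 1 while the unrelated matrix scores much lower, because only the first two share a subspace by construction. Real heads would have to be loaded from a model to say anything about the paper's claim.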
The experiment figure shows that
Insight
A basis learned from one or more heads captures a nontrivial portion of the variation of other heads in the same layer, and that captured portion grows as more dimensions are allowed.
Dense Core Bottleneck
TensorLLM uses Tucker decomposition, which factorizes a tensor into a set of compact factor matrices and a core tensor.
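To make the factor-matrices-plus-core structure concrete, here is a minimal truncated HOSVD in NumPy, which is one standard way to compute a Tucker decomposition. Whether TensorLLM uses this exact procedure is not stated in these notes, so treat it as a generic sketch; the helper names (`unfold`, `mode_product`, `hosvd`, `reconstruct`) and the tensor sizes are hypothetical.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the other axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` (shape J x I_mode) along axis `mode`."""
    moved = np.moveaxis(tensor, mode, -1)  # (..., I_mode)
    return np.moveaxis(moved @ matrix.T, -1, mode)


def hosvd(tensor, ranks):
    """Truncated HOSVD: one orthonormal factor per mode plus a dense core."""
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)  # project each mode onto its basis
    return core, factors


def reconstruct(core, factors):
    """Expand the core back to full size with the factor matrices."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out


rng = np.random.default_rng(0)
ranks = (2, 3, 2)
# Synthetic tensor with exact multilinear rank (2, 3, 2).
true_core = rng.standard_normal(ranks)
true_factors = [rng.standard_normal((size, r)) for size, r in zip((6, 7, 5), ranks)]
tensor = reconstruct(true_core, true_factors)

core, factors = hosvd(tensor, ranks)
approx = reconstruct(core, factors)
print(core.shape, [f.shape for f in factors])
```

The core has one axis per mode and is generally dense: every entry couples one basis vector from each factor matrix, so its size grows as the product of the chosen ranks.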
Methodology
Preprocessing
Stage 1
Similar to fine-grained quantization (but along different dimensions).