Radio: Rate-Distortion Optimization for LLM Compression

The core challenge this paper addresses is building a theoretical foundation for LLM quantization on rate-distortion theory. On that foundation, it designs an optimization method that addresses shortcomings in bit allocation (mixed precision) and quantization error correction. The result is an optimal post-training quantization procedure that meets a user's requirements on model size and accuracy.

RADIO Method

RADIO formulates quantization as a constrained least-squares optimization problem.

Goal. Given a target average bit rate for the model (i.e. the average number of bits per weight), minimize the error between the model's outputs before and after quantization.
Design variables. The bit depth $B_n$ and step size $D_n$ of each weight matrix, or of each finer-grained weight group.
Constraint. The total number of bits equals the target bit rate multiplied by the total number of weights.

We first model the NLP task as next-token prediction:

\mathbf{Z}=f_{[\Theta_1,\dots,\Theta_N,\mathbf{b}_1,\dots,\mathbf{b}_N]}(\mathbf{X})

where $\mathbf{X} \in\mathbb{R}^{L\times E}$ denotes a sequence of $L$ tokens, each with embedding dimension $E$, and $\Theta_{mM+1}, \dots, \Theta_{(m+1)M}, \mathbf{b}_{mM+1}, \dots, \mathbf{b}_{(m+1)M}$ denote the parameters of the $m$-th transformer block.

We next define the quantization scheme. Given a bit depth $B$ and a step size $D$, the quantized weight is

\theta^q(B,D)=D\cdot\Big( \texttt{clip}\Big(\Big\lfloor\frac{\theta}{D}\Big\rfloor, -2^{B-1}, 2^{B-1}-1\Big) +0.5 \Big)
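A minimal NumPy sketch of this quantizer may help make the formula concrete. It treats the bin index as the floor of $\theta/D$, which is the rounding that the $+0.5$ re-centering corresponds to; the function name `quantize` and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np

def quantize(theta, bit_depth, step):
    """Uniform quantizer: floor to a bin index, clip to the signed B-bit range, re-center."""
    # Integer bin index of each weight (floor rounding is assumed here).
    index = np.floor(np.asarray(theta, dtype=np.float64) / step)
    # Keep the index inside the 2**B levels that B bits can represent.
    lo, hi = -2 ** (bit_depth - 1), 2 ** (bit_depth - 1) - 1
    index = np.clip(index, lo, hi)
    # The +0.5 moves each reconstructed value to the center of its bin.
    return step * (index + 0.5)

theta = np.array([-1.0, -0.26, 0.0, 0.3, 0.9])
theta_q = quantize(theta, bit_depth=2, step=0.5)
```

With $B=2$ and $D=0.5$ there are only four reconstruction levels, and any weight outside the covered range is clipped to the nearest outer level.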

Per-group quantization. Since it is impractical to choose $B,D$ for every single weight, a practical alternative is to divide the weights into groups and quantize each group.

Per-Group Quantization

A single $(B,D)$ pair is used to quantize a small group of weights. Empirically, a good choice of group is one row of a weight matrix.
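As a rough sketch of row-wise grouping, the snippet below quantizes each row of a matrix with its own $(B,D)$ pair. The step-size rule (spreading the $2^B$ bins over the row's value range) is an assumption made for this example and is not taken from the paper; `quantize_rows` is a hypothetical name, and floor rounding is assumed for the bin index.

```python
import numpy as np

def quantize_rows(weight, bit_depths):
    """Quantize each row of `weight` with its own bit depth and step size."""
    weight = np.asarray(weight, dtype=np.float64)
    out = np.empty_like(weight)
    steps = np.empty(weight.shape[0])
    for n, (row, bits) in enumerate(zip(weight, bit_depths)):
        # Assumed rule (not from the paper): cover [-max|row|, max|row|] with 2**bits bins.
        # This assumes the row is not all zeros.
        step = 2.0 * np.abs(row).max() / 2 ** bits
        index = np.clip(np.floor(row / step), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
        out[n] = step * (index + 0.5)
        steps[n] = step
    return out, steps

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
W_q, D = quantize_rows(W, bit_depths=[2, 4, 8])
# Largest absolute error per row; with this step rule it stays within half a step.
row_error = np.abs(W - W_q).max(axis=1)
```

Rows given more bits get a smaller step and therefore a smaller worst-case error, which is the trade-off that the bit depth assignment has to balance.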

Bit depth assignment. Suppose the weights are divided into groups indexed by $n$, and the pair $(B_n,D_n)$ is used to quantize the $n$-th group. How can we find a closed form for the optimal $(B_n,D_n)$ pairs?

We can pose this as an optimization problem. Suppose the target compression requires an average of $R$ bits per weight, and the $n$-th group has $P_n$ parameters. Following rate-distortion theory, we want to minimize the information lost to quantization, that is,

\begin{array}{rl} \min_{\{B_n\}} & d(\{ B_n \}) = \mathbb{E}_{\mathbf X} \Big\| f_{[\Theta^q(B_n,D_n)]}(\mathbf X) - f(\mathbf X) \Big\|^2 \\ \text{s.t.} & \sum_n P_n B_n - R \sum_n P_n = 0 \end{array}
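To make the rate constraint concrete, here is a small worked example in Python. The group sizes and bit depths are made-up numbers, and `average_bit_rate` is a hypothetical helper.

```python
def average_bit_rate(group_sizes, bit_depths):
    """Average number of bits per weight for a given bit depth assignment."""
    total_bits = sum(p * b for p, b in zip(group_sizes, bit_depths))
    return total_bits / sum(group_sizes)

group_sizes = [1000, 3000]   # P_1, P_2
bit_depths = [6, 2]          # B_1, B_2
target_rate = 3.0            # R

rate = average_bit_rate(group_sizes, bit_depths)  # (6000 + 6000) / 4000
# Left-hand side of the constraint: total bits used minus the bit budget.
residual = sum(p * b for p, b in zip(group_sizes, bit_depths)) - target_rate * sum(group_sizes)
```

Here the small group spends 6 bits per weight and the large group only 2, yet the assignment still meets a budget of 3 bits per weight on average, so the constraint residual is zero.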

Applying the method of Lagrange multipliers to this problem and setting the partial derivative with respect to each $B_n$ to zero, we obtain

\frac{1}{P_n} \frac{\partial d(\{ B_n \})}{\partial B_n} = -\lambda
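For reference, the intermediate step can be written out as follows, with $\lambda$ as the multiplier on the bit-budget constraint (the symbol $\mathcal{L}$ for the Lagrangian is introduced here only for illustration):

\mathcal{L}(\{B_n\},\lambda) = d(\{B_n\}) + \lambda\Big(\sum_n P_n B_n - R\sum_n P_n\Big)

\frac{\partial \mathcal{L}}{\partial B_n} = \frac{\partial d(\{B_n\})}{\partial B_n} + \lambda P_n = 0

Dividing the second line by $P_n$ gives the condition stated above: at the optimum, the distortion reduction per additional bit, normalized by group size, is the same for every group.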

So we can solve the optimization problem by alternately updating the primal variables ( $B_n$ ) and the dual variable ( $\lambda$ ), a scheme known as dual ascent.
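The alternation can be sketched on a toy problem. The code below is not the paper's procedure: it assumes a textbook distortion model $d(\{B_n\}) = \sum_n P_n \sigma_n^2 2^{-2B_n}$ in place of the true model-output distortion, treats $B_n$ as real-valued (rounding to integers is omitted), and uses hypothetical names and a hand-picked step size throughout.

```python
import numpy as np

def dual_ascent_bits(group_sizes, variances, target_rate, step=0.01, iters=500):
    """Toy dual ascent for bit allocation under an assumed distortion model."""
    P = np.asarray(group_sizes, dtype=np.float64)
    var = np.asarray(variances, dtype=np.float64)
    lam = 1.0  # initial dual variable
    for _ in range(iters):
        # Primal update: solve (1/P_n) * dd/dB_n = -lam for the toy model,
        # i.e. 2*ln(2)*var_n*2**(-2*B_n) = lam, then keep B_n in [0, 16].
        B = np.clip(0.5 * np.log2(2.0 * np.log(2.0) * var / lam), 0.0, 16.0)
        # Dual update: move lam along the constraint residual (per weight).
        residual = (P @ B - target_rate * P.sum()) / P.sum()
        lam = max(lam + step * residual, 1e-12)
    return B, lam

group_sizes = [100, 200, 100]
variances = [4.0, 1.0, 0.25]   # assumed per-group weight variances
B, lam = dual_ascent_bits(group_sizes, variances, target_rate=3.0)
avg_rate = float(np.dot(group_sizes, B) / np.sum(group_sizes))
```

When too many bits are spent, the residual is positive, $\lambda$ grows, and the next primal update lowers every $B_n$; when too few are spent, the opposite happens. In this toy setting the iteration settles at an average of 3 bits per weight, with higher-variance groups receiving more bits.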

The next step is to estimate the gradient $\partial d(\cdot)/\partial B_n$.

Experiment