The core challenge that this paper solves is that, to build a theoretical foundation on LLM quantization from the aspect of rate-distortion theory. Moreover, design an optimization method to handle the disadvantage in bit allocation (mixed precision) and quant error correction. At last, perform optimal post-train quantization to satisfy users’ requirement on model size and precision requirement.

RADIO Method

The RADIO models quantization problem into a constrained least square optimization.

  • Goal. Given model’s average bit rate (i.e. average number of bits of each weight), minimize the error of model’s outputs before and after quantization.
  • Design variable. The bit depth BnB_n and step size DnD_n of each weight matrix, or more fine-grained weight group.
  • Constraint. Total number of bits equals target bit rate multiplies by total number of weights.

We first model the NLP task as next-token prediction:

Z=f[Θ1,ΘN,b1,bN](X)\mathbf{Z}=f_{[\Theta_1,\dots\Theta_N,\mathbf{b}_1,\dots\mathbf{b}_N]}(\mathbf{X})

where XRL×E\mathbf{X} \in\mathbb{R}^{L\times E} denotes a sequence of LL tokens and each tokens has embedding dimension EE. And ΘmM+1,Θ(m+1)M,bmM+1,b(m+1)M\Theta_{mM}+1, \dots\Theta_{(m+1)M}, \mathbf{b}_{mM+1},\mathbf{b}_{(m+1)M} denotes mm-th transformer block.

We further define the quantizaton schema. Suppose bit depth BB and step width DD. The quantized weight is

θq(B,D)=D(clip(θD,2B1,2B11)+0.5)\theta^q(B,D)=D\cdot( \texttt{clip}(\frac{\theta}{D}, -2^{B-1}, 2^{B-1}-1) +0.5 )

Per-group quantization. Since it’s impractical to determine B,DB,D for each single weight. A practical way is to divide weights into groups and quantize each group.

Per-Group Quantization

A (B,D)(B,D) pair is used to quantize a small group weights. By experiment, “smaller group” is better set to one row of a matrix.

Bit depth assignment. Suppose the whole is divided into nn groups, and (Bn,Dn)(B_n,D_n) is used to quantize each group. How to figure out the closed form of optimal (Bn,Dn)(B_n,D_n) pair?

We may form an optimization problem for this. Suppose our target compression requires the average bits is RR, i.e., each weight has RR bits on average, and each group has PnP_n params. Then adopting the idea from rate-distortion theory that we want to minimize the information lost after quantization, that is to say

mind({Bn})=EXf[Θq(Bn,Dn)](X)f(X)2s.t.iPiBiR(iPi)=0\begin{array}{rlll} \min & d(\{ B_n \}) = \mathbb{E}_{\mathbf X} \Big\| f_{[\Theta^q(B_n,D_n)]}(\mathbf X) - f(\mathbf X) \Big\|^2 \\ \\ \text{s.t.} & \sum_i P_iB_i - R(\sum_i P_i) = 0 \end{array}

If we apply Lagrange Multiplier Method to this optimization problem, and find partial derivatives w.r.t. BiB_i, we have

1Pnd({Bn})Bn=λ\frac{1}{P_n} \frac{\partial d(\{ B_n \})}{\partial B_n} = -\lambda

So we can solve the optimization problem by alternatively update primal variables (BnB_n) and dual variable (λ\lambda), which is called Dual Ascent.

The next step is to estimate the gradient of d()/Bn\partial d(\cdot)/\partial B_n


Experiment