The core challenge that this paper solves is that, to build a theoretical foundation on LLM quantization from the aspect of rate-distortion theory. Moreover, design an optimization method to handle the disadvantage in bit allocation (mixed precision) and quant error correction. At last, perform optimal post-train quantization to satisfy users’ requirement on model size and precision requirement.
RADIO Method
The RADIO models quantization problem into a constrained least square optimization.
- Goal. Given model’s average bit rate (i.e. average number of bits of each weight), minimize the error of model’s outputs before and after quantization.
- Design variable. The bit depth and step size of each weight matrix, or more fine-grained weight group.
- Constraint. Total number of bits equals target bit rate multiplies by total number of weights.
We first model the NLP task as next-token prediction:
where denotes a sequence of tokens and each tokens has embedding dimension . And denotes -th transformer block.
We further define the quantizaton schema. Suppose bit depth and step width . The quantized weight is
Per-group quantization. Since it’s impractical to determine for each single weight. A practical way is to divide weights into groups and quantize each group.
Per-Group Quantization
A pair is used to quantize a small group weights. By experiment, “smaller group” is better set to one row of a matrix.
Bit depth assignment. Suppose the whole is divided into groups, and is used to quantize each group. How to figure out the closed form of optimal pair?
We may form an optimization problem for this. Suppose our target compression requires the average bits is , i.e., each weight has bits on average, and each group has params. Then adopting the idea from rate-distortion theory that we want to minimize the information lost after quantization, that is to say
If we apply Lagrange Multiplier Method to this optimization problem, and find partial derivatives w.r.t. , we have
So we can solve the optimization problem by alternatively update primal variables () and dual variable (), which is called Dual Ascent.
The next step is to estimate the gradient of