To start, let's quickly go through what prefill and decode are.

Prefill

During the prefill phase, the model reads and processes the entire input prompt in parallel. In detail, the input is tokenized and the KV cache is generated from it to support subsequent autoregressive token generation. The phase ends by outputting the first token.

Its main characteristic is that it is compute-intensive while using relatively little memory.

Decode

During the decode phase, the model enters the autoregressive token generation loop. In each decoding step, it accepts a single token as input, computes attention against the KV cache built so far, and finally computes a probability distribution over the vocabulary. The model then samples the next token from this distribution. The key and value vectors of each token it processes are appended to the KV cache, and the sampled token is used as the input to the next step.
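The two phases described above can be sketched with a toy single-layer, single-head attention model in NumPy. This is only an illustration under stated assumptions: the weights are random, the sizes `VOCAB` and `D` are made up, all function names are hypothetical, and the next token is picked greedily with `argmax` instead of being sampled, so that the result is deterministic.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 50, 16  # toy vocabulary size and hidden size (assumed values)

# Random toy weights: embedding, Q/K/V projections, output projection.
E = rng.normal(size=(VOCAB, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
Wo = rng.normal(size=(D, VOCAB)) / np.sqrt(D)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def prefill(prompt_ids):
    """Process the whole prompt at once; return the KV cache and the first token."""
    x = E[prompt_ids]                          # (T, D)
    q, k, v = x @ Wq, x @ Wk, x @ Wv           # all positions computed in parallel
    scores = q @ k.T / np.sqrt(D)              # (T, T)
    causal = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(causal, scores, -np.inf)  # each position sees only its past
    h = softmax(scores) @ v                    # (T, D)
    logits = h[-1] @ Wo                        # the last position predicts the next token
    return (k, v), int(np.argmax(logits))      # greedy pick (assumption, not sampling)


def decode_step(token_id, cache):
    """Process one token, append its K/V to the cache, return the next token."""
    k_cache, v_cache = cache
    x = E[token_id]                            # (D,)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache = np.vstack([k_cache, k])          # the cache grows by one row per step
    v_cache = np.vstack([v_cache, v])
    probs = softmax(q @ k_cache.T / np.sqrt(D))
    logits = (probs @ v_cache) @ Wo
    return (k_cache, v_cache), int(np.argmax(logits))


def generate(prompt_ids, max_new_tokens):
    cache, token = prefill(prompt_ids)         # one parallel pass over the prompt
    out = [token]
    for _ in range(max_new_tokens - 1):        # strictly sequential loop
        cache, token = decode_step(token, cache)
        out.append(token)
    return out, cache
```

Calling `generate([1, 2, 3], 4)` runs one prefill pass over three tokens and then three decode steps, so the returned cache holds six rows of keys and six rows of values.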

Its main characteristics are that memory usage grows as more tokens are generated, and that the process is highly sequential.
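To make the memory growth concrete, here is a rough back-of-the-envelope calculation of KV cache size. The helper name and the model shape (32 layers, 32 KV heads, head dimension 128, 2-byte fp16 elements) are hypothetical values chosen for illustration, not figures from any particular deployment.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2, batch_size=1):
    """Approximate KV cache size: one K and one V vector per token, per layer, per head."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * seq_len * batch_size


for tokens in (1, 1024, 4096):
    mib = kv_cache_bytes(tokens) / 2**20
    print(f"{tokens:>5} tokens -> {mib:8.1f} MiB")
```

Under these assumed settings each token adds 0.5 MiB to the cache, so a 4096-token sequence holds about 2 GiB. The size scales linearly with both sequence length and batch size.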

Prefill-Decode Disaggregation

Since prefill and decode are heterogeneous, it's natural to separate the two stages. We deploy each stage of the model in a different hardware resource pool to better match its compute characteristics.
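The separation can be sketched as two workers connected by a handoff queue. Everything here is a hypothetical stand-in: `Request`, `PrefillWorker`, `DecodeWorker`, and `toy_next_token` are illustrative names, the "KV cache" is a plain list of integers rather than real tensors, and the in-process `Queue` stands in for whatever mechanism a real system uses to move the cache between pools.

```python
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class Request:
    request_id: int
    prompt: list[int]
    max_new_tokens: int
    kv_cache: list[int] = field(default_factory=list)  # stand-in for real K/V tensors
    output: list[int] = field(default_factory=list)


def toy_next_token(kv_cache):
    """Hypothetical stand-in for a model forward pass over the cached context."""
    return sum(kv_cache) % 100


class PrefillWorker:
    """Would run in the pool sized for prefill."""

    def run(self, req):
        req.kv_cache = list(req.prompt)                  # build the cache for the whole prompt
        req.output.append(toy_next_token(req.kv_cache))  # emit the first token
        return req


class DecodeWorker:
    """Would run in the pool sized for decode."""

    def run(self, req):
        while len(req.output) < req.max_new_tokens:
            req.kv_cache.append(req.output[-1])          # cache grows by one entry per step
            req.output.append(toy_next_token(req.kv_cache))
        return req


def serve(requests):
    handoff = Queue()  # stands in for the KV cache transfer between the two pools
    prefill, decode = PrefillWorker(), DecodeWorker()
    for req in requests:
        handoff.put(prefill.run(req))
    done = []
    while not handoff.empty():
        done.append(decode.run(handoff.get()))
    return done
```

The point of the sketch is the boundary: the prefill worker finishes a request by producing the cache and the first token, and the decode worker needs nothing else to continue. That lets each pool be provisioned and scaled independently.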