Multi-Head Attention

The motivation for multi-head attention is to capture relationships between tokens from several different aspects at once. Each "head" computes its own attention, so different heads can focus on different kinds of relationships.

Multi-Head Self-Attention

This is simply multi-head attention applied as self-attention: the queries, keys, and values all come from the same sequence, so each head captures information between tokens of that sequence.
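
As a rough illustration, multi-head self-attention can be sketched in NumPy as below. This is a minimal sketch under stated assumptions, not a reference implementation: it assumes scaled dot-product attention inside each head, a single unbatched sequence, and no masking or bias terms. The names multi_head_self_attention, w_q, w_k, w_v, and w_o are hypothetical and chosen only for this example, and the weights are random rather than learned.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Self-attention: queries, keys, and values all come from the same x.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Each head scores every token against every other token independently.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (heads, seq, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights


# Toy usage with random (untrained) weights, only to show the shapes.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

The output has the same shape as the input, and weights holds one seq_len by seq_len attention matrix per head. Those per-head matrices are where the different "aspects" would show up once the projections are trained.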