Multi-Head Attention
The motivation for multi-head attention is to capture information between tokens from several different aspects by using multiple "heads", each of which can attend to the tokens in its own way.

Multi-Head Self-Attention
This is simply the multi-head extension of self-attention, which captures information between tokens of the same sequence.
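
As a rough illustration of the idea above, here is a minimal NumPy sketch of multi-head self-attention. It assumes the standard scaled dot-product formulation, which these notes do not spell out; the function name, the weight matrices (w_q, w_k, w_v, w_o), and the random inputs are all hypothetical and chosen only for the example. Queries, keys, and values are all computed from the same sequence x, which is what makes this self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # x: (seq_len, d_model). All names here are illustrative assumptions.
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    # Self-attention: queries, keys, and values all come from the same sequence.
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)

    # Each head computes its own token-to-token attention weights.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                  # (num_heads, seq_len, d_head)

    # Concatenate the heads and mix them with an output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Toy usage with random (untrained) weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)  # (4, 8) (2, 4, 4)
```

Each of the two heads here produces its own 4 x 4 matrix of attention weights over the same four tokens, which is one way to picture the "different aspects" mentioned earlier. The weights are random, so the output values themselves carry no meaning.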
