1. Explicit Cross GPU Memcpy + Per-Device Operation
Use DMA for P2P data transfers when the devices can access each other directly; otherwise stage the copy through host (CPU) memory ("hopping").
cudaSetDevice() selects the active GPU device; cudaDeviceCanAccessPeer() checks whether one device can directly access another device's memory.
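A minimal sketch of the pattern above (buffer names and sizes are illustrative): query peer access between device 0 and device 1, enable it if available, and copy with cudaMemcpyPeer(). Note that cudaMemcpyPeer() also works without peer access enabled — the runtime then stages the transfer through host memory, which is exactly the "hopping" fallback.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    // Can device 0 read/write device 1's memory directly (DMA over NVLink/PCIe)?
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);

    size_t bytes = 1 << 20;  // 1 MiB, illustrative
    float *src, *dst;
    cudaSetDevice(0);
    cudaMalloc(&src, bytes);   // source buffer on device 0
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);   // destination buffer on device 1

    if (canAccess) {
        // Enable direct P2P access from device 0 to device 1.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
    }
    // With P2P enabled this is a direct DMA copy;
    // otherwise the runtime hops through CPU memory.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);

    cudaSetDevice(0); cudaFree(src);
    cudaSetDevice(1); cudaFree(dst);
    return 0;
}
```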
2. Async P2P Memcpy
Use cudaEvent_t to synchronize with and monitor asynchronous GPU work.
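A sketch of the asynchronous variant (buffers `src`/`dst` are assumed to already exist on devices 0 and 1): cudaMemcpyPeerAsync() enqueues the copy on a stream, an event records its completion, and the CPU can either poll or block on that event.

```cuda
#include <cuda_runtime.h>

// Assumes: float *src on device 0, float *dst on device 1, size_t bytes.
void async_p2p_copy(float* dst, float* src, size_t bytes) {
    cudaStream_t stream;
    cudaEvent_t done;
    cudaSetDevice(0);
    cudaStreamCreate(&stream);
    cudaEventCreate(&done);

    // Enqueue the P2P copy; returns immediately to the CPU.
    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream);
    // Mark the point in the stream where the copy finishes.
    cudaEventRecord(done, stream);

    // ... CPU is free to do other work here; cudaEventQuery(done)
    // can poll for completion without blocking ...

    // Block until the copy (and everything before it in the stream) completes.
    cudaEventSynchronize(done);

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
}
```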
NCCL
We can use NCCL for both inter-node and intra-node communication.
Rank: unique process identifier for each device.
Collective Communication
AllReduce operation
AllReduce performs a reduction (e.g. a sum) over the input data of every rank and writes the final result to the output buffer of each rank, so all ranks end up with the same reduced value.
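The notes above can be sketched with NCCL's single-process, multi-device setup: ncclCommInitAll() creates one communicator (rank) per device, and the per-rank ncclAllReduce() calls are wrapped in a group so NCCL launches them together. Buffer sizes and the two-device assumption are illustrative.

```cuda
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    const int nDev = 2;            // illustrative: two GPUs in one process
    int devs[nDev] = {0, 1};
    ncclComm_t comms[nDev];
    // One communicator per device; each gets a unique rank 0..nDev-1.
    ncclCommInitAll(comms, nDev, devs);

    const size_t count = 1024;     // elements per rank, illustrative
    float* sendbuf[nDev];
    float* recvbuf[nDev];
    cudaStream_t streams[nDev];
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc(&sendbuf[i], count * sizeof(float));
        cudaMalloc(&recvbuf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // Group the per-rank calls so they are issued as one collective.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    // After synchronization, every rank's recvbuf holds the same sum.
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
    }

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaFree(sendbuf[i]);
        cudaFree(recvbuf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```

For multi-process (one rank per process, possibly across nodes), the same collective is used, but the communicator is built with ncclGetUniqueId() plus ncclCommInitRank() instead of ncclCommInitAll().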