1. Explicit Cross-GPU Memcpy + Per-Device Operation

If peer-to-peer (P2P) access is supported, data is transferred directly between GPUs via DMA; otherwise the copy must hop through CPU (host) memory.

  • cudaSetDevice(): select the current GPU device
  • cudaDeviceCanAccessPeer(): check whether one GPU can directly access another GPU's memory

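A minimal sketch of the pattern above, assuming two GPUs (devices 0 and 1) and a hypothetical 1 MB buffer; error checking is omitted for brevity:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    // Can GPU 0 directly access GPU 1's memory (P2P over NVLink/PCIe DMA)?
    cudaDeviceCanAccessPeer(&canAccess, /*device=*/0, /*peerDevice=*/1);

    const size_t bytes = 1 << 20;
    float *src, *dst;
    cudaSetDevice(0);                     // make GPU 0 current
    cudaMalloc(&src, bytes);              // source buffer on GPU 0
    cudaSetDevice(1);                     // make GPU 1 current
    cudaMalloc(&dst, bytes);              // destination buffer on GPU 1

    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0); // enable direct access to GPU 1
        // Direct GPU0 -> GPU1 copy; the driver uses DMA, no host staging.
        cudaMemcpyPeer(dst, 1, src, 0, bytes);
    } else {
        // No P2P path: hop through host memory instead.
        float* host = (float*)malloc(bytes);
        cudaSetDevice(0);
        cudaMemcpy(host, src, bytes, cudaMemcpyDeviceToHost);
        cudaSetDevice(1);
        cudaMemcpy(dst, host, bytes, cudaMemcpyHostToDevice);
        free(host);
    }
    printf("copy done (P2P=%d)\n", canAccess);
    return 0;
}
```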
2. Async P2P Memcpy

Use cudaEvent_t to manage and monitor asynchronous copies: record an event on the stream after the copy, then query or wait on it from the host.
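A sketch of the async variant, reusing the src/dst buffers from the previous section; cudaMemcpyPeerAsync enqueues the copy on a stream and returns immediately, and the event tracks its completion:

```cuda
// Assumes: float *src on GPU 0, float *dst on GPU 1, size_t bytes.
cudaStream_t stream;
cudaEvent_t done;
cudaSetDevice(0);
cudaStreamCreate(&stream);
cudaEventCreate(&done);

// Enqueue the GPU0 -> GPU1 copy; the host thread does not block.
cudaMemcpyPeerAsync(dst, /*dstDevice=*/1, src, /*srcDevice=*/0, bytes, stream);
cudaEventRecord(done, stream);       // mark the completion point on the stream

// ... overlap other host work here ...

// Poll without blocking, or block until the copy has finished.
if (cudaEventQuery(done) != cudaSuccess) {
    cudaEventSynchronize(done);
}
cudaEventDestroy(done);
cudaStreamDestroy(stream);
```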

NCCL

We can use NCCL for intra-node or inter-node communication.

Rank: a unique identifier assigned to each device (process) participating in a communicator.
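In the single-process, multi-GPU case, ncclCommInitAll creates one communicator per device and assigns each its rank; a minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    // One communicator per local GPU; rank i is assigned to devs[i].
    ncclComm_t* comms = new ncclComm_t[nDev];
    int* devs = new int[nDev];
    for (int i = 0; i < nDev; ++i) devs[i] = i;
    ncclCommInitAll(comms, nDev, devs);

    for (int i = 0; i < nDev; ++i) {
        int rank, count;
        ncclCommUserRank(comms[i], &rank);  // this communicator's rank
        ncclCommCount(comms[i], &count);    // total ranks in the communicator
        printf("device %d -> rank %d of %d\n", devs[i], rank, count);
        ncclCommDestroy(comms[i]);
    }
    delete[] comms;
    delete[] devs;
    return 0;
}
```

For multi-node jobs, each process instead calls ncclCommInitRank with a shared ncclUniqueId (typically broadcast via MPI or a bootstrap service).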

Collective Communication

AllReduce operation

AllReduce performs a reduction (e.g., sum) over the input data of every rank and writes the identical final result to the output buffer of each rank.
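A sketch of a sum AllReduce from a single process driving all local GPUs, assuming the communicators from the initialization above plus per-device send/recv buffers and streams; the group calls let one thread issue the collective for every rank:

```cuda
// Assumes: ncclComm_t comms[nDev]; int devs[nDev];
// float* sendbuf[nDev]; float* recvbuf[nDev]; cudaStream_t streams[nDev];
// size_t count = number of floats per rank.
ncclGroupStart();
for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(devs[i]);
    // Elementwise sum across all ranks; result lands in every recvbuf.
    ncclAllReduce(sendbuf[i], recvbuf[i], count,
                  ncclFloat, ncclSum, comms[i], streams[i]);
}
ncclGroupEnd();

// Wait for the collective to complete on every device.
for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
}
// Now recvbuf[i][k] == sum over all ranks j of sendbuf[j][k], for every i.
```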