1. Explicit Cross GPU Memcpy + Per-Device Operation
Use DMA for P2P data transfers when the devices can access each other directly; otherwise stage the copy through host (CPU) memory ("hopping").
cudaSetDevice() selects the active GPU device; cudaDeviceCanAccessPeer() checks whether one device can directly access another device's memory.
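A minimal sketch of the pattern above (buffer names and sizes are illustrative): query peer access between device 0 and device 1, enable it if available, and copy with cudaMemcpyPeer(). Note that cudaMemcpyPeer() also works without peer access enabled — the runtime then stages the transfer through host memory, which is exactly the "hopping" fallback.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    // Can device 0 read/write device 1's memory directly (DMA over NVLink/PCIe)?
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);

    size_t bytes = 1 << 20;  // 1 MiB, illustrative
    float *src, *dst;
    cudaSetDevice(0);
    cudaMalloc(&src, bytes);   // source buffer on device 0
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);   // destination buffer on device 1

    if (canAccess) {
        // Enable direct P2P access from device 0 to device 1.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
    }
    // With P2P enabled this is a direct DMA copy;
    // otherwise the runtime hops through CPU memory.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);

    cudaSetDevice(0); cudaFree(src);
    cudaSetDevice(1); cudaFree(dst);
    return 0;
}
```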
2. Async P2P Memcpy
Use cudaEvent_t to synchronize with and monitor asynchronous GPU work.
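A sketch of the asynchronous variant (buffers `src`/`dst` are assumed to already exist on devices 0 and 1): cudaMemcpyPeerAsync() enqueues the copy on a stream, an event records its completion, and the CPU can either poll or block on that event.

```cuda
#include <cuda_runtime.h>

// Assumes: float *src on device 0, float *dst on device 1, size_t bytes.
void async_p2p_copy(float* dst, float* src, size_t bytes) {
    cudaStream_t stream;
    cudaEvent_t done;
    cudaSetDevice(0);
    cudaStreamCreate(&stream);
    cudaEventCreate(&done);

    // Enqueue the P2P copy; returns immediately to the CPU.
    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream);
    // Mark the point in the stream where the copy finishes.
    cudaEventRecord(done, stream);

    // ... CPU is free to do other work here; cudaEventQuery(done)
    // can poll for completion without blocking ...

    // Block until the copy (and everything before it in the stream) completes.
    cudaEventSynchronize(done);

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
}
```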
NCCL
We can use NCCL for both inter-node and intra-node communication.
Rank: unique process identifier for each device.
Collective Communication
AllReduce operation
AllReduce performs a reduction (e.g. a sum) over the input data of every rank and writes the final result to the output buffer of each rank, so all ranks end up with the same reduced value.
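The notes above can be sketched with NCCL's single-process, multi-device setup: ncclCommInitAll() creates one communicator (rank) per device, and the per-rank ncclAllReduce() calls are wrapped in a group so NCCL launches them together. Buffer sizes and the two-device assumption are illustrative.

```cuda
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    const int nDev = 2;            // illustrative: two GPUs in one process
    int devs[nDev] = {0, 1};
    ncclComm_t comms[nDev];
    // One communicator per device; each gets a unique rank 0..nDev-1.
    ncclCommInitAll(comms, nDev, devs);

    const size_t count = 1024;     // elements per rank, illustrative
    float* sendbuf[nDev];
    float* recvbuf[nDev];
    cudaStream_t streams[nDev];
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc(&sendbuf[i], count * sizeof(float));
        cudaMalloc(&recvbuf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // Group the per-rank calls so they are issued as one collective.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    // After synchronization, every rank's recvbuf holds the same sum.
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
    }

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaFree(sendbuf[i]);
        cudaFree(recvbuf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```

For multi-process (one rank per process, possibly across nodes), the same collective is used, but the communicator is built with ncclGetUniqueId() plus ncclCommInitRank() instead of ncclCommInitAll().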