Async Memory Move

cudaMemcpy() and cudaMalloc() block the host (cudaMemcpy() synchronizes with the device, and cudaMalloc() may implicitly synchronize too). To make these operations asynchronous, use cudaMallocAsync(), cudaMemcpyAsync(), and cudaFreeAsync(), which are ordered within streams.
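
A minimal sketch of the stream-ordered pattern (assumes CUDA 11.2+ for cudaMallocAsync()/cudaFreeAsync(); error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t size = 1 << 20;
    float *h_data = (float*)malloc(size);
    float *d_data = nullptr;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMallocAsync(&d_data, size, stream);  // allocation is ordered in the stream
    cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream);
    // ... launch kernels on the same stream here ...
    cudaFreeAsync(d_data, stream);           // free is ordered after prior work

    cudaStreamSynchronize(stream);           // host waits once, not per call
    cudaStreamDestroy(stream);
    free(h_data);
    return 0;
}
```

None of the calls between stream creation and synchronization block the host; they are simply enqueued on the stream in order.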

Pageable Memory vs. Pinned Memory

Pageable memory has higher transfer latency than pinned memory: the driver must first stage pageable data through an internal pinned buffer, whereas pinned (page-locked) memory cannot be swapped out, so the GPU can access it directly with DMA and no extra copy. Pinned memory is a scarcer resource, since it consumes physical RAM the OS can no longer page out. When the GPU accesses mapped pinned memory directly, without any copy at all, this is called the zero-copy technique.

To allocate pinned memory, use cudaMallocHost(), and free it with cudaFreeHost(). Async memory copies (cudaMemcpyAsync()) only truly overlap with computation when the host buffer is pinned.
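
A short sketch of a pinned host buffer combined with an async copy (error checking omitted):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t size = 1 << 20;
    float *h_pinned = nullptr, *d_data = nullptr;

    cudaMallocHost(&h_pinned, size);   // pinned (page-locked) host allocation
    cudaMalloc(&d_data, size);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // Because h_pinned is page-locked, this copy can overlap with
    // kernels or copies running in other streams.
    cudaMemcpyAsync(d_data, h_pinned, size, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFree(d_data);
    cudaFreeHost(h_pinned);            // pinned memory is freed with cudaFreeHost
    cudaStreamDestroy(stream);
    return 0;
}
```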

In order to use pinned memory data directly from the GPU (zero-copy):

  1. First allocate with cudaHostAlloc(&pmem_address, size, cudaHostAllocMapped); the cudaHostAllocMapped flag maps the pinned memory into the device address space.
  2. Then retrieve the device-side pointer with cudaHostGetDevicePointer(&device_address, host_address, 0).
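
The two steps above can be sketched as follows (the scale kernel is a hypothetical example; the GPU reads and writes the host buffer over the interconnect, with no cudaMemcpy):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;   // directly touches mapped host memory
}

int main() {
    const int n = 1024;
    float *h_mapped = nullptr, *d_ptr = nullptr;

    // 1. Allocate pinned memory mapped into the device address space.
    cudaHostAlloc(&h_mapped, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_mapped[i] = 1.0f;

    // 2. Get the device-side pointer aliasing the same physical memory.
    cudaHostGetDevicePointer(&d_ptr, h_mapped, 0);

    scale<<<(n + 255) / 256, 256>>>(d_ptr, n);
    cudaDeviceSynchronize();   // h_mapped now holds the results; no copy-back

    cudaFreeHost(h_mapped);
    return 0;
}
```

Zero-copy access goes over the interconnect on every load/store, so it suits data that is touched once or sparsely, not data reused heavily by kernels.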

Auto Manage Memory

Unified Memory: CPU and GPU memory appear as a single address space. In CUDA it's called managed memory (CUDA manages data migration automatically).

// h_data is a single pointer usable from both host and device
float *h_data;
cudaMallocManaged(&h_data, size); // one allocation shared by CPU and GPU
cudaFree(h_data);

cudaMallocManaged() relies on page faults to migrate data on demand between host and device.
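
A sketch of the on-demand migration: the same pointer is touched by CPU and GPU in turn, and pages fault across on first access instead of being moved with an explicit cudaMemcpy (the increment kernel is a hypothetical example):

```cuda
#include <cuda_runtime.h>

__global__ void increment(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;   // first GPU touch faults pages onto the device
}

int main() {
    const int n = 1024;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 0.0f;   // CPU touch: pages reside on host

    increment<<<(n + 255) / 256, 256>>>(data, n); // GPU touch: pages migrate over
    cudaDeviceSynchronize();

    float first = data[0];                        // CPU touch: pages fault back
    cudaFree(data);
    return (first == 1.0f) ? 0 : 1;
}
```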

Speed-Up Managed Memory

Managed memory is useful since pinned memory is a limited resource. Besides, CUDA managed memory provides some hints to speed things up:

  • cudaMemAdvise() gives CUDA heuristics about how a memory range will be accessed (e.g. cudaMemAdviseSetReadMostly, cudaMemAdviseSetPreferredLocation), helping it choose a placement and migration strategy.
  • cudaMemPrefetchAsync() asynchronously prefetches a managed range to a destination device (or back to the CPU with cudaCpuDeviceId), avoiding page faults on first access.
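
The two hints above can be sketched together like this (assumes a GPU that supports concurrent managed access, e.g. Pascal or newer on Linux; the increment kernel is a hypothetical example):

```cuda
#include <cuda_runtime.h>

__global__ void increment(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t size = n * sizeof(float);
    const int device = 0;
    float *data = nullptr;

    cudaMallocManaged(&data, size);
    for (int i = 0; i < n; ++i) data[i] = 0.0f;

    // Hint: prefer to keep these pages resident on this GPU.
    cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, device);

    // Move pages to the GPU up front instead of paying page faults in the kernel.
    cudaMemPrefetchAsync(data, size, device, 0);
    increment<<<(n + 255) / 256, 256>>>(data, n);

    // Prefetch back to the CPU before host access.
    cudaMemPrefetchAsync(data, size, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```

With the prefetches in place, the kernel and the host loop both find their pages already resident, so neither pays fault-driven migration costs.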