Async Memory Move
cudaMemcpy() and cudaMalloc() are synchronous: they block the host until the operation completes. To overlap memory operations with other work, use the asynchronous variants cudaMallocAsync(), cudaMemcpyAsync(), and cudaFreeAsync(), which are enqueued on CUDA streams.
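As a minimal sketch, the async variants might be combined like this (the buffer size is an arbitrary example and error checks are elided; cudaMallocAsync()/cudaFreeAsync() require CUDA 11.2 or newer):

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t size = 1 << 20;              // 1 MiB, arbitrary example size
    float *h_data = (float *)malloc(size);
    float *d_data = nullptr;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Allocation, copy, and free are all enqueued on the stream;
    // each call returns to the host immediately.
    cudaMallocAsync(&d_data, size, stream);
    cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream);
    cudaFreeAsync(d_data, stream);

    // Block until all queued work on the stream has finished.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    free(h_data);
    return 0;
}
```

Note that cudaMemcpyAsync() from ordinary (pageable) host memory is not truly asynchronous; that limitation motivates the pinned memory discussed next.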
Pageable Memory vs. Pinned Memory
Copies from pageable memory have greater latency than copies from pinned memory, because the driver must first stage pageable data into a temporary pinned buffer. Pinned memory is page-locked (it cannot be swapped out), so the GPU can read it directly via DMA without that extra staging copy. Accessing mapped pinned memory directly from a kernel is called the zero-copy technique.
To allocate and free pinned memory, use cudaMallocHost() and cudaFreeHost(). Async memory-copy calls on pinned memory are truly asynchronous.
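A short sketch of the pinned-memory workflow (example size, error checks elided):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t size = 1 << 20;   // arbitrary example size
    float *h_pinned = nullptr;
    float *d_data = nullptr;

    // Page-locked host allocation: eligible for true async DMA transfers.
    cudaMallocHost(&h_pinned, size);
    cudaMalloc(&d_data, size);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // With pinned memory, this copy can overlap with host-side work.
    cudaMemcpyAsync(d_data, h_pinned, size, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFree(d_data);
    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
    return 0;
}
```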
To use pinned-memory data directly from the GPU:
- first, allocate with cudaHostAlloc(&pmem_address, size, cudaHostAllocMapped); the cudaHostAllocMapped flag maps the pinned memory into the GPU's address space.
- then, retrieve the device-side pointer with cudaHostGetDevicePointer(&device_address, host_address, 0).
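The two steps above can be sketched as follows (the kernel and sizes are illustrative only):

```cpp
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Accesses go directly to host memory over the bus: zero-copy.
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float *h_mapped = nullptr, *d_mapped = nullptr;

    // Step 1: pinned allocation mapped into the device address space.
    cudaHostAlloc(&h_mapped, n * sizeof(float), cudaHostAllocMapped);
    // Step 2: retrieve the device-side alias of the same buffer.
    cudaHostGetDevicePointer(&d_mapped, h_mapped, 0);

    for (int i = 0; i < n; ++i) h_mapped[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(d_mapped, n);
    cudaDeviceSynchronize();   // after this, h_mapped holds the results

    cudaFreeHost(h_mapped);
    return 0;
}
```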
Auto Manage Memory
Unified Memory: CPU and GPU memory appear as a single address space. In CUDA this is called managed memory, because CUDA manages data migration for us automatically.
```cpp
// h_data is a host-side pointer, but the same address
// is also valid on the device.
float *h_data = nullptr;
cudaMallocManaged(&h_data, size);
```
cudaMallocManaged() relies on page faults to migrate data between host and device on demand.
Speed-Up Managed Memory
Managed memory is useful because pinned memory is a limited resource (page-locked RAM cannot be swapped out). Besides, CUDA managed memory provides some tricks to speed things up.
cudaMemAdvise() gives CUDA heuristics, hints about how the data will be accessed, so it can choose a placement and migration strategy. cudaMemPrefetchAsync() makes CUDA prefetch data asynchronously to a target device before it is needed.
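A sketch combining both calls (the advice flags chosen here are just examples; which hints help depends on the access pattern):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t size = 1 << 20;   // arbitrary example size
    float *data = nullptr;
    int device = 0;
    cudaGetDevice(&device);

    cudaMallocManaged(&data, size);

    // Hint: the data will mostly be read, so read-only copies
    // can be kept on each processor that touches it.
    cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, device);
    // Hint: prefer keeping the pages resident on this GPU.
    cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, device);

    // Migrate the pages to the GPU ahead of time, instead of
    // paying the page-fault cost on first access inside a kernel.
    cudaMemPrefetchAsync(data, size, device, 0 /* default stream */);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```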