A CUDA kernel launch returns immediately and the host continues executing; that is, device and host execution are asynchronous.
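To make the asynchrony concrete, here is a minimal sketch. The kernel `slow_kernel`, the helper `launch_and_sync`, and the spin length are hypothetical names and values chosen for illustration. The host prints its message right after the launch, and only `cudaDeviceSynchronize()` makes it wait for the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: spins for roughly `cycles` clock ticks, then sets a flag.
__global__ void slow_kernel(int *flag, long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
    *flag = 1;
}

// Returns the flag value read back after synchronization (1 on success),
// or -1 if a CUDA call failed.
int launch_and_sync() {
    int *d_flag = nullptr;
    if (cudaMalloc(&d_flag, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_flag, 0, sizeof(int));

    slow_kernel<<<1, 1>>>(d_flag, 1000000LL);
    // The launch only enqueues work; control is back on the host here,
    // usually long before the kernel has finished.
    printf("host: kernel launched, continuing\n");

    // Block the host until all previously issued device work is done.
    cudaError_t err = cudaDeviceSynchronize();

    int h_flag = 0;
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_flag);
    return err == cudaSuccess ? h_flag : -1;
}
```

Calling `launch_and_sync()` from `main` should return 1 on a working device. Without the explicit synchronization, the host has no guarantee that the kernel has completed.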

Async Pipelining in CUDA

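Async pipelining is built on the memcpy_async API and cuda::pipeline. What does memcpy_async do? It issues a copy (for example, from global to shared memory) without blocking, so the copy can overlap with computation. Combined with cuda::pipeline, this gives a producer-consumer model: producers copy data and consumers compute on it.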

   (create pipeline)
-> pipeline.producer_acquire()
-> (submit memcpy_async copies)
-> pipeline.producer_commit()
-> pipeline.consumer_wait()
-> (compute on the copied data)
-> pipeline.consumer_release()
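The flow above can be sketched as a single-stage pipeline shared by one thread block. This is an illustrative sketch rather than a tuned implementation: the kernel `scale_by_two`, the helper `run_pipeline_demo`, and the constants `kThreads` and `kBatches` are hypothetical. It assumes a toolkit that ships the `<cuda/pipeline>` header and a device of compute capability 7.0 or higher.

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

constexpr int kThreads = 128;  // hypothetical block size
constexpr int kBatches = 4;    // hypothetical number of batches per block

__global__ void scale_by_two(const int *in, int *out) {
    __shared__ int tile[kThreads];
    auto block = cg::this_thread_block();

    // (create pipeline): one stage, shared by every thread in the block.
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 1> state;
    auto pipe = cuda::make_pipeline(block, &state);

    for (int b = 0; b < kBatches; ++b) {
        const int base = (blockIdx.x * kBatches + b) * kThreads;

        pipe.producer_acquire();                      // claim the stage
        cuda::memcpy_async(block, tile, in + base,    // submit the async copy
                           sizeof(int) * kThreads, pipe);
        pipe.producer_commit();                       // hand the batch over

        pipe.consumer_wait();                         // wait for the copy to land
        out[base + threadIdx.x] = 2 * tile[threadIdx.x];  // compute
        pipe.consumer_release();                      // free the stage for reuse
    }
}

// Returns the number of mismatching elements, or -1 if a CUDA call failed.
int run_pipeline_demo() {
    const int blocks = 2;
    const int n = blocks * kBatches * kThreads;
    std::vector<int> h_in(n), h_out(n, 0);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    scale_by_two<<<blocks, kThreads>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i) if (h_out[i] != 2 * h_in[i]) ++bad;
    return bad;
}
```

With only one stage, copy and compute do not overlap inside a block. The sketch shows just the call order. Real overlap needs two or more stages, so the producer can fill the next stage while the consumer works on the current one.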

Besides cuda::pipeline, there are also the pipeline primitives, a C-style API.

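  1. __pipeline_memcpy_async(target, source, size): copies from GMem to SMem; size must be 4, 8, or 16 bytes
  2. __pipeline_commit(): commits the copies issued so far as one batch
  3. __pipeline_wait_prior(N): waits until all but the most recent N committed batches have completed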
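The three primitives can be combined as in the following sketch. The kernel `copy_via_primitives`, the helper `run_primitives_demo`, and the constant `kBlock` are hypothetical names for illustration. Each thread stages one 4-byte float into shared memory and then reads back only its own element, so no block-wide barrier is needed here.

```cuda
#include <cuda_pipeline.h>
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 128;  // hypothetical block size

__global__ void copy_via_primitives(const float *in, float *out, int n) {
    __shared__ float tile[kBlock];
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        // Async copy of one float (4 bytes) from global to shared memory.
        __pipeline_memcpy_async(&tile[threadIdx.x], &in[i], sizeof(float));
    }
    __pipeline_commit();       // close the current batch of copies
    __pipeline_wait_prior(0);  // wait until no committed batch is pending

    if (i < n) out[i] = tile[threadIdx.x] + 1.0f;
}

// Returns the number of mismatching elements, or -1 if a CUDA call failed.
int run_primitives_demo() {
    const int n = 1000;
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int blocks = (n + kBlock - 1) / kBlock;
    copy_via_primitives<<<blocks, kBlock>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i) if (h_out[i] != h_in[i] + 1.0f) ++bad;
    return bad;
}
```

If threads read elements copied by other threads, a `__syncthreads()` after the wait would be needed, since `__pipeline_wait_prior` only covers the calling thread's own copies.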

Long Scoreboard Stalls -> Warps are waiting on the result of a long-latency operation, typically a memory access such as a global memory load