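A stream is like a FIFO queue: operations issued to the same stream execute sequentially, in the order they were issued, while operations in different streams may run concurrently. By default, CUDA launches kernels into the default stream (stream 0, or nullptr). Alternatively, we can launch kernels into named streams that we create explicitly.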
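As a minimal sketch of the FIFO behavior of the default stream, the CUDA code below launches two dependent kernels without creating any stream, once with no stream argument and once with an explicit nullptr. The names fill, add_one and default_stream_fifo_demo are hypothetical, and error handling is kept to a minimum.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Writes data[i] = i.
__global__ void fill(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Adds 1 to every element; only correct if fill() has already finished.
__global__ void add_one(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns true if every element equals i + 1 after both kernels
// ran back to back in the default stream.
bool default_stream_fifo_demo(int n) {
    int* d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return false;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fill<<<blocks, threads>>>(d_data, n);                 // no stream argument: default stream
    add_one<<<blocks, threads, 0, nullptr>>>(d_data, n);  // explicit nullptr (same as 0): default stream

    // cudaMemcpy is also issued to the default stream, so it waits for both kernels.
    std::vector<int> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h[i] != i + 1) return false;
    return true;
}
```

Called from main and compiled with nvcc, default_stream_fifo_demo(1 << 20) should return true, because add_one cannot start before fill has finished in the same stream.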

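  • cudaStream_t: the handle type for a stream
  • cudaStreamCreate(&stream): explicitly create a named stream
  • cudaStreamDestroy(stream): destroy a stream; if work is still pending in it, the call returns immediately and its resources are released once that work completes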
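The calls listed above fit together roughly as follows. This is a sketch rather than a reference implementation: scale and named_stream_demo are hypothetical names, and pinned host memory (cudaMallocHost) is used on the assumption that we want the copies to be genuinely asynchronous. The fourth launch parameter is what places a kernel in a named stream.

```cuda
#include <cuda_runtime.h>

// Multiplies every element by factor.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Creates a named stream, queues copy -> kernel -> kernel -> copy in it,
// waits for the stream, then destroys it. Returns true if every element is 6.
bool named_stream_demo(int n) {
    size_t bytes = n * sizeof(float);
    cudaStream_t stream;                                    // stream handle
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;

    float* d = nullptr;
    float* h = nullptr;
    // Pinned host memory lets cudaMemcpyAsync run asynchronously to the host.
    if (cudaMalloc(&d, bytes) != cudaSuccess || cudaMallocHost(&h, bytes) != cudaSuccess) {
        cudaFree(d);
        cudaStreamDestroy(stream);
        return false;
    }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    // Everything below goes into `stream` and runs in issue order.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<blocks, threads, 0, stream>>>(d, 2.0f, n);     // 4th launch parameter = stream
    scale<<<blocks, threads, 0, stream>>>(d, 3.0f, n);     // starts after the first scale
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

    // Block the host until all work queued in `stream` has finished.
    bool ok = cudaStreamSynchronize(stream) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 6.0f);

    cudaFreeHost(h);
    cudaFree(d);
    cudaStreamDestroy(stream);
    return ok;
}
```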

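CAUTION! By default, the default stream and named streams are implicitly synchronized: an operation in the default stream waits for all previously issued work in named streams to finish, and work issued to a named stream afterwards waits for that default-stream operation to finish.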
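To see what this implicit synchronization means in practice, the sketch below chains three single-thread kernels across two named streams and the default stream. It assumes nvcc's default legacy default stream (the behavior differs under --default-stream per-thread), and write_one, plus_one and implicit_sync_demo are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Single-thread kernels that form a dependency chain across streams.
__global__ void write_one(int* x) { *x = 1; }
__global__ void plus_one(const int* src, int* dst) { *dst = *src + 1; }

// ASSUMPTION: compiled with the legacy default stream (nvcc's default, not
// --default-stream per-thread). Returns z, which is 3 only because the default
// stream implicitly synchronizes with the blocking named streams s1 and s2.
int implicit_sync_demo() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);                        // no flags: blocking streams
    cudaStreamCreate(&s2);

    int z = -1;
    int* d = nullptr;                             // d[0] = x, d[1] = y, d[2] = z
    if (cudaMalloc(&d, 3 * sizeof(int)) == cudaSuccess) {
        cudaMemset(d, 0, 3 * sizeof(int));
        write_one<<<1, 1, 0, s1>>>(d);            // x = 1 in s1
        plus_one<<<1, 1>>>(d, d + 1);             // y = x + 1 in the default stream:
                                                  // implicitly waits for s1's kernel
        plus_one<<<1, 1, 0, s2>>>(d + 1, d + 2);  // z = y + 1 in s2:
                                                  // implicitly waits for the default-stream kernel
        // cudaMemcpy uses the default stream, so it also waits for s2's kernel.
        cudaMemcpy(&z, d + 2, sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d);
    }
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return z;
}
```

If s1 and s2 were created with cudaStreamNonBlocking instead, the three kernels and the final copy would no longer be ordered, and the value read back for z would be undefined.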

If we want a named stream not to synchronize with the default stream, we only need to pass the cudaStreamNonBlocking flag when creating it:

cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
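A minimal sketch of a non-blocking stream follows, using the hypothetical names set_value and non_blocking_demo. Because the default stream no longer waits for such a stream (and vice versa), the host has to synchronize explicitly before reading results; here cudaDeviceSynchronize is used for simplicity.

```cuda
#include <cuda_runtime.h>

__global__ void set_value(int* p, int v) { *p = v; }

// Returns true if both writes landed. The stream is created non-blocking,
// so its work is not ordered against the default stream in either direction.
bool non_blocking_demo() {
    cudaStream_t s;
    if (cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking) != cudaSuccess) return false;

    int h[2] = {0, 0};
    int* d = nullptr;
    if (cudaMalloc(&d, sizeof(h)) == cudaSuccess) {
        set_value<<<1, 1, 0, s>>>(d, 10);   // non-blocking stream
        set_value<<<1, 1>>>(d + 1, 20);     // default stream: does NOT wait for s, may overlap
        // The two writes are independent, so overlap is safe. Since the default
        // stream no longer waits for s, wait for the whole device before reading.
        cudaDeviceSynchronize();
        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        cudaFree(d);
    }
    cudaStreamDestroy(s);
    return h[0] == 10 && h[1] == 20;
}
```

If the default-stream kernel read d[0] instead, it would need an explicit cudaStreamSynchronize(s) before its launch, since the implicit ordering is gone.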