A stream is like a FIFO queue: tasks in the same stream execute sequentially, in the order they were issued. By default, CUDA launches kernels in the default stream (0 or nullptr). Alternatively, we can launch kernels in explicitly created (named) streams.
cudaStream_t: the stream handle type
cudaStreamCreate(&stream): explicitly create a named stream
cudaStreamDestroy(stream): destroy the stream and release its resources
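As a minimal sketch of the calls above, the following creates a named stream, queues two kernels and a copy in it, and destroys it. The kernel names `fill` and `scale` and the helper `run_in_named_stream` are made up for illustration, and error checking is omitted for brevity. Because all three operations sit in the same stream, they run in issue order.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels, used only to illustrate in-stream ordering.
__global__ void fill(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

__global__ void scale(int *data, int n, int factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the last element after fill(3) followed by scale(2).
int run_in_named_stream() {
    const int n = 256;
    int host[n];
    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The 4th launch parameter selects the stream (the 3rd is dynamic shared memory).
    fill<<<1, n, 0, stream>>>(dev, n, 3);
    scale<<<1, n, 0, stream>>>(dev, n, 2);   // starts only after fill has finished
    cudaMemcpyAsync(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost, stream);

    // Wait until everything queued in this stream has completed.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(dev);
    return host[n - 1];
}
```

Calling `run_in_named_stream()` from `main` should yield 6, since `scale` can only observe the values written by `fill`.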
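CAUTION! By default, the default stream and named streams are implicitly synchronized: work issued to the default stream waits for previously issued work in named streams, and work in named streams waits for previously issued work in the default stream.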
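The implicit synchronization can be observed with a small experiment. This is a sketch under two assumptions: the program is compiled with the legacy default stream (nvcc's default, i.e. without `--default-stream per-thread`), and the names `set_value`, `add_one`, and `implicit_sync_demo` are hypothetical. The second kernel runs in the default stream and reads what the first kernel wrote in a named stream, with no explicit synchronization in between.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels for illustration.
__global__ void set_value(int *x, int v) { *x = v; }
__global__ void add_one(const int *x, int *y) { *y = *x + 1; }

// Assumes the legacy default stream (the nvcc default compile mode).
int implicit_sync_demo() {
    int *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, sizeof(int));
    cudaMalloc(&d_y, sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);               // an ordinary (blocking) named stream

    set_value<<<1, 1, 0, stream>>>(d_x, 41); // named stream
    add_one<<<1, 1>>>(d_x, d_y);             // default stream: waits for set_value

    int h_y = 0;
    // A plain cudaMemcpy runs in the default stream and blocks the host.
    cudaMemcpy(&h_y, d_y, sizeof(int), cudaMemcpyDeviceToHost);

    cudaStreamDestroy(stream);
    cudaFree(d_x);
    cudaFree(d_y);
    return h_y;
}
```

Under those assumptions the function returns 42, because the default-stream kernel does not start until the named-stream kernel has finished.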
If we want a named stream not to synchronize implicitly with the default stream, we only need to pass a flag when creating it:
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
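A short sketch of how the flag might be used and verified: the helper name `make_non_blocking_stream` is hypothetical, while `cudaStreamGetFlags` is the runtime call that reports the flags a stream was created with. Note that a non-blocking stream no longer waits for the default stream, so any dependency between the two has to be expressed explicitly, for example with `cudaStreamSynchronize` or events.

```cuda
#include <cuda_runtime.h>

// Creates a non-blocking stream and reports whether the flag is set on it.
bool make_non_blocking_stream() {
    cudaStream_t stream;
    if (cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) != cudaSuccess) {
        return false;
    }

    // Query the flags back from the runtime.
    unsigned int flags = 0;
    cudaError_t err = cudaStreamGetFlags(stream, &flags);

    // Nothing was queued, so this returns immediately; with real work,
    // this is where the host would wait for the stream explicitly.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    return err == cudaSuccess && flags == cudaStreamNonBlocking;
}
```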