In GPU programming, a “transaction” is the actual hardware memory operation issued to a memory subsystem (e.g., global memory, shared memory, L2, DRAM) to service requests from threads. Threads request memory accesses, but the hardware groups, splits, and aligns them into transactions that match the memory system’s granularity.
For NVIDIA GPUs, memory is accessed in fixed-size segments, such as 32, 64, or 128 bytes. This means that even if we only need 1 byte of data, the hardware will still fetch an entire segment from memory. A warp's memory accesses are broken into one or more transactions depending on alignment and access patterns.
This introduces an important topic in kernel optimization: memory alignment. The starting address of an access should be aligned to a multiple of the transaction size, so that the number of memory transactions is minimized.
This also introduces another important technique: memory coalescing. If we write our code appropriately, multiple memory accesses from the same warp can be serviced with fewer transactions. In other words, if we can achieve the same effect with fewer memory reads, the program will perform better. The conditions for memory coalescing are:
- threads in the warp should access contiguous addresses;
- the starting address should be aligned to the transaction size.
An example
For example, consider the following code:
```cuda
float x = global[threadIdx.x];
```
In the above example, the 32 threads in a warp access contiguous addresses, and the addresses are well aligned. Thus only 1 to 4 transactions (depending on the hardware) are needed to serve all 32 threads.