MapReduce Architecture
Classic distributed architecture for Big Data.
1. Explicit Cross-GPU Memcpy + Per-Device Operation: use DMA for P2P data transfer; otherwise, hop through the CPU. cudaS...
Async Memory Move: cudaMemcpy() and cudaMalloc() are synchronous calls. To make them async, use cudaMallocAsync(), cudaMem...
A stream is like a FIFO queue: tasks in the same stream are executed sequentially. By default, CUDA executes kernels in def...
After a CUDA kernel is launched, the call returns immediately and the host continues executing, i.e., device and host execution are asynchronous....
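What makes parallelism possible? The compute units and the memory-access units are independent hardware. Double Buffering: allocate two identical SMem buffers and use warp specialization, where different warps take on different tasks, e.g. the first half of the warps compute while the second half handle loads and stores. Late...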
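When tuning operator performance, the block size can also be tuned, as can the choice of kernel implementation. cuDNN cuBLAS, cuBLASXt, cuBLASLt CUB: a low-level library of parallel algorithm primitives that provides efficient thread-level, warp-level...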
Multi-Process Programming in Rust
Going beyond concurrent programming, this article dives deeper into the OS to discuss concurrency-related issues.
Learn a bit of Rust through exercises, covering basic thread creation, use of the Arc pointer, and use of mpsc.
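
As a rough illustration of how those three pieces fit together, here is a minimal sketch using only the standard library. The function name `parallel_sum` and the chunking scheme are assumptions made for this example, not taken from the exercises themselves: each worker thread shares the input through an `Arc`, sums its own slice, and reports the partial result over an `mpsc` channel.

```rust
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Hypothetical helper: sums `data` by giving each of `workers` threads one chunk.
fn parallel_sum(data: Vec<u64>, workers: usize) -> u64 {
    let workers = workers.max(1);
    // Arc lets every thread hold a shared, read-only handle to the same Vec.
    let data = Arc::new(data);
    let (tx, rx) = mpsc::channel::<u64>();
    // Ceiling division so that all elements are covered.
    let chunk = (data.len() + workers - 1) / workers;

    let mut handles = Vec::with_capacity(workers);
    for id in 0..workers {
        let data = Arc::clone(&data);
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            // Clamp the range so surplus workers get an empty slice.
            let start = (id * chunk).min(data.len());
            let end = (start + chunk).min(data.len());
            let partial: u64 = data[start..end].iter().sum();
            tx.send(partial).expect("receiver dropped");
        }));
    }
    // Drop the original sender so the receiver's iterator ends
    // once every worker has sent its result and exited.
    drop(tx);

    let total: u64 = rx.iter().sum();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    total
}

fn main() {
    let total = parallel_sum((1..=100).collect(), 4);
    println!("total = {total}"); // prints "total = 5050"
}
```

Dropping the original `tx` before reading matters: the receiver's iterator only terminates when every sender has been dropped, so keeping the original alive would block the final sum forever.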