MapReduce Architecture

A classic distributed architecture for big data processing.

CUDA Multiple GPU

1. Explicit cross-GPU memcpy plus per-device operations: use DMA for peer-to-peer (P2P) data transfer where available; otherwise, route the data through the CPU. cudaS...

CUDA Data Transmission

Async memory movement: cudaMemcpy() and cudaMalloc() are synchronous calls. To make them asynchronous, use cudaMallocAsync(), cudaMem...

CUDA Multi Streaming

A stream is like a FIFO queue: tasks in the same stream are executed sequentially, in the order they were issued. By default, CUDA executes kernels in def...

Latency Hiding: CUDA Async Pipeline Execution

A CUDA kernel launch returns immediately and the host continues to execute, i.e., device and host execution are async....

Overlapping Memory Access and Compute: Double Buffering and Multi-Stage Pipelining

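What makes the overlap possible? The compute units and the memory-access units are independent hardware. Double buffering: allocate two identical shared-memory (SMem) buffers and use warp specialization so that different warps handle different tasks, e.g. the first half of the warps do the computation while the second half handle loads and stores. Late...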

Commonly Used Official CUDA Libraries

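When tuning operator performance, the block size is also tunable, as is the choice of kernel implementation. cuDNN; cuBLAS, cuBLASXt, cuBLASLt; CUB: a low-level library of parallel algorithm primitives that provides efficient thread-level, warp-level...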

Rust: Multi-Processing

Multi-process programming in Rust.

Concurrency in OS: Instruction Reordering & Memory Model

Going beyond concurrent programming, this article dives deeper into the OS to discuss concurrency-related issues.

Rust: Concurrent Programming

Learn a few Rust basics through exercises, covering basic thread creation, using the Arc pointer, and using mpsc.
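As a taste of the three pieces named above, here is a minimal sketch that combines them: worker threads created with `thread::spawn`, read-only data shared through `Arc`, and partial results sent back over an `mpsc` channel. The function name `parallel_sum` and the chunking scheme are assumptions made for illustration, not taken from the post's exercises.

```rust
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Hypothetical helper: sums `data` using `workers` threads.
fn parallel_sum(data: Vec<u64>, workers: usize) -> u64 {
    let workers = workers.max(1);
    // Arc lets every thread hold a shared, read-only handle to the same Vec.
    let data = Arc::new(data);
    let (tx, rx) = mpsc::channel();
    // Ceiling division so that every element falls into some chunk.
    let chunk = ((data.len() + workers - 1) / workers).max(1);

    let mut handles = Vec::new();
    for id in 0..workers {
        let data = Arc::clone(&data);
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            // Clamp the range so surplus workers get an empty slice.
            let start = (id * chunk).min(data.len());
            let end = (start + chunk).min(data.len());
            let partial: u64 = data[start..end].iter().sum();
            tx.send(partial).expect("receiver dropped");
        }));
    }
    // Drop the original sender; the receiver loop ends once all clones are gone.
    drop(tx);

    let total: u64 = rx.iter().sum();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    total
}

fn main() {
    let data: Vec<u64> = (1..=100).collect();
    println!("sum = {}", parallel_sum(data, 4)); // prints "sum = 5050"
}
```

Dropping the original `tx` matters here: `rx.iter()` only finishes when every sender has been dropped, so forgetting that line would make the program hang.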
