2026

03-26  MapReduce Architecture
03-25  CUDA Multiple GPU
03-25  CUDA Data Transmission
03-25  CUDA Multi Streaming
03-25  Latency Hiding: CUDA Async Pipeline Execution
03-25  Compute/Memory Overlap: Double Buffering and Multi-Stage Pipelining
03-25  Commonly Used Official CUDA Libraries
03-24  Rust: Multi-Processing
03-24  Concurrency in OS: Instruction Reordering & Memory Model
03-23  Rust: Concurrent Programming
03-23  Tuning Technique
03-19  .pth Model Format of PyTorch
03-18  Sage Attention v1, v2, v3 Code Walkthrough (2)
03-18  [Paper] Sage Attention v2 and v2++
03-18  [Paper] Sage Attention v1: INT8 PTQ for Attention
03-18  [Paper] HO-SFL: Hybrid-Order Split Federated Learning with BP-Free Client and Dimension-Free Aggregation
03-17  The ReAct Agent Framework
03-17  The Singleton Pattern
03-17  Git Snippets: Cleaning Up Complex History with Hard Reset, Soft Reset, and Merge
03-17  Git Snippets: Merging an Upstream Branch Locally
03-16  NVIDIA GPU Deep Dive: Tensor Cores
03-16  CUDA Kernel Optimization: Warp Divergence
03-16  CUDA Kernel Optimization: ILP
03-16  CUDA Kernel Optimization: Micro-Instruction Tuning
03-16  CUDA Kernel Optimization: PTX
03-16  CUDA Kernel Optimization: Quantization
03-16  Design Pattern: Factory Method
03-16  ninetoothed: CodeGenerator Workflow
03-11  Rust Iterators
03-11  Rust Trait (3): TryFrom, TryInto
03-11  ninetoothed Project Notes
03-10  Rust STL (2): Vector
03-10  Rust STL (1): HashMap
03-10  Rust Trait (2): From, Into
03-10  Rust Trait (1): AsRef, AsMut
03-10  LLM Inference (1): Chat Server and Streaming Output
03-09  A Quick Guide to gflags: a C++ Command-Line Argument Parsing Library
03-09  Model Training Frameworks: Model Checkpoints
03-09  The AutoGrad Mechanism in PyTorch
03-09  Activation Checkpointing
03-09  ZeRO: Zero Redundancy Optimizer
03-09  Distributed Training
03-08  PyTorch Extension: Operator Integration
03-08  Sage Attention v1, v2, v3 Code Walkthrough (1): INT8 Per-Block Quant Kernel
03-07  Bank Conflict
03-07  GPU Parallelism: PTX
03-07  Memory Alignment & Coalescing
03-07  SIMD Optimization
03-07  A Quick Guide to Nsight Compute
03-07  The Roofline Model
03-07  A Quick Guide to cuda-gdb
03-07  Querying Device Information in CUDA
03-07  CUDA Technique: Grid-Strided Loop
03-07  A Quick Guide to Nsight Systems
03-07  The CUDA Compilation Pipeline
03-07  GPU Architecture for CUDA
03-07  CUDA Optimization: Swizzling
03-07  CUDA Kernel: ArgMax
03-06  AI Infra Engineering: Abstraction
03-06  Git Snippets: Combining Commits
03-05  The PyTorch ATen Operator System
03-05  InfiniTensor AI Compiler v2.0 Notes: GraphBuilder
03-05  Raft Consensus Protocol
03-05  [Paper] Merge Then Compress
03-05  A Quick Guide to slurm & srun on Compute Platforms
03-04  InfiniTensor AI Compiler v2.0 Notes
03-04  Python and C/C++ Interop (2): Pybind11
03-04  Data Format Conversion Between NumPy and PyTorch, and Binary Storage
03-03  NF4 Dequant CUDA Kernel Optimization Process (1)
03-03  Smart Pointers in Rust
03-01  [Paper] QLoRA Explained: a 4-bit Scheme for LLMs and Double Quantization
03-01  All CUDA Development Packages on Arch Linux
03-01  Git Snippets: Clone First, Then Fetch Submodules
02-26  Common Probability Distributions
02-26  Git Snippets: Syncing a New Upstream Branch into Your Fork
02-25  Useful Typst Packages and Common Settings
02-25  Building RISC-V Linux from Scratch on ArchLinux and Emulating It with QEMU
02-24  Bash Associative Array (Dictionary)
02-23  Rust: Crate & Package & Module
02-22  Rust Generics
02-22  C++ Smart Pointers and Resource Management
02-22  Google C++ Style Guide
02-22  Python Decorator
02-22  The static Keyword in C++
02-21  Python and C/C++ Interop (1): the ctypes Library
02-19  [Paper] Does Training with Synthetic Data Truly Protect Privacy?
02-19  Tensor Storage Layout, Strides, and Their Relation to Tensor Operations
02-19  Developing Smart Contracts with the Foundry Toolchain
02-19  Essential Solidity Syntax
02-18  OCaml: Basic Types
02-18  OCaml: Basic Syntax
02-18  A Summary of Advanced Haskell Concepts
02-16  Haskell: Applicative
02-16  Practical Haskell: Text/ByteString & Web HTTP
02-15  Haskell: Monads & Applicative
02-15  Haskell: Functors
02-15  IO in Haskell
02-15  How Does Haskell Work
02-15  The Haskell Type System
02-15  Haskell Basic Syntax
02-15  Git Snippets: Forking a New Branch from an Old Commit
02-15  [Paper] LoRA Fine-tuning
02-15  Writing Flash Attention in Triton
02-14  Writing a Flash Attention Kernel in CUDA
02-12  Mapping CapsLock to Escape on ArchLinux
02-09  [Paper] Flash Mask: Arbitrary Masking on Flash Attention to Fit Different Tasks
02-08  [Paper] DeepSeek's FP8 Training Scheme
02-08  [Paper] Flash Attention
02-07  [Paper] Sage Attention v3
02-07  The Second Half of AI
02-07  Configuring HKU WiFi with nmcli