VideoRAG

RAG extension for videos.

RAG

RAG-Anything

RAG-Anything addresses the fact that previous RAG systems are often text-centric. Limitations include: text-only f...

RAG

MiniRAG

The main concerns of MiniRAG are cost, privacy, and storage. MiniRAG is based on three observations: SLMs are weak at semantic unde...

RAG

LightRAG

A lightweight improvement on GraphRAG.

RAG

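DeepSeek FP8 Training Scheme

When DeepSeek-V3 was released, its FP8 training solution and pipeline were disclosed as well. Low-precision training frameworks are becoming increasingly important; after all, fast and good is what wins.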

Tech Reports

Rotary Position Embedding

RoPE

Techniques

Pipeline Parallelism

Parallelize the pipeline by overlapping stages.

Activation Checkpointing

In essence, activation checkpointing trades computation for memory. During forward, only a few activation checkpoints are ...
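
The trade of computation for memory can be illustrated with a minimal sketch. It assumes a chain of scalar toy layers; `LAYERS`, `grad_full`, `grad_checkpointed`, and the `every` parameter are hypothetical names introduced only for this illustration, not part of any framework.

```python
import math

# Each toy "layer" is a (forward, derivative) pair on scalars.
LAYERS = [
    (math.sin, math.cos),
    (lambda x: x * x, lambda x: 2 * x),
    (math.tanh, lambda x: 1 - math.tanh(x) ** 2),
    (lambda x: 3 * x + 1, lambda x: 3.0),
]

def grad_full(x):
    """Baseline: keep every layer input, then apply the chain rule."""
    inputs = []
    for f, _ in LAYERS:
        inputs.append(x)
        x = f(x)
    g = 1.0
    for (_, df), a in zip(reversed(LAYERS), reversed(inputs)):
        g *= df(a)
    return g

def grad_checkpointed(x, every=2):
    """Keep only the input of every `every`-th layer; recompute the rest."""
    checkpoints = {}
    for i, (f, _) in enumerate(LAYERS):
        if i % every == 0:
            checkpoints[i] = x      # stored activation
        x = f(x)                    # all other inputs are dropped
    g = 1.0
    # Walk segments from last to first, recomputing inputs inside each one.
    for start in sorted(checkpoints, reverse=True):
        segment = LAYERS[start:start + every]
        a = checkpoints[start]
        seg_inputs = []
        for f, _ in segment:
            seg_inputs.append(a)
            a = f(a)                # extra forward work paid here
        for (_, df), a_in in zip(reversed(segment), reversed(seg_inputs)):
            g *= df(a_in)
    return g
```

Both functions return the same gradient; the checkpointed version holds at most `every` recomputed inputs plus the checkpoints at any time, at the cost of one additional forward pass per segment.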

Training System

ZeRO: Zero Redundancy Optimizer

For distributed training, partitions model weights and optimizer states across devices to reduce GPU memory usage.

Training System/Distributed

Radio: Rate-Distortion Optimization for LLM Compression

Proposes a framework for analyzing model quantization.

Optimization/Compression Optimization/Quantization/Low Bit

GraphRAG

Integrates knowledge graphs with RAG to improve RAG accuracy in complex scenarios.

RAG

Prefix Caching (RadixAttention)

Prefix caching is an optimization built on top of the KV cache that is heavily exploited in multi-turn sessions.
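
As a rough illustration of why multi-turn sessions benefit, the sketch below counts how many leading tokens of a new request already have cached entries. It is a simplification: RadixAttention organizes prefixes in a radix tree, whereas this toy `PrefixCache` class (a hypothetical name) uses a plain dictionary of prefixes and stores placeholder strings instead of real KV tensors.

```python
class PrefixCache:
    """Toy prefix cache: maps token prefixes to a placeholder KV entry."""

    def __init__(self):
        self._store = {}  # tuple of tokens -> cached KV placeholder

    def insert(self, tokens):
        # Register every prefix so later requests can reuse any of them.
        for n in range(1, len(tokens) + 1):
            self._store.setdefault(tuple(tokens[:n]), f"kv[{n}]")

    def match(self, tokens):
        """Return how many leading tokens already have cached KV."""
        n = 0
        while n < len(tokens) and tuple(tokens[:n + 1]) in self._store:
            n += 1
        return n

cache = PrefixCache()
turn1 = ["<sys>", "You", "are", "helpful", "<user>", "Hi"]
cache.insert(turn1)

# The second turn repeats the whole first turn and appends new tokens.
turn2 = turn1 + ["<assistant>", "Hello", "<user>", "Bye"]
reused = cache.match(turn2)
to_prefill = len(turn2) - reused
```

Only the `to_prefill` trailing tokens would need a fresh prefill pass; the shared history is served from the cache.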

Inference System/KV Cache Optimization

Prefill Decode Disaggregation

Prefill-decode (PD) disaggregation in LLM inference.

Inference System/KV Cache Optimization

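KV Cache

The KV cache is the key technique that lets large models remember very long contexts, and it is one of the most important optimizations in LLM inference.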
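
A minimal sketch of the idea, assuming single-head attention over tiny vectors: during decoding, keys and values of earlier tokens are stored once and reused instead of being recomputed at every step. The `project` helper is a hypothetical stand-in for learned linear projections.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z for d in range(dim)]

def project(x, scale):
    # Stand-in for a learned linear projection (assumption for this sketch).
    return [scale * xi for xi in x]

def decode_without_cache(tokens):
    """Recompute K and V for the whole prefix at every step."""
    outs = []
    for t in range(len(tokens)):
        keys = [project(x, 0.5) for x in tokens[: t + 1]]
        values = [project(x, 2.0) for x in tokens[: t + 1]]
        outs.append(attend(tokens[t], keys, values))
    return outs

def decode_with_cache(tokens):
    """Append one K/V pair per step and reuse all earlier ones."""
    k_cache, v_cache, outs = [], [], []
    for x in tokens:
        k_cache.append(project(x, 0.5))
        v_cache.append(project(x, 2.0))
        outs.append(attend(x, k_cache, v_cache))
    return outs

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
```

The two decoders produce identical outputs, but the cached version performs one projection per token rather than one per token per step, which is where the savings on long contexts come from.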

Inference System/KV Cache Optimization

Sage Attention v3

Sage Attention v3 goes a step further than the previous two works and proposes FP4 inference and an INT8 training framework.

Attention Study/Low Bit Optimization/Quantization/Low Bit

Sage Attention v2 and v2++

The second version of Sage Attention and its improvements.

Attention Study/Low Bit Optimization/Quantization/Low Bit

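Sage Attention v1: INT8 PTQ for Attention

Applies low-precision methods to Flash Attention. The computation pattern is the same as in Flash Attention; the overall speedup comes mainly from faster low-precision computation minus the quantization overhead, while still preserving a reasonable level of accuracy.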
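
To make the "low-precision speedup minus quantization overhead" trade concrete, here is a minimal sketch of symmetric INT8 quantization applied to a single query-key dot product. It is not the Sage Attention kernel: the per-vector scale and the helper names `quantize_int8` and `int8_dot` are assumptions chosen for illustration.

```python
def quantize_int8(vec):
    """Symmetric quantization: one float scale per vector."""
    amax = max(abs(v) for v in vec) or 1.0   # avoid a zero scale
    scale = amax / 127.0
    return [round(v / scale) for v in vec], scale

def int8_dot(q, k):
    q_i, q_s = quantize_int8(q)              # quantization overhead
    k_i, k_s = quantize_int8(k)
    acc = sum(a * b for a, b in zip(q_i, k_i))  # integer accumulate
    return acc * q_s * k_s                   # dequantize once at the end

q = [0.12, -0.80, 0.33, 0.05]
k = [0.40, 0.25, -0.61, 0.90]
exact = sum(a * b for a, b in zip(q, k))
approx = int8_dot(q, k)
```

The integer accumulation is the part that maps to fast low-precision hardware paths, while the two `quantize_int8` calls and the final rescale are the overhead the summary refers to.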

Attention Study/Low Bit Optimization/Quantization/Low Bit

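QLoRA Explained: A 4-bit Scheme for LLMs and Double Quantization

Algorithm: 2-Level (Double) Quantization. QLoRA uses a two-stage quantization scheme. We first walk through how quantization proceeds and which variables need to be stored. First Level Quantization: for the input weights, assume a size of R×C...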
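
The two-level idea can be sketched as follows, under loud simplifications: uniform integer levels are used instead of QLoRA's actual 4-bit data type, the block size is tiny, and the function names (`absmax_quantize`, `double_quantize`, `dequantize`) are hypothetical. The point is only which variables end up stored: per-weight codes, per-block scale codes, and one scale for the scales.

```python
def absmax_quantize(values, levels):
    """Quantize to integers in [-levels, levels] with one absmax scale."""
    scale = (max(abs(v) for v in values) or 1.0) / levels
    return [round(v / scale) for v in values], scale

def double_quantize(weights, block=4):
    # Level 1: low-bit codes per block, one float scale per block.
    codes, scales = [], []
    for i in range(0, len(weights), block):
        c, s = absmax_quantize(weights[i:i + block], levels=7)
        codes.append(c)
        scales.append(s)
    # Level 2: quantize the per-block scales themselves to 8-bit codes.
    scale_codes, scale_of_scales = absmax_quantize(scales, levels=127)
    return codes, scale_codes, scale_of_scales

def dequantize(codes, scale_codes, scale_of_scales):
    out = []
    for c, sc in zip(codes, scale_codes):
        s = sc * scale_of_scales          # recover the block scale
        out.extend(q * s for q in c)
    return out

weights = [0.5, -0.25, 0.1, 0.0, 2.0, -1.0, 0.3, 0.7]
codes, scale_codes, scale_of_scales = double_quantize(weights)
restored = dequantize(codes, scale_codes, scale_of_scales)
```

After the second level, the only full-precision number kept is `scale_of_scales`; the per-block scales are stored as small integers, which is where the extra memory saving comes from.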

Post Training/SFT Optimization/Quantization/Low Bit

Markov Decision Processes

Markov Decision Process (MDP) serves as the theoretical foundation of RL.

Post Training/RL/Foundation

Formulation of Reinforcement Learning

A brief introduction to the general background of RL.

Post Training/RL/Foundation

Paged Attention

Adopts memory-management ideas from operating systems and applies them to KV cache management during inference, which improves performance. This work also gave rise to vLLM, a popular LLM deployment framework.
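
The operating-system analogy can be sketched with a toy block table: each sequence maps logical token positions to fixed-size physical blocks drawn from a shared free list, much like virtual pages map to physical frames. `PagedKVAllocator` and its methods are hypothetical names for this illustration and do not reflect vLLM's actual API.

```python
class PagedKVAllocator:
    """Toy block allocator: logical token positions -> physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical block ids
        self.tables = {}                      # seq id -> list of block ids
        self.lengths = {}                     # seq id -> tokens stored

    def append_token(self, seq):
        table = self.tables.setdefault(seq, [])
        n = self.lengths.get(seq, 0)
        if n % self.block_size == 0:          # last block full, or none yet
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq] = n + 1
        # Physical (block, offset) where this token's K/V would be written.
        return table[n // self.block_size], n % self.block_size

    def release(self, seq):
        # Return all blocks of a finished sequence to the free list.
        self.free.extend(self.tables.pop(seq, []))
        self.lengths.pop(seq, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=2)
slots_a = [alloc.append_token("a") for _ in range(3)]
slots_b = [alloc.append_token("b") for _ in range(2)]
```

Because blocks are allocated on demand and need not be contiguous, a sequence wastes at most one partially filled block, and freed blocks can be reused by any other sequence.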

Inference System/KV Cache Optimization

Tuning Technique

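Automatically searches over parameter configurations based on the algorithm and the hardware, selecting the best one to maximize program efficiency.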

AI Compiler

Memory Optimization in AI Compiler

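The main ideas behind memory management optimization are: adjust the timing and number of allocations and frees to reduce the runtime overhead of allocating and freeing memory; and optimize the allocation process to minimize total memory usage, for example by tracing the computation graph and allocating only the memory required at peak. The conventional approach usually runs twice: the first run is AI Comp...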

AI Compiler

Operator-Level Optimization in AI Compilers

Some optimization techniques for operators in AI compilers.

AI Compiler

Multi-Head Attention & Multi-Head Self-Attention

MHA aims to capture information between tokens from different "aspects" through its "heads", which literally split the embedding dimension into several non-overlapping parts.
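
The splitting itself is simple enough to show directly. This is a minimal sketch on a single token embedding stored as a Python list; in practice the split is applied to the projected query, key, and value tensors, and `split_heads` and `merge_heads` are hypothetical helper names.

```python
def split_heads(x, num_heads):
    """Split one token embedding into equal, non-overlapping head slices."""
    if len(x) % num_heads:
        raise ValueError("embedding size must be divisible by num_heads")
    d = len(x) // num_heads
    return [x[h * d:(h + 1) * d] for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate per-head vectors back into one embedding."""
    return [v for head in heads for v in head]

embedding = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
heads = split_heads(embedding, num_heads=4)
```

Each head then runs attention on its own slice, and `merge_heads` restores the original embedding width before the output projection.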

Attention Study/Foundation

Self Attention

Self-attention is built on top of attention and is meant to capture relationships between tokens in the same sequence.

Attention Study/Foundation

Naive Attention

The foundation of all of AI.

Attention Study/Foundation

LeSTD: Learning-Based Sparse Tensor Decomposition

Focuses on problems in LLM Compression.

Optimization/Compression

An Overview on Frontend Optimization of AI Compiler

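Graph-level optimization: without changing the arithmetic result, apply graph substitutions to a computation graph structure according to predefined rules. Read/write redundancy: some computation scenarios involve repeated memory reads and writes or non-contiguous memory access, which lowers the cache hit rate and causes extra memory traffic. Structural redundancy: the model contains ineffective...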

AI Compiler

An Overview on AI Compiler

AI compiler technology from a top-level view.

AI Compiler

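AutoGrad: Automatic Differentiation

When PyTorch code executes a computation, AutoGrad builds a DAG computation graph made of Function objects, which is used for backpropagation. Each Function object represents an operation: its .apply() computes the forward result and records its backward logic .grad...

Training System/Principles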
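
The graph-building idea can be mimicked in a few lines. The sketch below is a toy scalar autograd, not PyTorch's implementation: the `Value` class and its `backward_fn` attribute are hypothetical names, and each operation simply records its parents and a function that maps the upstream gradient to the parents' gradients.

```python
class Value:
    """Scalar node in a toy autograd DAG (not PyTorch's implementation)."""

    def __init__(self, data, parents=(), backward_fn=None):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self.backward_fn = backward_fn  # maps upstream grad -> parent grads

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     lambda g: (g, g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     lambda g: (g * other.data, g * self.data))

    def backward(self):
        # Topologically order the DAG, then push gradients to the leaves.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node.parents:
                    visit(p)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node.backward_fn:
                for p, g in zip(node.parents, node.backward_fn(node.grad)):
                    p.grad += g

x, y = Value(2.0), Value(3.0)
z = x * y + x   # dz/dx = y + 1, dz/dy = x
z.backward()
```

Gradients are accumulated with `+=` because `x` feeds two different operations, so its total gradient is the sum over both paths through the graph.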

Q-RAG

Accepted by ICLR 2026. To improve LLMs' multi-step retrieval ability, the paper proposes using RL to train embedding models instead of fine-tuning the LLMs.

RAG

Flash Mask: Arbitrary Masking on Flash Attention to Fit Different Tasks

Reading CUDA code is truly a delight (ugh)

Attention Study

LoRA Fine-tuning

It seems to have become the industry's standard method for fast SFT on downstream tasks (or has it?)

Post Training/SFT