Arca's Blog

VideoRAG

RAG extension for videos.

2026-05-15

RAG-Anything

RAG-Anything addresses the problem that previous RAG systems are often text-centric. Limitations include: text-only f...

2026-05-15

MiniRAG

The main concerns of MiniRAG are cost, privacy, and storage. MiniRAG is based on 3 observations: SLMs are weak at semantic unde...

2026-05-15

LightRAG

A lightweight improvement on GraphRAG.

2026-05-15

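DeepSeek FP8 Training Scheme

When DeepSeek-V3 was released, its FP8 training solution and pipeline were disclosed as well. Low-precision training frameworks are in fact becoming increasingly important; after all, fast and good is what wins.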

2026-05-15

Rotary Position Embedding

RoPE

2026-05-15

Pipeline Parallelism

Parallelize the pipeline by overlapping stages.

2026-05-15

Activation Checkpointing

In essence, activation checkpointing trades extra computation for lower memory usage. During forward, only a few activation checkpoints are ...

2026-05-15

ZeRO: Zero Redundancy Optimizer

For distributed training scenarios, partitions the model weights and optimizer states across devices to reduce GPU memory usage.

2026-05-14

Radio: Rate-Distortion Optimization for LLM Compression

Proposes a framework for analyzing model quantization.

2026-05-14

GraphRAG

An integration of knowledge graphs and RAG that improves RAG accuracy in complex scenarios.

2026-05-13

Prefix Caching (RadixAttention)

Prefix caching is an optimization technique built on top of the KV cache that is heavily exploited in multi-turn sessions.

2026-05-13

Prefill Decode Disaggregation

Prefill-decode (PD) disaggregation in LLM inference.

2026-05-12

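The KV Cache Technique

KV Cache is the key technique that lets large models remember very long contexts, and it is one of the most important optimizations in LLM inference.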

2026-05-12

Sage Attention v3

Sage Attention v3 goes a step further than the previous two works, proposing FP4 inference and an INT8 training framework.

2026-05-11

Sage Attention v2 and v2++

The second version of Sage Attention and its refinement.

2026-05-11

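Sage Attention v1: INT8 PTQ for Attention

Applies low-precision methods to Flash Attention. The computation pattern is the same as Flash Attention's; the overall speedup comes mainly from the faster low-precision computation minus the quantization overhead, while still preserving a reasonable level of accuracy.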

2026-05-11

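QLoRA Explained: 4-bit LLM Scheme and Double Quantization

Algorithm: 2-Level (Double) Quantization. QLoRA uses a two-stage quantization scheme. Let us first walk through how quantization proceeds and which variables need to be stored. First Level Quantization: for the input weights, assume a size of R×C...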

2026-05-11

Markov Decision Processes

Markov Decision Process (MDP) serves as the theoretical foundation of RL.

2026-05-10

Formulation of Reinforcement Learning

A brief introduction to the general background of RL.

2026-05-10

Paged Attention

Adopts memory-management ideas from operating systems and applies them to KV Cache management during inference, yielding a performance boost. This work also gave rise to vLLM, a popular LLM deployment framework.

2026-05-10

Tuning Technique

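Automatically searches, based on the algorithm and the hardware, for the optimal parameter configuration to maximize program efficiency.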

2026-05-06

Memory Optimization in AI Compiler

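The main ideas behind memory-management optimization are: adjust when and how often memory is allocated and freed, to reduce the runtime overhead of allocation and deallocation; and optimize the allocation process to minimize the total memory footprint, for example by tracing the computation graph and allocating only the memory needed at peak. The conventional approach usually runs twice: the first run is AI Comp...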

2026-05-06

Operator-Level Optimization in AI Compilers

Some optimization techniques for operators in AI compilers.

2026-05-06

Multi-Head Attention & Multi-Head Self-Attention

MHA aims to capture information between tokens from different "aspects" through its "heads", which literally split the embedding dimension into several non-overlapping parts.

2026-05-02

Self Attention

Self-attention is built on top of attention and is meant to capture relationships between tokens in the same sequence.

2026-05-02

Naive Attention

The foundation of all of AI.

2026-05-02

LeSTD: Learning-Based Sparse Tensor Decomposition

Focuses on problems in LLM compression.

2026-05-01

An Overview on Frontend Optimization of AI Compiler

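Graph-level optimization takes a computation graph and, without changing the arithmetic result, applies graph substitutions to it according to predefined rules. Read/write redundancy: some computations repeatedly read and write memory or access memory non-contiguously, which lowers the cache hit rate and causes unnecessary memory traffic. Structural redundancy: the model contains useless...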

2026-04-28

An Overview on AI Compiler

AI compiler technology from a top-level perspective.

2026-04-28

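AutoGrad: Automatic Differentiation

When PyTorch code performs a computation, AutoGrad builds a DAG computation graph made of Function objects, which is later used for backpropagation. Each Function object represents an operation: it computes the forward result through its .apply() and records its backward logic .grad...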

2026-04-26

Q-RAG

Accepted at ICLR 2026. To improve LLMs' multi-step retrieval ability, it proposes using RL to train embedding models instead of fine-tuning the LLMs.

2026-04-26

Flash Mask: Arbitrary Masking on Flash Attention to Fit Different Tasks

Reading CUDA code is truly a pleasure (gag)

2026-04-26

LoRA Fine-tuning

Seems to have become the industry's standard method for quick SFT on downstream tasks (or has it?)

2026-04-26