VideoRAG
RAG extension for videos.
RAG-Anything addresses the fact that previous RAG systems are often text-centric. Limitations include: text-only f...
The main concerns of MiniRAG are cost, privacy, and storage. MiniRAG is based on 3 observations: SLMs are weak at semantic unde...
A lightweight improvement on GraphRAG.
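When DeepSeek V3 was released, its FP8 training solution and pipeline were disclosed as well. Low-precision training frameworks are in fact becoming increasingly important; after all, being both fast and good is what wins.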
RoPE
Parallelize the pipeline by overlapping stages.
In essence, activation checkpointing trades computation for memory. During forward, only a few activation checkpoints are ...
For distributed training scenarios: partition the model weights and optimizer states to reduce GPU memory usage.
Proposes a framework for analyzing model quantization.
Integrates knowledge graphs with RAG to improve RAG accuracy in complex scenarios.
Prefix caching is an optimization technique built on top of the KV cache that is heavily exploited in multi-turn sessions.
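As a rough illustration of why multi-turn sessions benefit, the sketch below counts how many leading tokens of a request could reuse cached work. It is a toy model, not how any particular serving framework implements it: `PrefixCache`, `BLOCK_SIZE`, and the string placeholder standing in for real key/value tensors are all assumptions made for the example.

```python
BLOCK_SIZE = 4  # tokens per cached block (illustrative value, an assumption)


class PrefixCache:
    """Toy prefix cache: maps a full token prefix to a placeholder KV block."""

    def __init__(self):
        # Key: tuple of every token up to the end of a block.
        # Value: stand-in for the KV tensors of that block.
        self.blocks = {}

    def match_and_insert(self, tokens):
        """Return how many leading tokens hit the cache, then cache the rest."""
        reused = 0
        # Only complete blocks are cached; a trailing partial block is ignored.
        full = len(tokens) - len(tokens) % BLOCK_SIZE
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])
            if key in self.blocks:
                reused = end  # this block's KV entries can be reused
            else:
                self.blocks[key] = "kv-placeholder"
        return reused


cache = PrefixCache()
first_turn = [1, 2, 3, 4, 5, 6, 7, 8, 9]
second_turn = first_turn + [10, 11, 12, 13]  # history plus the new message

print(cache.match_and_insert(first_turn))   # nothing cached yet
print(cache.match_and_insert(second_turn))  # the shared history is reused
```

Because each later turn repeats the whole conversation history, most of its prompt falls inside blocks that are already cached, so only the new tokens need fresh computation.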
Prefill-decode (PD) disaggregation in LLM inference.
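KV Cache is the key technique that lets large models remember very long contexts, and it is one of the most important optimizations in LLM inference.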
Sage Attention v3 goes a step further than the previous two works and proposes an FP4 inference and INT8 training framework.
The second version of Sage Attention and its improvements.
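Applies low-precision methods to Flash Attention. The computation pattern is the same as in Flash Attention; the overall speedup comes mainly from the faster low-precision computation minus the quantization overhead, while a certain level of accuracy is still preserved.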
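Algorithm: 2-Level (Double) Quantization. QLoRA uses a two-stage quantization scheme. Let us first go through how quantization works and which variables need to be stored. First Level Quantization: for the input weights, assume they have size R×C...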
Markov Decision Process (MDP) serves as the theoretical foundation of RL.
Briefly introduce the general background of RL.
Adopts memory-management ideas from operating systems and applies them to KV Cache management during inference, yielding a performance boost. This work later gave rise to vLLM, a popular LLM deployment framework.
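Automatically searches over parameter configurations based on the algorithm and the hardware, and selects the optimal one to maximize program efficiency.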
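The main ideas behind memory-management optimization are: adjust when and how often memory is allocated and freed, to reduce the runtime overhead of allocation and deallocation; and optimize the allocation process to keep the total memory footprint as small as possible, for example by tracing the computation graph and allocating only the memory needed at the peak. The conventional approach usually runs twice: the first run is AI Comp...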
Some optimization techniques for operators in AI compilers.
MHA aims to capture information between tokens from different "aspects" through its "heads", which literally split the embedding dimension into several non-overlapping parts.
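The split itself can be sketched in a few lines. This is a minimal illustration, assuming contiguous, equal-sized slices; `split_heads` and `merge_heads` are hypothetical helper names, and in a real attention layer the split is applied to the projected query, key, and value vectors (the projections and the attention computation are omitted here).

```python
def split_heads(embedding, num_heads):
    """Split one token's vector into num_heads contiguous, non-overlapping slices."""
    d_model = len(embedding)
    if d_model % num_heads != 0:
        raise ValueError("embedding dimension must be divisible by num_heads")
    head_dim = d_model // num_heads
    return [embedding[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]


def merge_heads(heads):
    """Concatenate the per-head slices back into a single vector."""
    return [x for head in heads for x in head]


token = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # d_model = 8
heads = split_heads(token, num_heads=4)           # 4 heads, 2 dimensions each

print(heads)
print(merge_heads(heads) == token)  # the split loses no information
```

Each head attends using only its own slice, and the per-head outputs are concatenated again afterwards, which is what `merge_heads` mirrors.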
Self-attention is built on top of attention and is meant to capture relationships between tokens in the same sequence.
The foundation of the whole AI.
Focuses on problems in LLM Compression.
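Graph-level optimization: given a computation graph structure, apply graph substitutions according to predefined rules without changing the arithmetic result. Read/write redundancy: some computations read and write memory repeatedly or access memory non-contiguously, which lowers the cache hit rate and causes unnecessary memory transfers. Structural redundancy: the model contains ineffective...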
AI compiler technology from a top-level view.
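When PyTorch code performs computation, AutoGrad builds a DAG computation graph made of Function objects, which is used for backpropagation. Each Function object represents an operation: its .apply() computes the forward result and records its backward logic .grad...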
Accepted by ICLR 2026. To improve LLMs' ability on multi-step retrieval, the paper proposes using RL to train embedding models instead of fine-tuning the LLMs.
Reading CUDA code is truly a joy (gag).
It seems to have become the industry's standard method for quickly running SFT on downstream tasks (or has it?).