RAG-Anything

RAG-Anything addresses a shortcoming of earlier RAG systems: they are largely text-centric. Their limitations include:

- text-only focus
- knowledge loss from ignoring images, tables, equations and slides
- poor parsing of PDFs, PPTs and complex layouts
- restricted applicability to real enterprise or academic documents

RAG-Anything proposes a unified framework for multimodal retrieval across text, visual elements, tables and math expressions.

Universal Indexing

Multimodal canonicalization. We decompose documents into atomic units: text, image, table and equation.
Graph construction. We build two graphs, a multimodal graph and a text graph:
- Multimodal graph: anchors non-text elements to their surrounding text, preserving document layout and cross-modal relations
- Text graph: built from entities extracted from textual chunks
We then merge the two graphs through entity alignment.
Finally, we encode entities, relations, chunks and multimodal elements into dense vectors.
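
The indexing steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the RAG-Anything implementation: `Unit`, `toy_entities`, `align`, and the relation names are hypothetical, the entity extractor is a capitalized-word heuristic, alignment is a case-insensitive exact match, and the dense-vector encoding step is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    uid: str
    modality: str  # "text", "image", "table", or "equation"
    content: str   # raw text, or a caption/description for non-text units

def toy_entities(text):
    """Hypothetical extractor: treats capitalized words as entities."""
    return {w.strip(".,:;") for w in text.split() if w[:1].isupper()}

def align(name):
    """Toy entity alignment: case-insensitive exact match."""
    return name.lower()

def multimodal_graph(units):
    """Anchor each non-text unit to the nearest text unit in document order."""
    edges = set()
    text_pos = [i for i, u in enumerate(units) if u.modality == "text"]
    for i, u in enumerate(units):
        if u.modality == "text" or not text_pos:
            continue
        nearest = min(text_pos, key=lambda j: abs(i - j))
        edges.add((u.uid, "anchored_to", units[nearest].uid))
        for ent in toy_entities(u.content):
            edges.add((ent, "described_by", u.uid))
    return edges

def text_graph(units):
    """Link each extracted entity to the text chunk that mentions it."""
    edges = set()
    for u in units:
        if u.modality == "text":
            for ent in toy_entities(u.content):
                edges.add((ent, "mentioned_in", u.uid))
    return edges

def merge(g1, g2):
    """Union both edge sets; entity nodes are collapsed through align()."""
    merged = set()
    for src, rel, dst in g1 | g2:
        # Only entity edges have an entity name as source; anchors link unit ids.
        merged.add((src if rel == "anchored_to" else align(src), rel, dst))
    return merged

units = [
    Unit("t1", "text", "The Transformer model uses attention."),
    Unit("i1", "image", "Diagram of the TRANSFORMER encoder"),
    Unit("t2", "text", "Results are reported in the table."),
]
graph = merge(multimodal_graph(units), text_graph(units))
```

After merging, the node `transformer` is connected both to the image `i1` and to the text chunk `t1`, which is the kind of cross-modal link that entity alignment is meant to produce.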

Cross-modal Hybrid Retrieval

Analyze the query's modality requirements.
Create a unified query embedding.
Run graph-based retrieval to capture entity/relation structure.
Run embedding-based retrieval to capture fine-grained semantic similarity.
Fuse scores from:
- graph signals
- embedding similarity
- query modality cues
Return cross-modal evidence.
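
The score-fusion step can be illustrated with a weighted sum. This is a sketch under assumptions rather than the method the framework actually uses: the weights, the keyword-based `modality_cues` heuristic, the binary graph signal, and the candidate dictionary layout are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def modality_cues(query):
    """Hypothetical keyword heuristic: which modalities does the query ask about?"""
    keywords = {
        "image": ("figure", "diagram", "image"),
        "table": ("table", "row", "column"),
        "equation": ("equation", "formula"),
    }
    q = query.lower()
    return {m for m, kws in keywords.items() if any(k in q for k in kws)}

def fuse(query, query_vec, candidates, graph_hits, weights=(0.4, 0.4, 0.2)):
    """Rank candidates by a weighted sum of graph, embedding, and modality scores."""
    w_graph, w_emb, w_mod = weights
    wanted = modality_cues(query)
    scored = []
    for c in candidates:
        g = 1.0 if c["id"] in graph_hits else 0.0       # found by graph retrieval
        e = cosine(query_vec, c["vec"])                 # embedding similarity
        m = 1.0 if c["modality"] in wanted else 0.0     # matches a query cue
        scored.append((w_graph * g + w_emb * e + w_mod * m, c["id"]))
    return [cid for _, cid in sorted(scored, reverse=True)]

candidates = [
    {"id": "t1", "modality": "text", "vec": [1.0, 0.0]},
    {"id": "tab1", "modality": "table", "vec": [0.6, 0.8]},
    {"id": "i1", "modality": "image", "vec": [0.0, 1.0]},
]
ranking = fuse("Which table reports accuracy?", [0.6, 0.8], candidates,
               graph_hits={"tab1", "t1"})
```

In this toy run the table ranks first because all three signals agree on it, while the image, which only has moderate embedding similarity, ranks last.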

Knowledge-Enhanced Generation

The generator receives both textual and visual/structured evidence. It synthesizes an answer using multimodal context rather than plain text chunks alone.
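
One way to picture the generator's input is a prompt that lists every evidence item with its modality, plus a separate list of images for a vision-capable model. The sketch below assumes a hypothetical evidence format (`modality`, `content`, `caption`, `path`) and a hypothetical `build_generation_input` helper; it does not call any model.

```python
def build_generation_input(question, evidence):
    """Return (prompt, image_paths) built from mixed-modality evidence."""
    lines, images = [], []
    for n, ev in enumerate(evidence, start=1):
        if ev["modality"] == "image":
            # Images travel as attachments; the prompt only references them.
            images.append(ev["path"])
            lines.append(f"[{n}] IMAGE (attachment {len(images)}): {ev['caption']}")
        else:
            # Text, tables, and equations are serialized inline.
            lines.append(f"[{n}] {ev['modality'].upper()}: {ev['content']}")
    prompt = (
        "Answer using the evidence below.\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}"
    )
    return prompt, images

evidence = [
    {"modality": "text", "content": "Accuracy improves with model size."},
    {"modality": "image", "path": "fig2.png", "caption": "Accuracy by model size"},
    {"modality": "table", "content": "size | acc\nsmall | 0.71\nlarge | 0.84"},
]
prompt, images = build_generation_input("How does size affect accuracy?", evidence)
```

Keeping the images out of the prompt string lets the same evidence list feed either a text-only model (captions only) or a multimodal model (captions plus attachments).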
