RAG-Anything addresses the problem that earlier RAG systems are largely text-centric. Their limitations include:
- text-only focus
- knowledge loss from ignoring images, tables, equations and slides
- poor parsing of PDFs, PPTs and complex layouts
- restricted applicability to real enterprise or academic documents
RAG-Anything proposes a unified framework for multimodal retrieval across text, visual elements, tables and math expressions.
Universal Indexing
- Multimodal canonicalization. We decompose documents into atomic units: text, images, tables, and equations.
- Graph construction. We build two graphs, a multimodal graph and a text graph:
  - Multimodal graph: anchors non-text elements to the surrounding text, preserving document layout and cross-modal relations.
  - Text graph: entities are extracted from textual chunks.
- We then merge the two graphs through entity alignment.
- We further encode entities, relations, chunks and multimodal elements into dense vectors.
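The indexing steps above can be sketched as a minimal, stdlib-only example. Everything here is an illustrative assumption rather than the RAG-Anything implementation: the `Unit` dataclass, the edge-tuple graph representation, the vocabulary-based entity matcher, and alignment by normalized name are all hypothetical stand-ins. The dense encoding step is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical atomic unit produced by document decomposition."""
    uid: str
    kind: str      # "text", "image", "table", or "equation"
    content: str   # raw text, caption, or serialized table/equation
    position: int  # reading order within the document


def build_multimodal_graph(units):
    """Anchor each non-text unit to its nearest preceding and following text unit."""
    ordered = sorted(units, key=lambda u: u.position)
    edges = set()
    for i, unit in enumerate(ordered):
        if unit.kind == "text":
            continue
        before = next((u for u in reversed(ordered[:i]) if u.kind == "text"), None)
        after = next((u for u in ordered[i + 1:] if u.kind == "text"), None)
        for neighbor in (before, after):
            if neighbor is not None:
                edges.add((unit.uid, "anchored_to", neighbor.uid))
    return edges


def build_text_graph(units, known_entities):
    """Toy entity extraction: case-insensitive match against a fixed vocabulary."""
    edges = set()
    for unit in units:
        if unit.kind != "text":
            continue
        for entity in known_entities:
            if entity.lower() in unit.content.lower():
                edges.add((entity, "mentioned_in", unit.uid))
    return edges


def merge_graphs(*edge_sets):
    """Naive alignment: nodes whose normalized names match collapse into one node."""
    def norm(node):
        return node.strip().lower()
    return {(norm(s), rel, norm(d)) for edges in edge_sets for s, rel, d in edges}


units = [
    Unit("t1", "text", "Figure 1 shows the Transformer encoder.", 0),
    Unit("img1", "image", "Diagram of the encoder stack", 1),
    Unit("t2", "text", "The encoder uses self-attention.", 2),
]
mm_graph = build_multimodal_graph(units)
text_graph = build_text_graph(units, ["Transformer", "self-attention"])
merged = merge_graphs(mm_graph, text_graph)
for edge in sorted(merged):
    print(edge)
```

In this sketch the two graphs connect because they share text-unit nodes (`t1`, `t2`); a real system would presumably also extract entities from the non-text elements themselves and align those by name.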
Cross-modal Hybrid Retrieval
- Analyze query modality requirements.
- Create a unified query embedding.
- Run graph-based retrieval to capture entity/relation structure.
- Run embedding-based retrieval to capture fine-grained semantic similarity.
- Fuse scores from:
  - graph signals
  - embedding similarity
  - query modality cues
- Return cross-modal evidence.
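One way to picture the fusion step is a weighted sum of the three signals. The sketch below is a hedged illustration, not the method from the paper: the keyword-based `detect_modalities` heuristic, the binary graph score, the candidate dictionary layout, and the weights `(0.4, 0.4, 0.2)` are all assumptions chosen for readability.

```python
import math
import re


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect_modalities(query):
    """Hypothetical keyword heuristic standing in for query modality analysis."""
    cues = {
        "image": {"figure", "diagram", "image"},
        "table": {"table", "row", "column"},
        "equation": {"equation", "formula"},
    }
    words = set(re.findall(r"[a-z]+", query.lower()))
    return {kind for kind, keywords in cues.items() if words & keywords}


def fuse(candidates, query, query_vec, graph_hits, weights=(0.4, 0.4, 0.2)):
    """Rank candidates by a weighted sum of graph, embedding, and modality scores."""
    w_graph, w_embed, w_mod = weights
    wanted = detect_modalities(query)
    scored = []
    for cand in candidates:
        graph_score = 1.0 if cand["id"] in graph_hits else 0.0   # graph signal
        embed_score = cosine(query_vec, cand["vec"])             # embedding similarity
        modality_score = 1.0 if cand["kind"] in wanted else 0.0  # query modality cue
        total = w_graph * graph_score + w_embed * embed_score + w_mod * modality_score
        scored.append((total, cand["id"]))
    return [cid for _, cid in sorted(scored, reverse=True)]


candidates = [
    {"id": "t1", "kind": "text", "vec": [1.0, 0.0]},
    {"id": "img1", "kind": "image", "vec": [0.6, 0.8]},
    {"id": "tab1", "kind": "table", "vec": [0.0, 1.0]},
]
ranking = fuse(
    candidates,
    query="Which diagram shows the encoder?",
    query_vec=[1.0, 0.0],
    graph_hits={"img1"},  # ids reached by graph-based retrieval (assumed given)
)
print(ranking)
```

Here the image outranks the text chunk even though its embedding is less similar to the query, because it is both reachable through the graph and matches the modality the query asks for.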
Knowledge-Enhanced Generation
The generator receives both textual and visual/structured evidence. It synthesizes an answer using multimodal context rather than plain text chunks alone.
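As a rough illustration of how mixed evidence might reach the generator, the sketch below formats retrieved items into a single prompt and tags each one with its modality. The `build_prompt` function and the evidence dictionary layout are hypothetical; an actual system might pass image data to a vision-language model directly instead of a textual description.

```python
def build_prompt(question, evidence):
    """Format retrieved evidence into one prompt, tagging each item by modality."""
    lines = ["Answer the question using only the evidence below.", ""]
    for i, item in enumerate(evidence, start=1):
        lines.append(f"[{i}] ({item['kind']}) {item['content']}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


evidence = [
    {"kind": "text", "content": "The encoder uses self-attention."},
    {"kind": "image", "content": "Diagram of the encoder stack"},
    {"kind": "table", "content": "layers | heads\n6 | 8"},
]
prompt = build_prompt("How many layers does the encoder have?", evidence)
print(prompt)
```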