RAG-Anything addresses the text-centric design of earlier RAG systems. Their limitations include:

  • text-only focus
  • knowledge loss from ignoring images, tables, equations and slides
  • poor parsing of PDFs, PPTs and complex layouts
  • restricted applicability to real enterprise or academic documents

RAG-Anything proposes a unified framework for multimodal retrieval across text, visual elements, tables and math expressions.

Universal Indexing

  1. Multimodal canonicalization. We decompose documents into atomic units: text, image, table, equation.
  2. Graph construction. We build two graphs, a multimodal graph and a text graph:
    • Multimodal graph: anchors non-text elements to their surrounding text, preserving document layout and cross-modal relations
    • Text graph: entities are extracted from textual chunks
  3. We then merge graphs through entity alignment.
  4. We further encode entities, relations, chunks and multimodal elements into dense vectors.
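
The first three indexing steps can be sketched in a few lines of Python. This is only a minimal illustration, not the RAG-Anything implementation: the Unit class, the edge labels ("mentioned_in", "depicted_in", "anchored_to") and the capitalized-word entity extractor are all assumptions made for the example, and a real system would use a document parser and an LLM or NER model instead. Step 4 (dense encoding) is left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One atomic unit of a document (hypothetical structure for this sketch)."""
    uid: str
    kind: str     # "text", "image", "table", or "equation"
    content: str  # raw text, or a caption/description for non-text units


def extract_entities(text):
    """Toy extractor: capitalized words stand in for a real NER/LLM step."""
    return {m.lower() for m in re.findall(r"\b[A-Z][a-zA-Z0-9-]+\b", text)}


def build_text_graph(units):
    """Link every entity to the text chunks that mention it."""
    edges = set()
    for u in units:
        if u.kind == "text":
            for entity in extract_entities(u.content):
                edges.add((entity, "mentioned_in", u.uid))
    return edges


def build_multimodal_graph(units):
    """Anchor each non-text unit to its neighbouring text units in reading order."""
    edges = set()
    for i, u in enumerate(units):
        if u.kind == "text":
            continue
        for j in (i - 1, i + 1):
            if 0 <= j < len(units) and units[j].kind == "text":
                edges.add((u.uid, "anchored_to", units[j].uid))
        # Entities found in the caption/description connect the unit to the text graph.
        for entity in extract_entities(u.content):
            edges.add((entity, "depicted_in", u.uid))
    return edges


def merge_graphs(text_edges, multimodal_edges):
    """Entity alignment by normalized name: equal names are the same node,
    so a set union of the edge lists merges the two graphs."""
    return text_edges | multimodal_edges


units = [
    Unit("t1", "text", "The Transformer encoder is shown below."),
    Unit("i1", "image", "Figure: Transformer architecture diagram"),
    Unit("t2", "text", "Results for BERT are in the table."),
]
graph = merge_graphs(build_text_graph(units), build_multimodal_graph(units))
```

After merging, the entity "transformer" is connected both to the text chunk t1 and to the image i1, which is what lets a later query about the Transformer reach the figure and not only the sentence.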

Cross-modal Hybrid Retrieval

  1. Analyze query modality requirements.
  2. Create a unified query embedding.
  3. Run graph-based retrieval to capture entity/relation structure.
  4. Run embedding-based retrieval to capture fine-grained semantic similarity.
  5. Fuse scores from:
    • graph signals
    • embedding similarity
    • query modality cues
  6. Return cross-modal evidence.
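
Step 5 can be illustrated with a small scoring function. The notes above do not say how the three signals are combined, so the weighted sum, the weight values and the candidate fields ("graph_score", "vec", "kind") below are assumptions chosen for the sketch rather than the method used by RAG-Anything.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuse(candidates, query_vec, wanted_kinds,
         w_graph=0.4, w_embed=0.4, w_modality=0.2):
    """Rank candidates by a weighted sum of three signals (weights are assumed)."""
    ranked = []
    for c in candidates:
        graph_signal = c["graph_score"]            # assumed to be scaled to [0, 1]
        embed_signal = cosine(query_vec, c["vec"])
        modality_signal = 1.0 if c["kind"] in wanted_kinds else 0.0
        score = (w_graph * graph_signal
                 + w_embed * embed_signal
                 + w_modality * modality_signal)
        ranked.append((score, c["id"]))
    ranked.sort(reverse=True)                      # highest fused score first
    return ranked


candidates = [
    {"id": "chunk1", "kind": "text", "graph_score": 0.5, "vec": [1.0, 0.0]},
    {"id": "tab1", "kind": "table", "graph_score": 0.5, "vec": [1.0, 0.0]},
]
# A query such as "which row of the table ..." would ask for the table modality.
ranking = fuse(candidates, query_vec=[1.0, 0.0], wanted_kinds={"table"})
```

With identical graph and embedding signals, the modality cue alone decides the order, so the table is ranked above the text chunk for a table-oriented query.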

Knowledge-Enhanced Generation

The generator receives both textual and visual/structured evidence. It synthesizes an answer using multimodal context rather than plain text chunks alone.
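
One simple way to picture this is prompt assembly over mixed evidence. The sketch below is an assumption-heavy illustration: the evidence dictionaries, the bracketed tags and the build_prompt function are hypothetical, and images are represented only by captions, whereas a vision-language generator could be given the images themselves.

```python
def render_evidence(item):
    """Turn one retrieved item into text, keeping its modality visible to the generator."""
    kind = item["kind"]
    if kind == "table":
        rows = [" | ".join(str(cell) for cell in row) for row in item["rows"]]
        return "[table] " + item.get("caption", "") + "\n" + "\n".join(rows)
    if kind == "image":
        return "[image] " + item["caption"]
    if kind == "equation":
        return "[equation] " + item["latex"]
    return "[text] " + item["content"]


def build_prompt(question, evidence):
    """Number the evidence items and place the question after them."""
    blocks = [f"({i}) {render_evidence(e)}" for i, e in enumerate(evidence, 1)]
    return "Evidence:\n" + "\n\n".join(blocks) + f"\n\nQuestion: {question}\nAnswer:"


evidence = [
    {"kind": "text", "content": "Model A was evaluated on the held-out set."},
    {"kind": "table", "caption": "Results", "rows": [["model", "acc"], ["A", 0.9]]},
]
prompt = build_prompt("What accuracy did model A reach?", evidence)
```

Keeping the table rows intact in the prompt is the point of the example: a pipeline that only passed plain text chunks would drop the one cell that answers the question.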
