Videos contain large amounts of knowledge, but traditional RAG indexes and retrieves only text. Feeding long videos directly into a multimodal LLM is expensive and often infeasible, since the sampled frames can exceed the model's context window.

Build a video knowledge base

The input is a list of videos. The deck emphasizes that there is no fixed limit on either the duration of an individual video or the number of videos.

Multimodal video knowledge indexing

For each video:

  1. Split the video into sub-clips.
  2. Use a VLM to extract visual knowledge from frames/clips.
  3. Use ASR to transcribe audio.
  4. For each sub-clip, produce:
    • Visual description.
    • Audio/text transcript.
    • Textual knowledge chunk.
  5. Construct a knowledge graph from extracted textual knowledge.
  6. Link information across multiple videos through entity-relation mappings.
  7. Encode multimodal embeddings separately for:
    • Text chunks.
    • Video clips.
  8. Build a hybrid index containing:
    • Multimodal embeddings.
    • Knowledge graphs.
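
The indexing steps above can be sketched roughly as follows. This is a minimal illustration, not the deck's implementation: the names `ClipRecord`, `split_into_clips`, `index_video`, and `build_graph` are hypothetical, the 30-second clip length is an arbitrary choice, and `describe`, `transcribe`, and `extract_triples` are placeholder callables standing in for a VLM, an ASR model, and an LLM-based triple extractor. Forming the knowledge chunk by concatenating the visual description and the transcript is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ClipRecord:
    video_id: str
    start: float  # seconds
    end: float    # seconds
    visual_description: str
    transcript: str

    @property
    def knowledge_chunk(self) -> str:
        # Assumption: the textual chunk is description + transcript.
        return f"{self.visual_description} {self.transcript}".strip()


def split_into_clips(duration: float, clip_len: float = 30.0):
    """Return (start, end) windows that cover [0, duration]."""
    windows = []
    start = 0.0
    while start < duration:
        windows.append((start, min(start + clip_len, duration)))
        start += clip_len
    return windows


def index_video(video_id, duration, describe, transcribe, clip_len=30.0):
    """Build one record per sub-clip.

    describe/transcribe are hypothetical callables wrapping a VLM and ASR;
    each takes (video_id, start, end) and returns a string.
    """
    return [
        ClipRecord(video_id, s, e, describe(video_id, s, e), transcribe(video_id, s, e))
        for s, e in split_into_clips(duration, clip_len)
    ]


def build_graph(records, extract_triples):
    """Merge (head, relation, tail) triples from all clips.

    Entities are keyed by name, so the same entity mentioned in two
    videos maps to clips from both, which links the videos together.
    """
    edges = defaultdict(set)         # triple -> {(video_id, start)}
    entity_clips = defaultdict(set)  # entity -> {(video_id, start)}
    for rec in records:
        clip_key = (rec.video_id, rec.start)
        for head, rel, tail in extract_triples(rec.knowledge_chunk):
            edges[(head, rel, tail)].add(clip_key)
            entity_clips[head].add(clip_key)
            entity_clips[tail].add(clip_key)
    return edges, entity_clips
```

Keying entities by name is the simplest way to get cross-video links in a sketch; a real system would likely need entity normalization or merging so that variant spellings resolve to the same node.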

Multimodal retrieval

Given a query:

  1. Perform query reformulation or keyword extraction.
  2. Retrieve video clips using:
    • Textual semantic matching.
    • Visual retrieval.
    • Graph-based clip retrieval.
  3. Filter retrieved clips according to query relevance.
  4. Use a VLM again to extract fine-grained, query-specific knowledge from the selected clips.
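
As a rough sketch of how steps 2 and 3 might combine the three retrieval channels: the code below takes the union of the top-k text matches, the top-k visual matches, and the clips reachable from query keywords in the graph, then applies a relevance filter. Everything here is illustrative. The function names, the plain-dict indexes, the use of cosine similarity, the union-based fusion, and the `is_relevant` callable (a stand-in for an LLM or VLM relevance judgment) are all assumptions rather than details from the deck.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k):
    """index maps clip_id -> embedding; return the k most similar clip ids."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]


def retrieve_clips(query_text_vec, query_visual_vec, keywords,
                   text_index, visual_index, entity_clips, k=5):
    """Union of candidates from the text, visual, and graph channels."""
    candidates = set(top_k(query_text_vec, text_index, k))        # textual semantic matching
    candidates |= set(top_k(query_visual_vec, visual_index, k))   # visual retrieval
    for keyword in keywords:                                      # graph-based clip retrieval
        candidates |= set(entity_clips.get(keyword, ()))
    return candidates


def filter_clips(candidates, is_relevant):
    """Keep only the clips that the (hypothetical) relevance judge accepts."""
    return sorted(c for c in candidates if is_relevant(c))
```

Because text chunks and video clips are embedded separately, the sketch assumes the query is also encoded twice, once per embedding space. The surviving clips would then be passed back to a VLM for the query-specific extraction in step 4.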

Response generation

The system concatenates:

  • Query-specific video knowledge.
  • Retrieved query-relevant text chunks.

Then it feeds this evidence into an LLM to generate the final answer.
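
One way the concatenation step could look in practice is a small prompt builder. The function name `build_prompt`, the numbered-evidence layout, and the instruction wording are all hypothetical; the deck only states that the two kinds of evidence are concatenated and passed to an LLM.

```python
def build_prompt(query, video_knowledge, text_chunks):
    """Concatenate both evidence sources and the query into one prompt."""
    # Query-specific video knowledge first, then retrieved text chunks.
    evidence = list(video_knowledge) + list(text_chunks)
    numbered = "\n".join(f"[{i}] {item}" for i, item in enumerate(evidence, start=1))
    return (
        "Answer the question using only the evidence below.\n\n"
        f"Evidence:\n{numbered}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

The returned string would be sent to whichever LLM produces the final answer; numbering the evidence items is a design choice that makes it easier to ask the model to cite which clip or chunk supports each claim.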
