VideoRAG

Videos carry a large amount of knowledge, but conventional RAG pipelines index and retrieve text only. Feeding long videos directly into a multimodal LLM is expensive and often infeasible.

Build a video knowledge base

The input is a list of videos. The deck emphasizes that neither the duration of an individual video nor the number of videos has a fixed limit.

Multimodal video knowledge indexing

For each video:

Split the video into sub-clips.
Use a VLM to extract visual knowledge from frames/clips.
Use ASR to transcribe audio.
For each sub-clip, produce:
- Visual description.
- Audio/text transcript.
- Textual knowledge chunk.
Construct a knowledge graph from extracted textual knowledge.
Link information across multiple videos through entity-relation mappings.
Encode multimodal embeddings separately for:
- Text chunks.
- Video clips.
Build a hybrid index containing:
- Multimodal embeddings.
- Knowledge graphs.
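
The indexing steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the deck's implementation: the VLM, ASR, triple extractor, and the two encoders are passed in as hypothetical callables (`describe`, `transcribe`, `extract_triples`, `embed_text`, `embed_clip`), the fixed 30-second window is an arbitrary choice, and linking entities across videos by lowercased name is a simplification.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable

Triple = tuple[str, str, str]  # (head entity, relation, tail entity)


@dataclass
class ClipRecord:
    video_id: str
    start: float
    end: float
    visual_description: str  # produced by the VLM
    transcript: str          # produced by ASR

    @property
    def clip_id(self) -> str:
        return f"{self.video_id}:{self.start:g}-{self.end:g}"

    @property
    def chunk(self) -> str:
        # Textual knowledge chunk: visual description plus transcript.
        return f"[visual] {self.visual_description}\n[speech] {self.transcript}"


@dataclass
class HybridIndex:
    clips: dict[str, ClipRecord] = field(default_factory=dict)
    text_vectors: dict[str, list[float]] = field(default_factory=dict)
    clip_vectors: dict[str, list[float]] = field(default_factory=dict)
    # Knowledge graph: entities and relations point back to the clips mentioning them.
    entity_to_clips: dict[str, set[str]] = field(default_factory=lambda: defaultdict(set))
    relation_to_clips: dict[Triple, set[str]] = field(default_factory=lambda: defaultdict(set))


def split_into_clips(duration: float, clip_len: float = 30.0) -> list[tuple[float, float]]:
    """Cover [0, duration] with consecutive windows of at most clip_len seconds."""
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    windows, start = [], 0.0
    while start < duration:
        end = min(start + clip_len, duration)
        windows.append((start, end))
        start = end
    return windows


def index_video(
    index: HybridIndex,
    video_id: str,
    duration: float,
    describe: Callable[[str, float, float], str],         # hypothetical VLM wrapper
    transcribe: Callable[[str, float, float], str],       # hypothetical ASR wrapper
    extract_triples: Callable[[str], list[Triple]],       # hypothetical LLM extractor
    embed_text: Callable[[str], list[float]],             # hypothetical text encoder
    embed_clip: Callable[[str, float, float], list[float]],  # hypothetical video encoder
    clip_len: float = 30.0,
) -> None:
    for start, end in split_into_clips(duration, clip_len):
        record = ClipRecord(
            video_id, start, end,
            describe(video_id, start, end),
            transcribe(video_id, start, end),
        )
        cid = record.clip_id
        index.clips[cid] = record
        # Text chunks and video clips are embedded separately.
        index.text_vectors[cid] = embed_text(record.chunk)
        index.clip_vectors[cid] = embed_clip(video_id, start, end)
        for head, relation, tail in extract_triples(record.chunk):
            # Assumption: entities with the same lowercased name are the same
            # node, which is what links clips from different videos.
            head, tail = head.strip().lower(), tail.strip().lower()
            index.entity_to_clips[head].add(cid)
            index.entity_to_clips[tail].add(cid)
            index.relation_to_clips[(head, relation, tail)].add(cid)
```

Calling `index_video` once per input video against the same `HybridIndex` accumulates a shared graph, so an entity mentioned in two videos maps to clips from both.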

Multimodal retrieval

Given a query:

Perform query reformulation or keyword extraction.
Retrieve video clips using:
- Textual semantic matching.
- Visual retrieval.
- Graph-based clip retrieval.
Filter retrieved clips according to query relevance.
Use a VLM again to extract fine-grained, query-specific knowledge from the selected clips.
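
One way to picture the three retrieval channels and the filtering step is the sketch below. It makes several assumptions the notes do not confirm: candidates from the channels are merged by simple set union, the query has already been embedded into both the text space and the clip space (`query_text_vec`, `query_visual_vec`), keywords are matched to graph entities by lowercased name, and `is_relevant` is a hypothetical stand-in for whatever relevance judge the system uses.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k vectors most similar to query_vec."""
    ranked = sorted(vectors, key=lambda cid: cosine(query_vec, vectors[cid]), reverse=True)
    return ranked[:k]


def graph_clips(keywords: list[str], entity_to_clips: dict[str, set[str]]) -> set[str]:
    """Collect clips attached to any graph entity named by a query keyword."""
    hits: set[str] = set()
    for keyword in keywords:
        hits |= entity_to_clips.get(keyword.strip().lower(), set())
    return hits


def retrieve(
    query_text_vec: list[float],
    query_visual_vec: list[float],
    keywords: list[str],
    text_vectors: dict[str, list[float]],
    clip_vectors: dict[str, list[float]],
    entity_to_clips: dict[str, set[str]],
    is_relevant: Callable[[str], bool],  # hypothetical relevance judge
    k: int = 3,
) -> list[str]:
    candidates = set(top_k(query_text_vec, text_vectors, k))     # textual semantic matching
    candidates |= set(top_k(query_visual_vec, clip_vectors, k))  # visual retrieval
    candidates |= graph_clips(keywords, entity_to_clips)         # graph-based retrieval
    # Assumption: union the channels, then drop clips the judge rejects.
    return sorted(cid for cid in candidates if is_relevant(cid))
```

The clips that survive filtering are the ones that would be handed back to the VLM for query-specific extraction.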

Response generation

The system concatenates:
- Query-specific video knowledge.
- Retrieved query-relevant text chunks.

Then it feeds this evidence into an LLM to generate the final answer.
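
The concatenation step can be illustrated with a short sketch. The prompt wording and section labels here are assumptions made for illustration, and `llm` is a hypothetical callable that takes a prompt string and returns the model's text; the notes do not specify either.

```python
from typing import Callable


def build_prompt(query: str, video_knowledge: list[str], text_chunks: list[str]) -> str:
    """Concatenate both evidence sources ahead of the question."""
    lines = [
        "Query-specific video knowledge:",
        *[f"- {item}" for item in video_knowledge],
        "",
        "Retrieved text chunks:",
        *[f"- {chunk}" for chunk in text_chunks],
        "",
        f"Question: {query}",
        "Answer using only the evidence above.",
    ]
    return "\n".join(lines)


def generate_answer(
    query: str,
    video_knowledge: list[str],
    text_chunks: list[str],
    llm: Callable[[str], str],  # hypothetical LLM client wrapper
) -> str:
    return llm(build_prompt(query, video_knowledge, text_chunks))
```

Keeping prompt assembly separate from the model call makes it easy to inspect exactly what evidence the LLM receives.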