Videos contain large amounts of knowledge, but traditional RAG focuses on text. Directly feeding long videos into multimodal LLMs is expensive and often infeasible.
Build a video knowledge base
The input is a list of videos. The deck emphasizes that there is no fixed limit on the duration of an individual video or on the number of videos.
Multimodal video knowledge indexing
For each video:
- Split the video into sub-clips.
- Use a VLM (vision-language model) to extract visual knowledge from frames/clips.
- Use ASR (automatic speech recognition) to transcribe audio.
- For each sub-clip, produce:
  - Visual description.
  - Audio/text transcript.
  - Textual knowledge chunk.
- Construct a knowledge graph from extracted textual knowledge.
- Link information across multiple videos through entity-relation mappings.
- Encode multimodal embeddings separately for:
  - Text chunks.
  - Video clips.
- Build a hybrid index containing:
  - Multimodal embeddings.
  - Knowledge graphs.
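The indexing steps above can be sketched roughly as follows. This is a minimal illustration, not the deck's implementation: the names `SubClip`, `HybridIndex`, `split_video`, and `build_index` are hypothetical, the fixed 30-second clip length is an assumption, and the VLM, ASR, entity extractor, and embedding models are passed in as plain callables. The knowledge graph is reduced here to an entity-to-clip mapping, which is enough to show how a shared entity links clips across videos.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SubClip:
    video_id: str
    start: float
    end: float
    visual_description: str = ""  # filled in by the VLM callable
    transcript: str = ""          # filled in by the ASR callable

    @property
    def clip_id(self) -> str:
        return f"{self.video_id}:{self.start:.0f}-{self.end:.0f}"

    @property
    def knowledge_chunk(self) -> str:
        # Assumption: the text chunk is the description plus the transcript.
        return f"{self.visual_description} {self.transcript}".strip()


@dataclass
class HybridIndex:
    clips: dict = field(default_factory=dict)
    text_vectors: dict = field(default_factory=dict)  # clip_id -> text embedding
    clip_vectors: dict = field(default_factory=dict)  # clip_id -> visual embedding
    # Simplified stand-in for a knowledge graph: entity -> set of clip ids.
    entity_to_clips: dict = field(default_factory=lambda: defaultdict(set))


def split_video(video_id, duration, clip_len=30.0):
    """Cut a video of `duration` seconds into fixed-length sub-clips."""
    clips, start = [], 0.0
    while start < duration:
        clips.append(SubClip(video_id, start, min(start + clip_len, duration)))
        start += clip_len
    return clips


def build_index(videos, describe, transcribe, extract_entities,
                embed_text, embed_clip):
    """videos is a list of (video_id, duration_seconds) pairs."""
    index = HybridIndex()
    for video_id, duration in videos:
        for clip in split_video(video_id, duration):
            clip.visual_description = describe(clip)
            clip.transcript = transcribe(clip)
            index.clips[clip.clip_id] = clip
            # Text and video embeddings are stored separately.
            index.text_vectors[clip.clip_id] = embed_text(clip.knowledge_chunk)
            index.clip_vectors[clip.clip_id] = embed_clip(clip)
            # The same entity key collects clips from every video.
            for entity in extract_entities(clip.knowledge_chunk):
                index.entity_to_clips[entity].add(clip.clip_id)
    return index
```

Keeping the models behind callables means the sketch makes no claim about which VLM, ASR, or embedding model the system uses. A fuller version would store entity-relation triples rather than bare entities.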
Multimodal retrieval
Given a query:
- Perform query reformulation or keyword extraction.
- Retrieve video clips using:
  - Textual semantic matching.
  - Visual retrieval.
  - Graph-based clip retrieval.
- Filter retrieved clips according to query relevance.
- Use a VLM again to extract fine-grained, query-specific knowledge from the selected clips.
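A rough sketch of the retrieval steps above, under several assumptions the notes do not confirm: the three retrieval channels are merged by taking the union of their candidates, similarity is cosine similarity over plain Python lists, and the graph channel is a keyword lookup in an entity-to-clip mapping. The function names (`retrieve`, `extract_query_knowledge`) and the `is_relevant` and `extract` callables are hypothetical placeholders for the relevance filter and the VLM.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, vectors, k):
    """Return the k clip ids whose vectors are most similar to query_vec."""
    ranked = sorted(vectors, key=lambda cid: cosine(query_vec, vectors[cid]),
                    reverse=True)
    return ranked[:k]


def retrieve(query_text_vec, query_visual_vec, keywords,
             text_vectors, clip_vectors, entity_to_clips,
             is_relevant, k=3):
    # Channel 1: textual semantic matching.
    candidates = set(top_k(query_text_vec, text_vectors, k))
    # Channel 2: visual retrieval.
    candidates |= set(top_k(query_visual_vec, clip_vectors, k))
    # Channel 3: graph-based retrieval via extracted keywords.
    for keyword in keywords:
        candidates |= entity_to_clips.get(keyword, set())
    # Keep only clips the relevance filter accepts.
    return sorted(cid for cid in candidates if is_relevant(cid))


def extract_query_knowledge(query, clip_ids, extract):
    """Second VLM pass: query-specific knowledge for each selected clip."""
    return [extract(query, cid) for cid in clip_ids]
```

Union-then-filter is only one way to combine the channels. A weighted score fusion would also fit the description, and the notes do not say which the system uses.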
Response generation
The system concatenates:
- Query-specific video knowledge.
- Retrieved query-relevant text chunks.
Then it feeds this evidence into an LLM to generate the final answer.
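The concatenation step might look like the sketch below. The prompt layout, the section labels, and the names `build_prompt` and `answer` are illustrative assumptions; `generate` stands in for whatever LLM call the system actually makes.

```python
def build_prompt(query, video_knowledge, text_chunks):
    """Concatenate both evidence sources ahead of the question."""
    lines = ["Video knowledge:"]
    lines += [f"- {item}" for item in video_knowledge]
    lines.append("Text chunks:")
    lines += [f"- {chunk}" for chunk in text_chunks]
    lines.append(f"Question: {query}")
    return "\n".join(lines)


def answer(query, video_knowledge, text_chunks, generate):
    # `generate` is a hypothetical callable wrapping the LLM.
    return generate(build_prompt(query, video_knowledge, text_chunks))
```

Labeling the two evidence sources separately lets the LLM distinguish clip-derived knowledge from retrieved text, although the notes do not specify any particular prompt format.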