VidChain: Video Intelligence RAG Framework

Edge-optimized multimodal RAG framework for video understanding — transforms raw footage into a structured, queryable knowledge base.

Overview

VidChain v0.2.0 is a lightweight, modular framework that combines computer vision, OCR, speech recognition, emotion analysis, and LLM reasoning into a unified late-fusion pipeline. It is designed to run on consumer-grade GPUs (tested on an NVIDIA RTX 3050 with 4GB of VRAM), so the perception pipeline runs entirely on-device. Routing the LLM to a local model through Ollama removes the remaining cloud dependency.

At its heart is B.A.B.U.R.A.O. (Behavioral Analysis & Broadcasting Unit for Real-time Artificial Observation), a conversational AI copilot that uses abductive reasoning to translate raw sensor logs into human-readable narratives.

Core Pipeline

Video → WAV Extraction → Whisper ASR → Frame Loop →
  ├── YOLO (Objects)
  ├── MobileNetV3 (Action)
  ├── EasyOCR (Screen Text)
  ├── DeepFace (Emotion, threaded)
  └── TemporalTracker (Object Persistence + Camera Motion)
→ Semantic Fusion → ChromaDB → B.A.B.U.R.A.O. RAG
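
The fusion step in the diagram above can be pictured as aligning each frame-level observation with whichever speech segment overlaps it in time. The following is a minimal sketch under stated assumptions: ASR segments are dictionaries with `start`, `end`, and `text` keys (the shape Whisper returns), while the `fuse` helper and the frame fields are hypothetical names chosen for illustration, not VidChain's internal API.

```python
from typing import Dict, List


def fuse(frames: List[Dict], segments: List[Dict]) -> List[Dict]:
    """Attach the overlapping speech text to each frame-level observation."""
    fused = []
    for frame in frames:
        t = frame["time"]
        # Use the first segment whose [start, end) window contains the frame time.
        audio = next(
            (s["text"].strip() for s in segments if s["start"] <= t < s["end"]),
            "",
        )
        fused.append({**frame, "audio": audio})
    return fused


# Hypothetical per-frame outputs from the vision engines.
frames = [
    {"time": 1.0, "objects": "1 person", "action": "NORMAL"},
    {"time": 5.8, "objects": "1 person, 1 laptop", "action": "SUSPICIOUS"},
]
# Speech segments in Whisper's start/end/text shape.
segments = [{"start": 5.0, "end": 8.2, "text": " I told you this would happen"}]

timeline = fuse(frames, segments)
print(timeline[1]["audio"])
```

Each modality is computed independently and merged only at the end, which is what makes this a late-fusion design: a failure or gap in one modality leaves the others intact.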

Key Capabilities

🧠 Dual-Brain Vision Engine

YOLO (Nouns): Detects objects with bounding boxes — "1 person, 1 laptop"
MobileNetV3 (Verbs): Classifies scene intent — NORMAL / SUSPICIOUS / VIOLENCE / EMERGENCY

🔤 Context-Aware OCR

EasyOCR runs only when YOLO detects a readable surface (laptop, monitor, whiteboard), which saves compute on frames with nothing to read while still capturing on-screen text verbatim.
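
The gating idea can be sketched in a few lines. This is an illustration only: the `READABLE_SURFACES` set, `should_run_ocr`, and `process_frame` are hypothetical names, and the actual label list depends on the class names the detector emits.

```python
from typing import Callable, Iterable

# Hypothetical label set; the real list depends on the detector's class names.
READABLE_SURFACES = {"laptop", "monitor", "whiteboard"}


def should_run_ocr(detected_labels: Iterable[str]) -> bool:
    """Return True when at least one detected object is likely to carry text."""
    return any(label.lower() in READABLE_SURFACES for label in detected_labels)


def process_frame(detected_labels: Iterable[str], run_ocr: Callable[[], str]) -> str:
    """Call the expensive OCR function only when the gate allows it."""
    if should_run_ocr(detected_labels):
        return run_ocr()
    return ""


calls = []


def fake_ocr() -> str:
    # Stand-in for a real OCR call; records how often it is invoked.
    calls.append(1)
    return "ASUS Vivobook"


text_a = process_frame(["person", "laptop"], fake_ocr)  # gate opens
text_b = process_frame(["person", "chair"], fake_ocr)   # gate stays closed
print(repr(text_a), repr(text_b), len(calls))
```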

😶 Threaded Emotion Analysis

DeepFace runs on CPU in a background thread so it never competes with YOLO/MobileNet for VRAM.
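
One common way to arrange this is a producer/consumer queue: the main frame loop hands frames to a worker thread and moves on. The sketch below assumes that pattern; `emotion_worker` and `fake_analyze` are hypothetical stand-ins, and the real emotion model call is replaced by a trivial function so the example runs without any model installed.

```python
import queue
import threading


def emotion_worker(frames: queue.Queue, results: dict, analyze) -> None:
    """Consume (timestamp, frame) pairs until a None sentinel arrives."""
    while True:
        item = frames.get()
        if item is None:
            break
        timestamp, frame = item
        try:
            results[timestamp] = analyze(frame)
        except Exception:
            # A failed analysis should not stop the rest of the pipeline.
            results[timestamp] = "unknown"


def fake_analyze(frame: int) -> str:
    # Stand-in for a CPU-bound emotion model call.
    return "neutral" if frame % 2 == 0 else "agitated"


frames = queue.Queue(maxsize=8)
results = {}
worker = threading.Thread(
    target=emotion_worker, args=(frames, results, fake_analyze), daemon=True
)
worker.start()

for i in range(4):
    frames.put((i * 0.5, i))  # the main loop stays free for GPU work
frames.put(None)  # sentinel: no more frames
worker.join()
print(results)
```

The bounded queue applies back-pressure: if emotion analysis falls behind, the producer blocks rather than accumulating frames in memory.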

📡 Temporal Tracking

Object Persistence: IoU tracker assigns persistent IDs across frames (person #1 present 12s, moving left)
Camera Motion: Lucas-Kanade optical flow detects pan, tilt, zoom, static
Scene Cut Detection: HSV histogram correlation resets trackers on hard cuts
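
To make the object-persistence idea concrete, here is a minimal greedy IoU tracker. It is a simplified sketch, not VidChain's `TemporalTracker`: the `IoUTracker` class, its 0.3 threshold, and the policy of dropping unmatched tracks immediately are all assumptions made for brevity.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class IoUTracker:
    def __init__(self, threshold: float = 0.3) -> None:
        self.threshold = threshold
        self.tracks: Dict[int, Box] = {}
        self.next_id = 1

    def update(self, detections: List[Box]) -> List[int]:
        """Match each detection to the best-overlapping previous track."""
        unmatched = dict(self.tracks)
        new_tracks: Dict[int, Box] = {}
        assigned: List[int] = []
        for box in detections:
            best_id, best_iou = None, self.threshold
            for track_id, prev in unmatched.items():
                score = iou(box, prev)
                if score >= best_iou:
                    best_id, best_iou = track_id, score
            if best_id is None:
                # No sufficiently overlapping track: start a new one.
                best_id = self.next_id
                self.next_id += 1
            else:
                del unmatched[best_id]
            new_tracks[best_id] = box
            assigned.append(best_id)
        self.tracks = new_tracks  # unmatched tracks are dropped in this sketch
        return assigned

    def reset(self) -> None:
        """Forget all tracks, e.g. after a hard scene cut."""
        self.tracks.clear()


tracker = IoUTracker()
ids_a = tracker.update([(10, 10, 50, 50)])      # new object
ids_b = tracker.update([(12, 10, 52, 50)])      # same object, shifted slightly
ids_c = tracker.update([(200, 200, 240, 240)])  # no overlap: new ID
tracker.reset()                                 # simulate a scene cut
ids_d = tracker.update([(200, 200, 240, 240)])  # same box, but tracks were cleared
print(ids_a, ids_b, ids_c, ids_d)
```

Resetting on a scene cut matters because boxes on either side of a cut can overlap by coincidence, which would otherwise carry an ID across unrelated shots.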

🗣️ B.A.B.U.R.A.O. RAG Engine

BGE embedder (BAAI/bge-base-en-v1.5) for dense semantic retrieval over the fused timeline
Cross-encoder reranker for precision before LLM call
Intent routing — distinguishes video search from conversational follow-ups
Chat memory — maintains context across multi-turn conversations
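
The retrieve-then-rerank step listed above can be sketched independently of any particular model. In this illustration, `rerank` and `keyword_overlap` are hypothetical helpers, and the scorer is injected as a function so the example runs without downloading weights; it is a sketch of the pattern, not VidChain's actual implementation.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank(
    query: str, candidates: List[str], score_pairs: Scorer, top_k: int = 3
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and keep the top_k best."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_k]]


def keyword_overlap(pairs: List[Tuple[str, str]]) -> List[int]:
    """Toy scorer: number of lowercase words shared by query and document."""
    return [len(set(q.lower().split()) & set(d.lower().split())) for q, d in pairs]


# Hypothetical candidates, as if returned by the first-stage embedding search.
candidates = [
    "camera pans left across empty hallway",
    "person enters room at 5.8s",
    "laptop on desk shows ASUS Vivobook",
]
top = rerank("did a person enter the room", candidates, keyword_overlap, top_k=2)
print(top[0])
```

With the sentence-transformers library installed, the `predict` method of a `CrossEncoder` loaded from `cross-encoder/ms-marco-MiniLM-L-6-v2` accepts the same list of (query, passage) pairs and could be passed as `score_pairs` in place of the toy scorer.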

Installation

pip install vidchain

# GPU-accelerated PyTorch (recommended)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 --force-reinstall

Run `python scripts/check_gpu.py` to verify that CUDA is detected.

Quick Start

Python API (Library)

from vidchain import VidChain

# Initialize
vc = VidChain(config={
    "llm_provider": "gemini/gemini-2.5-flash",  # or "ollama/llama3" for offline
    "db_path": "./vidchain_storage"              # omit for in-memory (no persistence)
})

# Ingest a video
video_id = vc.ingest("surveillance.mp4")

# Query
print(vc.ask("what happened in the video?"))
print(vc.ask("was anyone acting suspiciously?"))

# Multi-video: scope query to a specific video
vc.ingest("cam1.mp4", video_id="cam1")
vc.ingest("cam2.mp4", video_id="cam2")
print(vc.ask("did anyone enter the room?", video_id="cam1"))

CLI

# Analyze and chat
vidchain-analyze video.mp4

# Single-shot query
vidchain-analyze video.mp4 --query "what happened at the desk?"

# Offline with Ollama
vidchain-analyze video.mp4 --llm ollama/llama3

# Multilingual OCR
vidchain-analyze video.mp4 --ocr-lang en fr

Train Custom Action Engine

# Place labeled images in data/train/<class>/
vidchain-train
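
Before training, it can help to confirm that the `data/train/<class>/` layout is populated as expected. The helper below is a hypothetical pre-flight check, not part of VidChain; the list of image suffixes is an assumption, and the demo builds a throwaway directory so it runs anywhere.

```python
import tempfile
from pathlib import Path
from typing import Dict

# Assumed set of image extensions; adjust to match the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images_per_class(root: str) -> Dict[str, int]:
    """Count image files in each immediate subdirectory of root."""
    counts: Dict[str, int] = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for p in class_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


# Demo on a temporary directory that mimics data/train/<class>/.
with tempfile.TemporaryDirectory() as tmp:
    for cls, n in {"NORMAL": 3, "SUSPICIOUS": 1}.items():
        class_dir = Path(tmp) / cls
        class_dir.mkdir()
        for i in range(n):
            (class_dir / f"{i}.jpg").touch()
    counts = count_images_per_class(tmp)

print(counts)
```

A heavily imbalanced result is worth noticing before fine-tuning, since a classifier trained on it may rarely predict the minority classes.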

Knowledge Base Schema

Each fused timeline entry contains all modalities at that moment:

{
    "time": 5.8,
    "duration": 3.2,
    "objects": "1 person, 1 laptop",
    "action": "SUSPICIOUS",
    "emotion": "visibly agitated",
    "ocr": "ASUS Vivobook",
    "audio": "I told you this would happen",
    "camera": "static",
    "tracking": ["person #1 (present 4.8s), moving left", "laptop #2 (present 5.8s)"],
    "audio_anomaly": "NORMAL"
}
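
For retrieval, an entry like this has to become text that an embedder can consume. The sketch below shows one plausible flattening; the `entry_to_document` function, the field order, and the output format are assumptions for illustration and may differ from how VidChain serializes entries internally.

```python
from typing import Dict, List

# Scalar fields to include, in display order (an assumed ordering).
FIELDS: List[str] = [
    "objects", "action", "emotion", "ocr", "audio", "camera", "audio_anomaly",
]


def entry_to_document(entry: Dict) -> str:
    """Flatten one fused timeline entry into a single line of text."""
    header = f"[{entry['time']:.1f}s +{entry['duration']:.1f}s]"
    # Skip fields that are missing or empty so they add no noise.
    parts = [f"{name}: {entry[name]}" for name in FIELDS if entry.get(name)]
    tracking = entry.get("tracking") or []
    if tracking:
        parts.append("tracking: " + "; ".join(tracking))
    return header + " " + " | ".join(parts)


entry = {
    "time": 5.8,
    "duration": 3.2,
    "objects": "1 person, 1 laptop",
    "action": "SUSPICIOUS",
    "emotion": "visibly agitated",
    "ocr": "ASUS Vivobook",
    "audio": "I told you this would happen",
    "camera": "static",
    "tracking": ["person #1 (present 4.8s), moving left", "laptop #2 (present 5.8s)"],
    "audio_anomaly": "NORMAL",
}

doc = entry_to_document(entry)
print(doc)
```

Keeping `time` as separate metadata alongside the text would let a query be filtered by time range before any semantic matching takes place.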

Tech Stack

| Component | Technology |
| --- | --- |
| Object Detection | YOLOv8s (Ultralytics) |
| Action Classification | MobileNetV3 (custom fine-tuned) |
| Speech Recognition | OpenAI Whisper (base) |
| OCR | EasyOCR |
| Emotion Analysis | DeepFace (opencv backend) |
| Temporal Tracking | IoU tracker + Lucas-Kanade optical flow |
| Embedder | `BAAI/bge-base-en-v1.5` |
| Reranker | `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| Vector Store | ChromaDB (persistent) |
| LLM Routing | LiteLLM (`gemini-2.5-flash` default, Ollama supported) |
| Scene Understanding | CLIP (`openai/clip-vit-base-patch32`) |
| GPU Runtime | CUDA 12.1 (4GB+ VRAM, RTX 30-series tested) |

Developer Utilities

# List all indexed videos
vc.list_indexed_videos()

# Generate a narrative summary
vc.summarize_video(video_id, depth="concise")  # or "detailed"

# Hot-swap LLM
vc.set_llm("ollama/llama3")

# Purge a specific video
vc.purge_storage(video_id="cam1")

# Purge everything
vc.purge_storage()

Roadmap

CLIP scene understanding — zero-shot environment classification (v0.3.0)
Adaptive audio filtering — energy gating, anomaly detection, segment merging (v0.3.0)
Multi-video scoped queries — vc.ask(query, video_id="cam1") (v0.3.0)
Graceful degradation — every engine fails independently (v0.3.0)
Real-time streaming — live camera ingestion with low-latency indexing
Cross-video subject tracking — link the same person across multiple camera feeds
Export to CSV — structured timeline export for downstream analysis

Contributing

Contributions, issues, and feature requests are welcome. Open a GitHub issue or submit a pull request.

Author

Rahul Sharma — B.Tech CSE, IIIT Manipur

License

Distributed under the MIT License.
