Releases: rahulsiiitm/videochain-python
VidChain v0.2.0: The Dual-Brain & B.A.B.U.R.A.O. Update
This release marks a major milestone for my semester project: the official PyPI launch of VidChain v0.2.0.
What started as a video analysis script has been completely re-architected into a production-ready, edge-optimized multimodal RAG framework. This version introduces a new visual processing pipeline, stabilizes GPU memory management for consumer hardware (tested on RTX 3050), and debuts a highly intelligent conversational persona.
🔥 Academic & Engineering Highlights
- Official PyPI Launch: The framework is now packaged and globally available via `pip install VidChain`.
- Dual-Brain Vision Architecture: Moved away from a single classifier to a decoupled approach. YOLOv8s now extracts "Nouns" (objects/entities), while a custom fine-tuned MobileNetV3 extracts "Verbs" (action intents such as NORMAL, SUSPICIOUS, and VIOLENCE).
- Smart OCR Integration: Integrated EasyOCR with context-aware triggers. The system only scans for text when readable surfaces (e.g., laptops, books, screens) are detected by YOLO, significantly reducing compute overhead.
- Meet B.A.B.U.R.A.O.: Replaced the standard RAG log-reader with a conversational AI copilot named B.A.B.U.R.A.O. (Behavioral Analysis & Broadcasting Unit for Real-time Artificial Observation). It utilizes abductive reasoning to translate raw, flickering sensor data into natural, contextual narratives.
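The context-aware OCR trigger above can be sketched in a few lines. This is a minimal illustration, not VidChain's actual API: the label names and the trigger set are assumptions standing in for whatever classes the YOLO model reports.

```python
# Illustrative sketch of a context-aware OCR gate: EasyOCR is invoked
# only when YOLO has already reported a surface likely to carry text.
# The label set below is an assumption, not VidChain's real config.
READABLE_SURFACES = {"laptop", "book", "tv", "cell phone"}

def should_run_ocr(detections):
    """Return True only if a detected object is likely to hold readable text."""
    return any(label in READABLE_SURFACES for label in detections)

print(should_run_ocr(["person", "laptop"]))  # True  -> run EasyOCR
print(should_run_ocr(["person", "dog"]))     # False -> skip the OCR pass
```

Gating on the detector's output this way means the (comparatively expensive) OCR pass runs only on a small fraction of frames.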
Edge-First Hardware Optimizations
- Stable GPU Inference: Diagnosed and bypassed a critical mixed-precision (FP16/FP32) layer-fusion bug in PyTorch/Ultralytics, ensuring stable, native CUDA execution on consumer GPUs without memory crashes.
- Semantic Chunking: The pipeline now compresses identical, consecutive visual frames in the timeline, preserving FAISS vector space and reducing LLM context-window bloat.
- Dependency Stabilization: Pinned `pillow < 12.0` to resolve MoviePy fatal errors and unpinned strict `torch` dependencies to allow seamless local CUDA environment setups.
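The semantic-chunking idea described above can be shown with a small sketch: runs of identical consecutive frame descriptions collapse into one timestamped chunk. The tuple layout here is illustrative, not the framework's actual internal format.

```python
# Sketch of timeline compression: merge runs of identical consecutive
# frame descriptions into (start_ts, end_ts, description) chunks so the
# FAISS index and LLM context are not flooded with duplicates.
def compress_timeline(frames):
    """frames: iterable of (timestamp, description) pairs in order."""
    chunks = []
    for ts, desc in frames:
        if chunks and chunks[-1][2] == desc:
            chunks[-1][1] = ts          # extend the current chunk's end time
        else:
            chunks.append([ts, ts, desc])
    return [tuple(c) for c in chunks]

frames = [(0.0, "person"), (0.5, "person"), (1.0, "person, laptop")]
print(compress_timeline(frames))
# [(0.0, 0.5, 'person'), (1.0, 1.0, 'person, laptop')]
```

A video that is static for minutes at a time then contributes one chunk per scene rather than one entry per sampled frame.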
Getting Started
# 1. Install the framework
pip install VidChain
# 2. Install hardware-accelerated PyTorch (CUDA 12.1 example)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 --force-reinstall
# 3. Run analysis
vidchain-analyze your_video.mp4

v0.1.0-alpha: Core Multimodal Pipeline
VideoChain v0.1.0: Initial Multimodal Release
Overview
VideoChain is a lightweight, multimodal Retrieval-Augmented Generation (RAG) framework designed to transform unstructured video data into a searchable, structured knowledge base. This initial release establishes the core pipeline architecture for synchronizing visual entity detection with auditory transcriptions to enable high-fidelity reasoning via Local Large Language Models (LLMs).
Core Components and Features
1. Vision Engine (YOLOv8 Integration)
- Implemented real-time object detection using the Ultralytics YOLOv8 architecture.
- Features motion-gated sampling to optimize inference speed by processing only frames with significant temporal changes.
- Provides automated entity counting and descriptive labeling for the knowledge base.
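Motion-gated sampling, as described above, can be sketched as a simple frame-difference gate. Frames are modeled here as flat pixel lists for clarity; the threshold value and function names are illustrative, not VideoChain's actual implementation.

```python
# Sketch of motion-gated sampling: only frames whose mean absolute
# pixel difference from the last *processed* frame exceeds a threshold
# are passed on to the YOLO detector.
def motion_gate(frames, threshold=0.1):
    """frames: iterable of flat pixel lists; yields frames worth analyzing."""
    last = None
    for frame in frames:
        if last is None or (
            sum(abs(a - b) for a, b in zip(frame, last)) / len(frame) > threshold
        ):
            yield frame
            last = frame  # update the reference only when a frame passes

frames = [[0, 0, 0], [0, 0, 0], [10, 10, 10], [10, 10, 10]]
print(list(motion_gate(frames)))  # [[0, 0, 0], [10, 10, 10]]
```

Comparing against the last processed frame (rather than the immediately previous one) prevents slow drift from slipping under the threshold indefinitely.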
2. Audio Processing (Whisper & Librosa)
- Integrated OpenAI Whisper for robust speech-to-text transcription.
- Implemented acoustic feature extraction using Librosa to detect RMS energy spikes and volume fluctuations.
- Supports FP16 inference for optimized performance on NVIDIA hardware (RTX 3050 and similar).
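The RMS-energy spike detection above can be illustrated with a small standard-library sketch (the real pipeline uses Librosa; the frame length and spike factor here are placeholder values).

```python
import math

# Sketch of acoustic spike detection: split the signal into short frames,
# compute each frame's RMS energy, and flag frames well above the mean.
# frame_len and factor are illustrative defaults, not VideoChain's values.
def rms_spikes(samples, frame_len=4, factor=2.0):
    """Return indices of frames whose RMS exceeds factor x the mean RMS."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    rms = [math.sqrt(sum(x * x for x in f) / len(f)) for f in frames]
    mean = sum(rms) / len(rms)
    return [i for i, r in enumerate(rms) if r > factor * mean]

quiet_then_loud = [0, 0, 0, 0, 0, 0, 0, 0, 5, 5, 5, 5]
print(rms_spikes(quiet_then_loud))  # [2] -> only the loud frame is flagged
```

Flagged frame indices map back to timestamps, so volume spikes land on the same timeline as the visual events.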
3. Temporal-Spatial Fusion Engine
- Engineered a centralized fusion engine to synchronize vision and audio streams into a unified chronological timeline.
- Outputs a standardized `knowledge_base.json` that serves as the ground-truth context for the RAG engine.
- Resolved critical data-mapping issues related to timestamp synchronization and key-value consistency.
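The fusion step above can be sketched as a merge of per-stream events into one chronologically sorted timeline with a consistent key schema. The field names are illustrative assumptions, not the actual `knowledge_base.json` schema.

```python
# Sketch of the temporal-spatial fusion: tag each event with its source
# stream, merge, and sort by timestamp so every entry shares one schema.
# The keys below are illustrative, not VideoChain's real JSON layout.
def fuse(vision_events, audio_events):
    """Each input: iterable of (timestamp, description) pairs."""
    timeline = (
        [{"timestamp": t, "source": "vision", "description": d}
         for t, d in vision_events]
        + [{"timestamp": t, "source": "audio", "description": d}
           for t, d in audio_events]
    )
    timeline.sort(key=lambda e: e["timestamp"])
    return {"timeline": timeline}

kb = fuse([(1.0, "person enters"), (3.0, "laptop detected")],
          [(2.0, "speech: 'hello'")])
print([e["timestamp"] for e in kb["timeline"]])  # [1.0, 2.0, 3.0]
```

Keeping every entry under the same keys is exactly what prevents the key-value inconsistency issues the release notes mention.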
4. Retrieval-Augmented Generation (RAG)
- Established a communication bridge with the Ollama inference server.
- Default support for Llama 3.2, providing local, privacy-focused reasoning over the extracted video data.
- Optimized prompt templates for security-focused analysis and general scene understanding.
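A prompt template for this kind of timeline-grounded RAG query might look like the sketch below. The wording and formatting are assumptions for illustration; they are not VideoChain's actual templates.

```python
# Sketch of a RAG prompt builder: retrieved timeline entries become the
# grounding context, followed by the user's question. The template text
# is an illustrative assumption, not the framework's shipped prompt.
def build_prompt(question, events):
    """events: dicts with 'timestamp' (seconds) and 'description' keys."""
    context = "\n".join(
        f"[{e['timestamp']:.1f}s] {e['description']}" for e in events
    )
    return (
        "You are a video analysis assistant. Using only the timeline "
        "below, answer the question.\n\n"
        f"Timeline:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt("Who entered the room?",
                      [{"timestamp": 1.0, "description": "person enters"}])
print(prompt)
```

The assembled string would then be sent to the local Ollama server for the model (e.g. Llama 3.2) to complete.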
Technical Infrastructure
- Distribution: Full PEP 517 compliance for distribution via the Python Package Index (PyPI).
- Dependency Management: Automated resolution for OpenCV, Torch, Ultralytics, and Whisper.
- Dynamic Pathing: Implemented system-agnostic binary resolution for FFmpeg, ensuring cross-platform compatibility without hardcoded local paths.
Installation
pip install videochain

Requirements
- Python 3.11 or higher
- FFmpeg (must be available in system PATH)
- Ollama (running Llama 3.2 or compatible models)