Releases: rahulsiiitm/videochain-python

VidChain v0.2.0: The Dual-Brain & B.A.B.U.R.A.O. Update

04 Apr 05:40

This release marks a major milestone for my semester project: the official PyPI launch of VidChain v0.2.0.

What started as a video analysis script has been completely re-architected into a production-ready, edge-optimized multimodal RAG framework. This version introduces a new visual processing pipeline, stabilizes GPU memory management on consumer hardware (tested on an RTX 3050), and debuts a conversational analysis persona.

🔥 Academic & Engineering Highlights

  • Official PyPI Launch: The framework is now packaged and globally available via pip install VidChain.
  • Dual-Brain Vision Architecture: Moved away from a single classifier to a decoupled approach. YOLOv8s now extracts "Nouns" (objects/entities), while a custom fine-tuned MobileNetV3 extracts "Verbs" (action intents like NORMAL, SUSPICIOUS, VIOLENCE).
  • Smart OCR Integration: Integrated EasyOCR with context-aware triggers. The system scans for text only when YOLO detects readable surfaces (e.g., laptops, books, screens), significantly reducing compute overhead. A combined sketch of the dual-brain pass and this OCR gate follows this list.
  • Meet B.A.B.U.R.A.O.: Replaced the standard RAG log-reader with a conversational AI copilot named B.A.B.U.R.A.O. (Behavioral Analysis & Broadcasting Unit for Real-time Artificial Observation). It utilizes abductive reasoning to translate raw, flickering sensor data into natural, contextual narratives.
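
A minimal sketch of how the dual-brain frame pass and the gated OCR could fit together, assuming the Ultralytics YOLO and EasyOCR APIs. The model paths, the READABLE_SURFACES trigger set, and the classify_action() helper for the MobileNetV3 head are illustrative placeholders, not VidChain's exact internals.

import easyocr
from ultralytics import YOLO

detector = YOLO("yolov8s.pt")                 # "Nouns": objects/entities
reader = easyocr.Reader(["en"], gpu=True)     # OCR engine, invoked only on demand
READABLE_SURFACES = {"laptop", "book", "tv", "cell phone"}  # assumed trigger classes

def analyze_frame(frame, action_model):
    result = detector(frame, verbose=False)[0]
    nouns = [detector.names[int(c)] for c in result.boxes.cls]

    # "Verbs": a separate fine-tuned MobileNetV3 head scores action intent
    verb = classify_action(action_model, frame)  # hypothetical helper -> "NORMAL" etc.

    # Context-aware OCR: pay the OCR cost only when a readable surface is present
    text = []
    if READABLE_SURFACES & set(nouns):
        text = [t for (_box, t, conf) in reader.readtext(frame) if conf > 0.4]

    return {"objects": nouns, "action": verb, "text": text}

Keeping the object detector and the action classifier decoupled lets each model stay small enough for edge GPUs, and gating OCR on detected surfaces avoids running EasyOCR on every frame.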

Edge-First Hardware Optimizations

  • Stable GPU Inference: Diagnosed and bypassed a critical mixed-precision (FP16/FP32) layer-fusion bug in PyTorch/Ultralytics, ensuring stable, native CUDA execution on consumer GPUs without memory crashes.
  • Semantic Chunking: The pipeline now compresses identical, consecutive visual frames in the timeline, preserving FAISS vector space and reducing LLM context-window bloat (see the sketch after this list).
  • Dependency Stabilization: Pinned pillow < 12.0 to resolve MoviePy fatal errors and unpinned strict torch dependencies to allow seamless local CUDA environment setups.
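
A minimal sketch of the consecutive-frame compression idea, assuming each timeline entry is a dict with "time" and "description" keys; the field names and the chunk_timeline() helper are illustrative.

def chunk_timeline(events):
    """Collapse runs of identical consecutive descriptions into one chunk."""
    chunks = []
    for event in events:
        if chunks and chunks[-1]["description"] == event["description"]:
            chunks[-1]["end"] = event["time"]   # extend the current run
        else:
            chunks.append({"start": event["time"], "end": event["time"],
                           "description": event["description"]})
    return chunks

# e.g. 120 frames of "person at desk" collapse into one chunk spanning its
# start and end times, so FAISS stores one vector instead of 120 near-duplicates.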

Getting Started

# 1. Install the framework
pip install VidChain

# 2. Install hardware-accelerated PyTorch (CUDA 12.1 example)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 --force-reinstall

# 3. Run analysis
vidchain-analyze your_video.mp4

v0.1.0-alpha: Core Multimodal Pipeline

02 Apr 18:24

Pre-release

VideoChain v0.1.0: Initial Multimodal Release

Overview

VideoChain is a lightweight, multimodal Retrieval-Augmented Generation (RAG) framework that transforms unstructured video data into a searchable, structured knowledge base. This initial release establishes the core pipeline architecture for synchronizing visual entity detection with audio transcription, enabling high-fidelity reasoning via local large language models (LLMs).


Core Components and Features

1. Vision Engine (YOLOv8 Integration)

  • Implemented real-time object detection using the Ultralytics YOLOv8 architecture.
  • Features motion-gated sampling to optimize inference speed by processing only frames with significant temporal changes (see the sketch after this list).
  • Provides automated entity counting and descriptive labeling for the knowledge base.
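
A minimal sketch of what motion-gated sampling can look like with OpenCV; the grayscale mean-difference gate and its threshold are illustrative choices, not necessarily VideoChain's exact heuristic.

import cv2

def motion_gated_frames(video_path, threshold=12.0):
    cap = cv2.VideoCapture(video_path)
    prev = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Only yield frames whose mean pixel change exceeds the gate
        if prev is None or cv2.absdiff(gray, prev).mean() > threshold:
            yield frame
        prev = gray
    cap.release()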

2. Audio Processing (Whisper & Librosa)

  • Integrated OpenAI Whisper for robust speech-to-text transcription.
  • Implemented acoustic feature extraction using Librosa to detect RMS energy spikes and volume fluctuations (see the sketch after this list).
  • Supports FP16 inference for optimized performance on NVIDIA hardware (RTX 3050 and similar).
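
A minimal sketch of RMS-spike detection with Librosa; the two-standard-deviation spike rule is an illustrative choice.

import librosa
import numpy as np

def find_energy_spikes(audio_path):
    y, sr = librosa.load(audio_path, sr=None)
    rms = librosa.feature.rms(y=y)[0]            # frame-wise RMS energy
    times = librosa.frames_to_time(np.arange(len(rms)), sr=sr)
    spikes = rms > rms.mean() + 2 * rms.std()    # flag loud outlier frames
    return times[spikes]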

3. Temporal-Spatial Fusion Engine

  • Engineered a centralized fusion engine to synchronize vision and audio streams into a unified chronological timeline.
  • Outputs a standardized knowledge_base.json that serves as the ground-truth context for the RAG engine (see the sketch after this list).
  • Resolved critical data-mapping issues related to timestamp synchronization and key-value consistency.
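
A minimal sketch of the fusion step, assuming vision and audio events each carry a float "time" in seconds; the keys and the fuse() helper are illustrative.

import json

def fuse(vision_events, audio_events, out_path="knowledge_base.json"):
    # Tag each event with its modality, then merge both streams by timestamp
    timeline = sorted(
        [{"modality": "vision", **e} for e in vision_events] +
        [{"modality": "audio", **e} for e in audio_events],
        key=lambda e: e["time"],
    )
    with open(out_path, "w") as f:
        json.dump({"timeline": timeline}, f, indent=2)
    return timeline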

4. Retrieval-Augmented Generation (RAG)

  • Established a communication bridge with the Ollama inference server.
  • Default support for Llama 3.2, providing local, privacy-focused reasoning over the extracted video data.
  • Optimized prompt templates for security-focused analysis and general scene understanding (see the sketch after this list).
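
A minimal sketch of such a bridge using Ollama's standard REST endpoint (/api/generate); the prompt wording and the ask() helper are illustrative, not VideoChain's exact templates.

import json
import requests

def ask(question, context, model="llama3.2"):
    # Ground the model in the extracted timeline rather than raw video
    prompt = (
        "You are a video-security analyst. Using only the timeline below, "
        "answer the question.\n\n"
        f"TIMELINE:\n{json.dumps(context)}\n\nQ: {question}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    return resp.json()["response"]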

Technical Infrastructure

  • Distribution: Fully PEP 517-compliant packaging for release on the Python Package Index (PyPI).
  • Dependency Management: Automated resolution for OpenCV, Torch, Ultralytics, and Whisper.
  • Dynamic Pathing: Implemented system-agnostic binary resolution for FFmpeg, ensuring cross-platform compatibility without hardcoded local paths (see the sketch below).
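
A minimal sketch of system-agnostic binary resolution using only the standard library; raising on a missing binary is an illustrative policy.

import shutil

def resolve_ffmpeg():
    path = shutil.which("ffmpeg")    # searches the system PATH on any OS
    if path is None:
        raise RuntimeError("FFmpeg not found on PATH; install it first.")
    return path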

Installation

pip install videochain

Requirements

  • Python 3.11 or higher
  • FFmpeg (must be available in system PATH)
  • Ollama (running Llama 3.2 or compatible models)