This repository contains an audio-visual event recognition pipeline built on:
- EPIC-SOUNDS audio event detection
- prompted visual detection with YOLO-World + FastSAM
- simple event rules defined in JSON recipe files
The main pipeline lives in AV_EventRecognition/run_av_event_pipeline.py.
This repository is intended to publish source code, recipes, and environment files only.
Not committed to Git:
- model checkpoints and weights (*.pt, *.pyth, *.pth, *.ckpt, *.ts)
- EPIC-KITCHENS videos, extracted audio, HDF5 files, and other dataset artifacts
- generated pipeline outputs, logs, archives, and local editor files
AudioInceptionNeXt is tracked as a Git submodule. Clone with submodules when setting up a new machine:
git clone --recurse-submodules https://github.com/<your-user>/<your-repo>.git
See docs/PUBLISHING.md for the final publish checklist and Git commands.
You need these inputs before running the pipeline:
- A video file
  - Example: path\to\videos\P01_01.MP4
- An HDF5 file created from extracted audio
  - Example: path\to\audio\wav_to_pickled.hdf5
- An event recipe JSON file
  - Example: AV_EventRecognition/recipes/cutting_cucumber.json
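A quick check before a long run can catch a wrong path early. The sketch below is not part of the repository: `missing_inputs` is a hypothetical helper, and it only tests that the three input files exist, not that their contents are valid.

```python
from pathlib import Path


def missing_inputs(video, audio_hdf5, recipe_json):
    """Return the names of required inputs whose files do not exist."""
    required = {
        "video": Path(video),
        "audio_hdf5": Path(audio_hdf5),
        "recipe_json": Path(recipe_json),
    }
    return [name for name, path in required.items() if not path.is_file()]


if __name__ == "__main__":
    # Placeholder paths from this README; replace them with real locations.
    missing = missing_inputs(
        "path/to/videos/P01_01.MP4",
        "path/to/audio/wav_to_pickled.hdf5",
        "AV_EventRecognition/recipes/cutting_cucumber.json",
    )
    print("missing inputs:", missing or "none")
```

An empty result means all three files were found at the given paths.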
You also need these model/config files. They are intentionally not committed to this repository:
- Audio inference script: epic-sounds-annotations/src/tools/infer_hdf5_video.py
- Audio config: epic-sounds-annotations/src/configs/EPIC-Sounds/ssast/SSAST_b_vit_p16.yaml
- Audio checkpoint: SSAST_EPIC_SOUNDS.pyth
  - Download link: SSAST_EPIC_SOUNDS.pyth
- Visual inference script: ApplyYOLOE/predict_yoloe_seg_video.py
- Visual model file names:
  - yolov8l-world.pt or another YOLO-World checkpoint name. Ultralytics can download these automatically.
  - FastSAM-x.pt or FastSAM-s.pt. Ultralytics can download these automatically.
The current AV pipeline defaults use the fastsam-world backend, so the main visual weights you need are:
- a YOLO-World checkpoint name
- a FastSAM checkpoint name
Create the conda environment with the following commands:
conda env create -f environment_avod_py171.yml
conda activate avod_py171
Run everything from the avod_py171 environment.
Example:
conda run -n avod_py171 python ...
Audio extraction and HDF5 creation should be done before this pipeline.
- Clone the auditory-slow-fast repository, which provides the audio_extraction scripts used in the steps below
git clone https://github.com/ekazakos/auditory-slow-fast.git
- Make sure you are in the avod_py171 environment
conda activate avod_py171
- Extract audio from the videos by running:
python audio_extraction/extract_audio.py /path/to/video /output/path
- Save audio in HDF5 format by running:
python audio_extraction/wav_to_hdf5.py /path/to/extracted/audio /output/hdf5/wav_to_hdf5_audio.hdf5
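The two preparation steps can also be chained from Python so that the second step only runs if the first succeeds. This is a sketch under assumptions: it takes the positional arguments exactly as shown in the commands above, assumes the extracted-audio directory of step one is the input of step two, and `repo_dir` is a hypothetical parameter pointing at the cloned auditory-slow-fast checkout.

```python
import subprocess
import sys


def audio_prep_commands(video_dir, audio_dir, hdf5_path):
    """Build the two audio preparation commands as argument lists."""
    extract = [sys.executable, "audio_extraction/extract_audio.py", video_dir, audio_dir]
    to_hdf5 = [sys.executable, "audio_extraction/wav_to_hdf5.py", audio_dir, hdf5_path]
    return [extract, to_hdf5]


def run_audio_prep(video_dir, audio_dir, hdf5_path, repo_dir="."):
    """Run both steps in order; check=True stops at the first failing step."""
    for cmd in audio_prep_commands(video_dir, audio_dir, hdf5_path):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Using `sys.executable` keeps both steps in whichever interpreter started the script, so launching it from the avod_py171 environment runs the steps there too.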
The pipeline command below will:
- generate the audio CSV from the HDF5 file
- find the audio event from the recipe
- extract the matching video clip
- run visual detection
- apply the AV event rule
Example:
conda run -n avod_py171 python AV_EventRecognition/run_av_event_pipeline.py `
--video "path\to\videos\P01_01.MP4" `
--audio-csv "P01_01_ssast_ann_autogen.csv" `
--audio-output-summary-csv "P01_01_ssast_ann_autogen_summary.csv" `
--auto-generate-audio-csv `
--audio-cfg "epic-sounds-annotations\src\configs\EPIC-Sounds\ssast\SSAST_b_vit_p16.yaml" `
--audio-checkpoint "SSAST_EPIC_SOUNDS.pyth" `
--audio-hdf5 "path\to\audio\wav_to_pickled.hdf5" `
--audio-video-id "P01_01" `
--audio-window-mode annotation_regions `
--audio-annotations-file "epic-sounds-annotations\EPIC_Sounds_train.csv" `
--recipe-json "AV_EventRecognition\recipes\cutting_cucumber.json" `
--output-dir "AV_EventRecognition\outputs\cutting_cucumber" `
--max-clips 1
This command uses placeholder paths so it can be adapted after cloning the repository:
conda run -n avod_py171 python AV_EventRecognition\run_av_event_pipeline.py `
--video "path\to\videos\P01_01.MP4" `
--audio-csv "AV_EventRecognition\outputs\cutting_cucumber\P01_01_ssast_ann_autogen.csv" `
--audio-output-summary-csv "AV_EventRecognition\outputs\cutting_cucumber\P01_01_ssast_ann_autogen_summary.csv" `
--auto-generate-audio-csv `
--audio-checkpoint "path\to\models\SSAST_EPIC_SOUNDS.pyth" `
--audio-hdf5 "path\to\audio\wav_to_pickled.hdf5" `
--audio-video-id "P01_01" `
--audio-window-mode annotation_regions `
--audio-annotations-file "epic-sounds-annotations\EPIC_Sounds_train.csv" `
--recipe-json "AV_EventRecognition\recipes\cutting_cucumber.json" `
--output-dir "AV_EventRecognition\outputs\cutting_cucumber" `
--model "path\to\models\yoloe-11s-seg.pt" `
--world-model "path\to\models\yolov8m-world.pt" `
--fastsam-model "path\to\models\FastSAM-s.pt" `
--max-clips 1
The pipeline writes results into the output directory you provide.
Typical output structure:
clips/
rendered/
predictions/
reports/
summary.json
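After a run, the output directory can be checked against the structure listed above. The sketch assumes the first four entries are directories and that the reports are JSON files inside reports/. `inspect_output_dir` is a hypothetical helper, and because the schema of summary.json is not described here, the file is only parsed, not interpreted.

```python
import json
from pathlib import Path

EXPECTED_DIRS = ("clips", "rendered", "predictions", "reports")


def inspect_output_dir(output_dir):
    """Return (status, summary, reports) for a pipeline output directory."""
    root = Path(output_dir)
    # True/False per expected entry, so a partial run is easy to spot.
    status = {name: (root / name).is_dir() for name in EXPECTED_DIRS}
    summary_path = root / "summary.json"
    status["summary.json"] = summary_path.is_file()
    summary = None
    if status["summary.json"]:
        summary = json.loads(summary_path.read_text(encoding="utf-8"))
    reports = []
    if status["reports"]:
        reports = sorted(p.name for p in (root / "reports").glob("*.json"))
    return status, summary, reports
```

For example, `inspect_output_dir("AV_EventRecognition/outputs/cutting_cucumber")` returns which entries exist, the parsed summary (or `None`), and the sorted report file names.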
Each event is defined by a JSON recipe file in AV_EventRecognition/recipes.
A recipe tells the pipeline:
- which audio label to look for
- what final event name to assign
- which visual prompts to run
- which visual classes must be present
The visual prompts in the recipe are used with the YOLO-World + FastSAM detection stage.
Example:
{
  "audio_label": "cut / chop",
  "positive_event_name": "chopping cucumber",
  "fallback_event_name": "other chopping event",
  "prompts": ["cutting board", "knife", "cucumber", "food other than cucumber"],
  "must_have": [
    ["cutting board", "chopping board"],
    ["knife"],
    ["cucumber"]
  ],
  "conf": 0.01,
  "distance_threshold_ratio": 0.06,
  "min_support_frames": 3,
  "clip_pad_before": 1.0,
  "clip_pad_after": 1.0
}
Main recipe fields:
- audio_label
- positive_event_name
- fallback_event_name
- prompts
- must_have
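A recipe can be sanity-checked before it is passed to the pipeline. The sketch below treats the five fields named above as required, which is an assumption rather than documented pipeline behaviour. `recipe_problems` and `load_recipe` are hypothetical helpers, and the numeric fields (conf, distance_threshold_ratio, and so on) are deliberately left unchecked.

```python
import json

REQUIRED_KEYS = (
    "audio_label",
    "positive_event_name",
    "fallback_event_name",
    "prompts",
    "must_have",
)


def recipe_problems(recipe):
    """Return a list of problems; an empty list means the basic shape is fine."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in recipe]
    prompts = recipe.get("prompts", [])
    if not (isinstance(prompts, list) and all(isinstance(p, str) for p in prompts)):
        problems.append("prompts must be a list of strings")
    groups = recipe.get("must_have", [])
    if not isinstance(groups, list):
        problems.append("must_have must be a list of groups")
    else:
        for i, group in enumerate(groups):
            ok = isinstance(group, list) and group and all(isinstance(n, str) for n in group)
            if not ok:
                problems.append(f"must_have[{i}] must be a non-empty list of strings")
    return problems


def load_recipe(path):
    """Load a recipe JSON file and raise ValueError if its shape looks wrong."""
    with open(path, encoding="utf-8") as handle:
        recipe = json.load(handle)
    problems = recipe_problems(recipe)
    if problems:
        raise ValueError("; ".join(problems))
    return recipe
```

Calling `load_recipe("AV_EventRecognition/recipes/cutting_cucumber.json")` would then fail fast on a misspelled key instead of partway through a run.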
Each entry in must_have is one required class group.
Examples:
- ["knife"]
- ["cutting board", "chopping board"]
- ["food", "cucumber"]
This means:
- one detection from each group must be found
- if a group has multiple names, any one of them can satisfy that group
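The group rule above can be sketched in a few lines. This is an illustration of the matching logic only, not the pipeline's own code: `unmet_groups` is a hypothetical helper, the case-insensitive comparison is an assumption, and the other recipe settings (conf, distance_threshold_ratio, min_support_frames) are not modelled.

```python
def unmet_groups(detected_labels, must_have):
    """Return the must_have groups for which no listed name was detected."""
    detected = {label.strip().lower() for label in detected_labels}
    return [
        group
        for group in must_have
        if not any(name.strip().lower() in detected for name in group)
    ]


must_have = [["cutting board", "chopping board"], ["knife"], ["cucumber"]]

# "chopping board" satisfies the first group, so all three groups are met.
print(unmet_groups(["chopping board", "knife", "cucumber"], must_have))  # []

# No cucumber detection: the third group is reported as unmet.
print(unmet_groups(["cutting board", "knife"], must_have))  # [['cucumber']]
```

An empty result corresponds to the case where every required group is present.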
Example recipes:
- AV_EventRecognition/recipes/cutting_cucumber.json
- AV_EventRecognition/recipes/chopping_celery.json
- AV_EventRecognition/recipes/washing_carrot.json
- AV_EventRecognition/recipes/washing_cucumber.json
To visualize one report as an evidence video:
conda run -n avod_py171 python AV_EventRecognition/visualize_event_evidence.py `
--report-json "AV_EventRecognition\outputs\cutting_cucumber\reports\hit_00031_0108_277_0109_074.json"
Disclaimer: Parts of this repository, including code, documentation, and workflow setup, were developed with AI assistance using Codex.
No license file has been selected yet. Add a license before making the repository public if you want others to reuse, modify, or redistribute the code.