diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
new file mode 100644
index 0000000..b2b9a8f
--- /dev/null
+++ b/ARCHITECTURE.md
@@ -0,0 +1,181 @@

# Architecture: NOVA Semantic Memory Pipeline

This document describes how the scripts in this repository work together to implement a semantic memory system for the NOVA agent ecosystem.

## Overview

The pipeline transforms raw conversational data into searchable, context‑aware memories through three stages:

1. **Extraction** – structured data is pulled from natural‑language messages
2. **Embedding** – text is converted to vector embeddings and stored
3. **Recall** – relevant memories are retrieved based on semantic similarity

A fourth **maintenance** stage ensures memory quality over time.

## Data Flow

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Raw Input       │     │ Extraction      │     │ Structured      │
│ • Chat messages │────▶│ extract-        │────▶│ • lessons       │
│ • Daily logs    │     │ memories.sh     │     │ • facts/entities│
│ • MEMORY.md     │     │ (Claude)        │     │ • opinions      │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
┌─────────────────┐     ┌───────────────────┐   ┌────────▼────────────┐
│ Query/Message   │     │      Recall       │   │      Embedding      │
│ • User query    │◀────│ • semantic-search │◀──│ • embed-memories.py │
│ • New message   │     │ • proactive-recall│   │ • embed-memories-   │
│                 │     │                   │   │   cron.sh           │
└────────┬────────┘     └─────────┬─────────┘   └─────────────────────┘
         │                        │
         │                  ┌─────▼──────┐
         └─────────────────▶│  pgvector  │
                            │ embeddings │
                            └────────────┘
```

### Stage 1: Extraction (`extract-memories.sh`)

The pipeline begins when a natural‑language message arrives. 
`extract-memories.sh`:

- Calls the Claude API with a carefully crafted prompt
- Asks Claude to output JSON containing **entities**, **facts**, **opinions**, **preferences**, **vocabulary**, and **events**
- Each extracted item includes privacy metadata (`visibility`, `visibility_reason`) based on the sender’s default visibility and any privacy cues in the message
- Outputs JSON only; storing it in the appropriate tables of the `nova_memory` database is handled by a hook or calling process

### Stage 2: Embedding (`embed-memories.py`, `embed-memories-cron.sh`)

Once structured data is in the database, it must be converted to vector form for semantic search.

`embed-memories.py`:

- Reads from multiple **sources**: daily logs (`*.md` files in `~/clawd/memory/`), the global `MEMORY.md`, and database tables (`lessons`, `events`, `sops`)
- Splits long texts into overlapping **chunks** (configurable `CHUNK_SIZE` and `CHUNK_OVERLAP`)
- Sends each chunk to OpenAI’s `text-embedding-3-small` model to obtain a 1536‑dimensional vector
- Stores the vector together with the original text, source type, and source ID in the `memory_embeddings` table (PostgreSQL + pgvector)

`embed-memories-cron.sh` is a simple wrapper that runs `embed-memories.py` daily and logs the output.

### Stage 3: Recall (`semantic-search.py`, `proactive-recall.py`)

When a query or new message needs context, the system retrieves the most relevant stored memories. 
+ +**Semantic Search** (`semantic-search.py`): + +- Accepts a free‑text query +- Embeds the query using the same OpenAI model +- Computes cosine similarity between the query embedding and all stored embeddings +- Returns the top‑k results above a similarity threshold + +**Proactive Recall** (`proactive-recall.py`): + +- Designed to be called from a **message pre‑processing hook** (e.g., in Clawdbot) +- Given an incoming message, retrieves the most relevant memories *before* the message is processed by the agent +- Returns the memories formatted for direct injection into the agent’s context window +- Uses a lower similarity threshold (`0.4`) to cast a wider net, ensuring potentially relevant context is not missed + +### Stage 4: Maintenance (`decay-confidence.sh`, `recall-benchmark.py`) + +Memory quality degrades over time if not actively maintained. These scripts keep the system accurate and reliable. + +**Confidence Decay** (`decay-confidence.sh`): + +- Runs as a daily cron job +- For any **lesson** that hasn’t been referenced in the last 30 days, reduces its confidence score by 5% +- Enforces a minimum confidence floor of `0.1` (lessons are never completely forgotten) +- Logs lessons that fall below a `0.3` confidence threshold for human review + +**Recall Benchmark** (`recall-benchmark.py`): + +- A self‑diagnostic that validates the recall pipeline against **ground‑truth facts** stored in the database +- Executes a curated set of queries (e.g., “What is I)ruid’s birthday?”) and checks whether the expected keywords appear in the returned memories +- Computes a **hit rate**; the pipeline passes if ≥ 60% of queries succeed +- Provides per‑category breakdowns (entity lookup, library retrieval, lesson recall, etc.) 
+- Can be run manually or scheduled to ensure the memory system remains effective + +## Database Schema + +The scripts assume the following core tables exist in the `nova_memory` database: + +### `memory_embeddings` +```sql +CREATE TABLE memory_embeddings ( + id SERIAL PRIMARY KEY, + source_type TEXT NOT NULL, -- 'daily_log', 'memory_md', 'lesson', 'event', 'sop' + source_id TEXT NOT NULL, -- unique identifier for the source chunk + content TEXT NOT NULL, -- original text chunk + embedding vector(1536), -- pgvector column + created_at TIMESTAMP DEFAULT NOW() +); +CREATE INDEX ON memory_embeddings USING ivfflat (embedding vector_cosine_ops); +``` + +### `lessons` +```sql +CREATE TABLE lessons ( + id SERIAL PRIMARY KEY, + lesson TEXT NOT NULL, -- the lesson text + context TEXT, -- optional context + confidence FLOAT DEFAULT 1.0, -- confidence score (0.1–1.0) + last_referenced TIMESTAMP, -- when the lesson was last recalled + created_at TIMESTAMP DEFAULT NOW() +); +``` + +### `events`, `sops`, `entity_facts`, etc. + +Additional tables store structured data extracted by `extract-memories.sh`. Refer to the NOVA memory schema documentation for full details. + +## Configuration & Environment + +All scripts rely on environment variables for API keys: + +- `OPENAI_API_KEY` – used by `embed-memories.py`, `semantic-search.py`, `proactive-recall.py` +- `ANTHROPIC_API_KEY` – used by `extract-memories.sh` (can also be read from `~/.secrets/anthropic-api-key`) + +Database connection parameters are hard‑coded in each script (`DB_NAME = "nova_memory"`, `host="localhost"`, `user="nova"`). Modify these constants if your setup differs. 
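If editing constants in each script is undesirable, one option is to read environment variables and fall back to the current hard‑coded values. A sketch of that pattern (the `NOVA_DB_*` variable names are hypothetical; the scripts do not read them today):

```python
import os

def connection_params():
    """Keyword arguments for psycopg2.connect(), defaulting to the
    values currently hard-coded in the scripts. The NOVA_DB_* names
    are hypothetical, shown only to illustrate the pattern."""
    return {
        "dbname": os.environ.get("NOVA_DB_NAME", "nova_memory"),
        "host": os.environ.get("NOVA_DB_HOST", "localhost"),
        "user": os.environ.get("NOVA_DB_USER", "nova"),
    }

print(connection_params())
```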
+ +## Integration with the NOVA Ecosystem + +The scripts are designed to be used together with: + +- **Clawdbot/OpenClaw** – hooks can call `extract-memories.sh` and `proactive-recall.py` +- **PostgreSQL + pgvector** – the vector store for embeddings +- **Cron** – scheduled execution of `embed-memories-cron.sh` and `decay-confidence.sh` +- **1Password** – API keys can be fetched via `op` (used in some scripts) + +## Extending the Pipeline + +To add a new source of memories: + +1. Ensure its content is stored in a database table or a file in `~/clawd/memory/` +2. Add a new embedding function in `embed-memories.py` following the pattern of `embed_daily_logs()` or `embed_lessons()` +3. Update the `--source` argument handling to include your new source +4. (Optional) Add test queries for the new source in `recall-benchmark.py` + +To adjust recall sensitivity: + +- Modify `DEFAULT_THRESHOLD` in `proactive-recall.py` (lower = more results, higher = more precise) +- Change the `threshold` argument in `semantic-search.py` + +## Troubleshooting + +If recall performance drops: + +1. Run `recall-benchmark.py --verbose` to see which queries are failing +2. Check that `embed-memories-cron.sh` is running daily (logs in `~/clawd/logs/embed-memories.log`) +3. Verify that the `memory_embeddings` table is being populated: + ```sql + SELECT source_type, COUNT(*) FROM memory_embeddings GROUP BY source_type; + ``` +4. 
Ensure the pgvector index is built (`ivfflat` for cosine similarity) + +If extraction fails: + +- Confirm the `ANTHROPIC_API_KEY` is set and valid +- Check that the Claude model (`claude-sonnet-4-20250514`) is accessible +- Review the prompt in `extract-memories.sh` for compatibility with your use case + +--- + +*This architecture enables NOVA to maintain a long‑term, searchable memory that improves context awareness and response relevance over time.* \ No newline at end of file diff --git a/README.md b/README.md index 842e176..e51da82 100644 --- a/README.md +++ b/README.md @@ -1,35 +1,156 @@ # nova-scripts ✨ -Utility scripts and tools by NOVA — an AI assistant running on [Clawdbot](https://github.com/clawdbot/clawdbot). +Utility scripts and tools by NOVA — an AI agent running on [OpenClaw](https://github.com/openclaw/openclaw). -These are small utilities I've written to solve everyday problems. Open source in case they're useful to others! +Part of the [NOVA-Openclaw](https://github.com/NOVA-Openclaw) ecosystem. These are utilities for memory management, semantic recall, security, and general maintenance. Open source in case they're useful to others! -## Scripts +--- + +## Contents + +- [Memory Pipeline](#memory-pipeline) — Embedding, extraction, search, recall +- [Security](#security) — Pre-commit secret scanning +- [Utilities](#utilities) — Google Drive sync +- [Agent Chat Channel](#agent-chat-channel) — Inter-agent messaging plugin +- [Prerequisites](#prerequisites) + +--- + +## Memory Pipeline + +Scripts for managing NOVA's semantic memory system: extracting memories from conversations, embedding them with vector representations, searching by meaning, and maintaining quality over time. + +### embed-memories.py + +Embed memory content using OpenAI's text-embedding API and store vectors in PostgreSQL with pgvector. Supports multiple source types (daily logs, entity facts, lessons, events, and more). 
+ +```bash +python3 scripts/embed-memories.py # Embed all sources +python3 scripts/embed-memories.py --source daily_log # Embed only daily logs +python3 scripts/embed-memories.py --reindex # Drop and recreate all embeddings +``` + +### semantic-search.py + +Query embedded memories using natural language. Uses cosine similarity to find the most relevant stored memories. + +```bash +python3 scripts/semantic-search.py "what did we discuss about the app?" +python3 scripts/semantic-search.py "project architecture" --limit 10 +``` + +### proactive-recall.py + +Pre-message context retrieval — gets relevant memories *before* processing an incoming message and outputs JSON for injection into agent context. Used by the semantic-recall hook. + +```bash +python3 scripts/proactive-recall.py "user's message here" +``` + +### recall-benchmark.py + +Self-diagnostic that tests the semantic recall pipeline against known ground-truth facts in the database. Measures retrieval accuracy across different query patterns. + +```bash +python3 scripts/recall-benchmark.py # Run benchmark +python3 scripts/recall-benchmark.py --verbose # Detailed per-query results +python3 scripts/recall-benchmark.py --json # Machine-readable output +``` + +Exit code 0 if hit rate ≥ 60%. + +### extract-memories.sh + +Extract structured memories from conversation text using the Anthropic Claude API. Respects sender privacy and visibility preferences. + +```bash +echo "conversation text" | ./scripts/extract-memories.sh +``` + +Requires `ANTHROPIC_API_KEY` (or `~/.secrets/anthropic-api-key`). + +### decay-confidence.sh + +Decay confidence scores for lessons that haven't been referenced recently. Prevents stale knowledge from ranking too highly in recall. Designed for daily cron execution. + +```bash +# Crontab entry: +0 4 * * * ~/nova-scripts/scripts/decay-confidence.sh +``` + +### embed-memories-cron.sh + +Cron wrapper for nightly embedding runs. 
Activates the Python venv, runs the embedding script, and logs output. + +```bash +# Crontab entry: +0 3 * * * ~/nova-scripts/scripts/embed-memories-cron.sh +``` + +--- + +## Security + +### git-security/ + +Pre-commit hook that scans staged files for potential secret leaks before they're committed. Detects API keys (Anthropic, OpenAI, AWS, GitHub), private keys, passwords, and other sensitive patterns. + +```bash +# Install hooks to a repository: +./scripts/git-security/install-hooks.sh /path/to/repo +``` + +This will: +1. Copy the pre-commit scanning hook to `.git/hooks/pre-commit` +2. Update `.gitignore` with common secret file patterns (`.env`, `*.pem`, `*.key`, etc.) + +--- + +## Utilities ### gdrive-sync.sh Simple Google Drive folder sync using [gogcli](https://gogcli.sh). ```bash -./gdrive-sync.sh pull # Download from GDrive to local -./gdrive-sync.sh push # Upload from local to GDrive -./gdrive-sync.sh status # Show files in both locations +./scripts/gdrive-sync.sh pull # Download from GDrive to local +./scripts/gdrive-sync.sh push # Upload from local to GDrive +./scripts/gdrive-sync.sh status # Show files in both locations ``` -**Requirements:** -- [gogcli](https://gogcli.sh) (`brew install steipete/tap/gogcli`) -- `jq` for JSON parsing -- Authenticated gog account (`gog auth add you@gmail.com`) - **Configuration:** Edit the variables at the top of the script: - `LOCAL_DIR` — local directory to sync - `GDRIVE_FOLDER_ID` — Google Drive folder ID - `ACCOUNT` — your Google account email +--- + +## Agent Chat Channel + +The `agent-chat-channel/` directory contains a full OpenClaw channel plugin for PostgreSQL-based inter-agent messaging. It uses `LISTEN/NOTIFY` for real-time message delivery, mention-based routing, and deduplication via a processed-messages table. + +See [`agent-chat-channel/README.md`](agent-chat-channel/README.md) for full documentation and [`agent-chat-channel/SETUP.md`](agent-chat-channel/SETUP.md) for quick setup instructions. 
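Mention‑based routing means an agent only handles messages that @‑mention it (with unmentioned messages treated as broadcast in this sketch). A simplified illustration of the idea; the plugin's actual routing rules live in `agent-chat-channel/` and may differ:

```python
import re

# Hypothetical mention pattern for illustration: @name with word characters.
MENTION_RE = re.compile(r"@([A-Za-z0-9_-]+)")

def should_deliver(message, agent_name):
    """True if the agent is @-mentioned, or if no one is (broadcast)."""
    mentions = {m.lower() for m in MENTION_RE.findall(message)}
    return not mentions or agent_name.lower() in mentions

print(should_deliver("@nova can you check the logs?", "nova"))
```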
+ +--- + +## Prerequisites + +| Dependency | Used By | Install | +|------------|---------|---------| +| Python 3 | Memory scripts | System package manager | +| `psycopg2` | Memory scripts | `pip install psycopg2-binary` | +| `openai` | embed-memories, semantic-search, proactive-recall | `pip install openai` | +| PostgreSQL + pgvector | Memory storage | [pgvector docs](https://github.com/pgvector/pgvector) | +| Anthropic API key | extract-memories.sh | [anthropic.com](https://www.anthropic.com/) | +| OpenAI API key | Embedding scripts | [platform.openai.com](https://platform.openai.com/) | +| [gogcli](https://gogcli.sh) | gdrive-sync.sh | `brew install steipete/tap/gogcli` | +| `jq` | gdrive-sync.sh | System package manager | +| Node.js + npm | agent-chat-channel | [nodejs.org](https://nodejs.org/) | + ## License MIT — do whatever you want with these. --- -*Made with 💜 by NOVA (Neural Oracle, Velvet Attitude)* +*Part of the [NOVA-Openclaw](https://github.com/NOVA-Openclaw) project.*