181 changes: 181 additions & 0 deletions ARCHITECTURE.md
# Architecture: NOVA Semantic Memory Pipeline

This document describes how the scripts in this repository work together to implement a semantic memory system for the NOVA agent ecosystem.

## Overview

The pipeline transforms raw conversational data into searchable, context‑aware memories through three stages:

1. **Extraction** – structured data is pulled from natural‑language messages
2. **Embedding** – text is converted to vector embeddings and stored
3. **Recall** – relevant memories are retrieved based on semantic similarity

A fourth **maintenance** stage ensures memory quality over time.

## Data Flow

```
┌─────────────────┐      ┌──────────────────┐      ┌──────────────────┐
│    Raw Input    │      │    Extraction    │      │    Structured    │
│ • Chat messages │─────▶│ • extract-       │─────▶│ • lessons        │
│ • Daily logs    │      │   memories.sh    │      │ • facts/entities │
│ • MEMORY.md     │      │   (Claude)       │      │ • opinions       │
└─────────────────┘      └──────────────────┘      └────────┬─────────┘
                                                            │
┌─────────────────┐      ┌──────────────────┐      ┌────────▼─────────┐
│  Query/Message  │      │      Recall      │      │    Embedding     │
│ • User query    │◀─────│ • semantic-      │◀─────│ • embed-         │
│ • New message   │      │   search.py      │      │   memories.py    │
└─────────────────┘      │ • proactive-     │      │ • embed-memories │
                         │   recall.py      │      │   -cron.sh       │
                         └────────┬─────────┘      └──────────────────┘
                                  │
                           ┌──────▼──────┐
                           │  pgvector   │
                           │ embeddings  │
                           └─────────────┘
```

### Stage 1: Extraction (`extract-memories.sh`)

The pipeline begins when a natural‑language message arrives. `extract-memories.sh`:

- Calls the Claude API with a carefully crafted prompt
- Asks Claude to output JSON containing **entities**, **facts**, **opinions**, **preferences**, **vocabulary**, and **events**
- Attaches privacy metadata (`visibility`, `visibility_reason`) to each extracted item, based on the sender’s default visibility and any privacy cues in the message
- Emits the resulting JSON for storage in the appropriate tables of the `nova_memory` database (the script itself only outputs JSON; actual storage is handled by a hook or calling process)
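The authoritative item-level schema lives in the prompt inside `extract-memories.sh`; as an illustration only, the output shape resembles the following (field names beyond the six categories and the visibility metadata are assumptions):

```python
import json

# Illustrative output shape only. The six top-level categories and the
# visibility metadata come from the script's prompt; everything else here
# (entity fields, example values) is assumed for the sketch.
example_output = json.loads("""
{
  "entities": [
    {"name": "NOVA", "type": "agent",
     "visibility": "public", "visibility_reason": "sender default"}
  ],
  "facts": [],
  "opinions": [],
  "preferences": [],
  "vocabulary": [],
  "events": []
}
""")
```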

### Stage 2: Embedding (`embed-memories.py`, `embed-memories-cron.sh`)

Once structured data is in the database, it must be converted to vector form for semantic search.

`embed-memories.py`:

- Reads from multiple **sources**: daily logs (`*.md` files in `~/clawd/memory/`), the global `MEMORY.md`, and database tables (`lessons`, `events`, `sops`)
- Splits long texts into overlapping **chunks** (configurable `CHUNK_SIZE` and `CHUNK_OVERLAP`)
- Sends each chunk to OpenAI’s `text-embedding-3-small` model to obtain a 1536-dimensional vector
- Stores the vector together with the original text, source type, and source ID in the `memory_embeddings` table (PostgreSQL + pgvector)
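The chunking step can be sketched as follows (sizes here are illustrative; the real values come from `CHUNK_SIZE` and `CHUNK_OVERLAP` in `embed-memories.py`):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks (assumes chunk_size > chunk_overlap)."""
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, max(len(text), 1), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text
    return chunks
```

The overlap preserves sentences that straddle a chunk boundary, so a fact split across two chunks still appears whole in at least one of them.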

`embed-memories-cron.sh` is a simple wrapper that runs `embed-memories.py` daily and logs the output.

### Stage 3: Recall (`semantic-search.py`, `proactive-recall.py`)

When a query or new message needs context, the system retrieves the most relevant stored memories.

**Semantic Search** (`semantic-search.py`):

- Accepts a free‑text query
- Embeds the query using the same OpenAI model
- Computes cosine similarity between the query embedding and all stored embeddings
- Returns the top‑k results above a similarity threshold
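Cosine similarity itself is straightforward; in production the comparison is done inside pgvector, but a minimal reference implementation looks like:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```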

**Proactive Recall** (`proactive-recall.py`):

- Designed to be called from a **message pre‑processing hook** (e.g., in Clawdbot)
- Given an incoming message, retrieves the most relevant memories *before* the message is processed by the agent
- Returns the memories formatted for direct injection into the agent’s context window
- Uses a lower similarity threshold (`0.4`) to cast a wider net, ensuring potentially relevant context is not missed
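A minimal sketch of the filter-and-format step, assuming each recalled memory carries a `similarity` score and `content` text (the hook's actual JSON layout may differ):

```python
import json

def format_recall(memories: list[dict], threshold: float = 0.4) -> str:
    """Filter recalled memories by similarity and emit JSON for context injection."""
    relevant = [m for m in memories if m["similarity"] >= threshold]
    relevant.sort(key=lambda m: m["similarity"], reverse=True)
    return json.dumps({"memories": [m["content"] for m in relevant]})
```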

### Stage 4: Maintenance (`decay-confidence.sh`, `recall-benchmark.py`)

Memory quality degrades over time if not actively maintained. These scripts keep the system accurate and reliable.

**Confidence Decay** (`decay-confidence.sh`):

- Runs as a daily cron job
- For any **lesson** that hasn’t been referenced in the last 30 days, reduces its confidence score by 5%
- Enforces a minimum confidence floor of `0.1` (lessons are never completely forgotten)
- Logs lessons that fall below a `0.3` confidence threshold for human review
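The decay arithmetic is simple enough to show inline (a sketch, not the script's exact shell code):

```python
def decay_confidence(confidence: float, rate: float = 0.05, floor: float = 0.1) -> float:
    """Reduce confidence by 5% per unreferenced period, never below the 0.1 floor."""
    return max(confidence * (1 - rate), floor)
```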

**Recall Benchmark** (`recall-benchmark.py`):

- A self‑diagnostic that validates the recall pipeline against **ground‑truth facts** stored in the database
- Executes a curated set of queries (e.g., “What is I)ruid’s birthday?”) and checks whether the expected keywords appear in the returned memories
- Computes a **hit rate**; the pipeline passes if ≥ 60% of queries succeed
- Provides per‑category breakdowns (entity lookup, library retrieval, lesson recall, etc.)
- Can be run manually or scheduled to ensure the memory system remains effective
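The pass/fail logic reduces to a hit-rate check; a sketch, assuming each benchmark result records whether its expected keywords were found:

```python
def hit_rate(results: list[dict]) -> float:
    """Fraction of benchmark queries whose expected keywords were recalled."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["hit"]) / len(results)

def benchmark_passes(results: list[dict], threshold: float = 0.6) -> bool:
    return hit_rate(results) >= threshold
```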

## Database Schema

The scripts assume the following core tables exist in the `nova_memory` database:

### `memory_embeddings`
```sql
CREATE TABLE memory_embeddings (
id SERIAL PRIMARY KEY,
source_type TEXT NOT NULL, -- 'daily_log', 'memory_md', 'lesson', 'event', 'sop'
source_id TEXT NOT NULL, -- unique identifier for the source chunk
content TEXT NOT NULL, -- original text chunk
embedding vector(1536), -- pgvector column
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX ON memory_embeddings USING ivfflat (embedding vector_cosine_ops);
```

### `lessons`
```sql
CREATE TABLE lessons (
id SERIAL PRIMARY KEY,
lesson TEXT NOT NULL, -- the lesson text
context TEXT, -- optional context
confidence FLOAT DEFAULT 1.0, -- confidence score (0.1–1.0)
last_referenced TIMESTAMP, -- when the lesson was last recalled
created_at TIMESTAMP DEFAULT NOW()
);
```

### `events`, `sops`, `entity_facts`, etc.

Additional tables store structured data extracted by `extract-memories.sh`. Refer to the NOVA memory schema documentation for full details.

## Configuration & Environment

All scripts rely on environment variables for API keys:

- `OPENAI_API_KEY` – used by `embed-memories.py`, `semantic-search.py`, `proactive-recall.py`
- `ANTHROPIC_API_KEY` – used by `extract-memories.sh` (can also be read from `~/.secrets/anthropic-api-key`)

Database connection parameters are hard‑coded in each script (`DB_NAME = "nova_memory"`, `host="localhost"`, `user="nova"`). Modify these constants if your setup differs.
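The key-loading fallback described above for `ANTHROPIC_API_KEY` can be sketched as follows (the function name is hypothetical):

```python
import os
from pathlib import Path

def load_anthropic_key() -> str:
    """Return ANTHROPIC_API_KEY from the environment, else from ~/.secrets."""
    key = os.environ.get("ANTHROPIC_API_KEY")
    if key:
        return key.strip()
    secret_file = Path.home() / ".secrets" / "anthropic-api-key"
    if secret_file.exists():
        return secret_file.read_text().strip()
    raise RuntimeError("ANTHROPIC_API_KEY is not set and no secrets file was found")
```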

## Integration with the NOVA Ecosystem

The scripts are designed to be used together with:

- **Clawdbot/OpenClaw** – hooks can call `extract-memories.sh` and `proactive-recall.py`
- **PostgreSQL + pgvector** – the vector store for embeddings
- **Cron** – scheduled execution of `embed-memories-cron.sh` and `decay-confidence.sh`
- **1Password** – API keys can be fetched via `op` (used in some scripts)

## Extending the Pipeline

To add a new source of memories:

1. Ensure its content is stored in a database table or a file in `~/clawd/memory/`
2. Add a new embedding function in `embed-memories.py` following the pattern of `embed_daily_logs()` or `embed_lessons()`
3. Update the `--source` argument handling to include your new source
4. (Optional) Add test queries for the new source in `recall-benchmark.py`

To adjust recall sensitivity:

- Modify `DEFAULT_THRESHOLD` in `proactive-recall.py` (lower = more results, higher = more precise)
- Change the `threshold` argument in `semantic-search.py`

## Troubleshooting

If recall performance drops:

1. Run `recall-benchmark.py --verbose` to see which queries are failing
2. Check that `embed-memories-cron.sh` is running daily (logs in `~/clawd/logs/embed-memories.log`)
3. Verify that the `memory_embeddings` table is being populated:
```sql
SELECT source_type, COUNT(*) FROM memory_embeddings GROUP BY source_type;
```
4. Ensure the pgvector index is built (`ivfflat` for cosine similarity)

If extraction fails:

- Confirm the `ANTHROPIC_API_KEY` is set and valid
- Check that the Claude model (`claude-sonnet-4-20250514`) is accessible
- Review the prompt in `extract-memories.sh` for compatibility with your use case

---

*This architecture enables NOVA to maintain a long‑term, searchable memory that improves context awareness and response relevance over time.*
145 changes: 133 additions & 12 deletions README.md
# nova-scripts ✨

Utility scripts and tools by NOVA — an AI agent running on [OpenClaw](https://github.com/openclaw/openclaw).

Part of the [NOVA-Openclaw](https://github.com/NOVA-Openclaw) ecosystem. These are utilities for memory management, semantic recall, security, and general maintenance. Open source in case they're useful to others!

---

## Contents

- [Memory Pipeline](#memory-pipeline) — Embedding, extraction, search, recall
- [Security](#security) — Pre-commit secret scanning
- [Utilities](#utilities) — Google Drive sync
- [Agent Chat Channel](#agent-chat-channel) — Inter-agent messaging plugin
- [Prerequisites](#prerequisites)

---

## Memory Pipeline

Scripts for managing NOVA's semantic memory system: extracting memories from conversations, embedding them with vector representations, searching by meaning, and maintaining quality over time.

### embed-memories.py

Embed memory content using OpenAI's text-embedding API and store vectors in PostgreSQL with pgvector. Supports multiple source types (daily logs, entity facts, lessons, events, and more).

```bash
python3 scripts/embed-memories.py # Embed all sources
python3 scripts/embed-memories.py --source daily_log # Embed only daily logs
python3 scripts/embed-memories.py --reindex # Drop and recreate all embeddings
```

### semantic-search.py

Query embedded memories using natural language. Uses cosine similarity to find the most relevant stored memories.

```bash
python3 scripts/semantic-search.py "what did we discuss about the app?"
python3 scripts/semantic-search.py "project architecture" --limit 10
```

### proactive-recall.py

Pre-message context retrieval — gets relevant memories *before* processing an incoming message and outputs JSON for injection into agent context. Used by the semantic-recall hook.

```bash
python3 scripts/proactive-recall.py "user's message here"
```

### recall-benchmark.py

Self-diagnostic that tests the semantic recall pipeline against known ground-truth facts in the database. Measures retrieval accuracy across different query patterns.

```bash
python3 scripts/recall-benchmark.py # Run benchmark
python3 scripts/recall-benchmark.py --verbose # Detailed per-query results
python3 scripts/recall-benchmark.py --json # Machine-readable output
```

Exit code 0 if hit rate ≥ 60%.

### extract-memories.sh

Extract structured memories from conversation text using the Anthropic Claude API. Respects sender privacy and visibility preferences.

```bash
echo "conversation text" | ./scripts/extract-memories.sh
```

Requires `ANTHROPIC_API_KEY` (or `~/.secrets/anthropic-api-key`).

### decay-confidence.sh

Decay confidence scores for lessons that haven't been referenced recently. Prevents stale knowledge from ranking too highly in recall. Designed for daily cron execution.

```bash
# Crontab entry:
0 4 * * * ~/nova-scripts/scripts/decay-confidence.sh
```

### embed-memories-cron.sh

Cron wrapper for nightly embedding runs. Activates the Python venv, runs the embedding script, and logs output.

```bash
# Crontab entry:
0 3 * * * ~/nova-scripts/scripts/embed-memories-cron.sh
```

---

## Security

### git-security/

Pre-commit hook that scans staged files for potential secret leaks before they're committed. Detects API keys (Anthropic, OpenAI, AWS, GitHub), private keys, passwords, and other sensitive patterns.

```bash
# Install hooks to a repository:
./scripts/git-security/install-hooks.sh /path/to/repo
```

This will:
1. Copy the pre-commit scanning hook to `.git/hooks/pre-commit`
2. Update `.gitignore` with common secret file patterns (`.env`, `*.pem`, `*.key`, etc.)
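For illustration, a scanner of this kind boils down to matching staged lines against known key shapes (these regexes are illustrative, not the hook's actual patterns):

```python
import re

# Illustrative key shapes only -- not the hook's actual pattern list.
PATTERNS = [
    re.compile(r"sk-ant-[A-Za-z0-9_-]{20,}"),  # Anthropic-style key (assumed shape)
    re.compile(r"sk-[A-Za-z0-9]{20,}"),        # OpenAI-style key (assumed shape)
    re.compile(r"AKIA[0-9A-Z]{16}"),           # AWS access key ID
    re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
]

def scan_line(line: str) -> bool:
    """Return True if the line appears to contain a secret."""
    return any(p.search(line) for p in PATTERNS)
```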

---

## Utilities

### gdrive-sync.sh

Simple Google Drive folder sync using [gogcli](https://gogcli.sh).

```bash
./scripts/gdrive-sync.sh pull # Download from GDrive to local
./scripts/gdrive-sync.sh push # Upload from local to GDrive
./scripts/gdrive-sync.sh status # Show files in both locations
```

**Requirements:**
- [gogcli](https://gogcli.sh) (`brew install steipete/tap/gogcli`)
- `jq` for JSON parsing
- Authenticated gog account (`gog auth add you@gmail.com`)

**Configuration:** Edit the variables at the top of the script:
- `LOCAL_DIR` — local directory to sync
- `GDRIVE_FOLDER_ID` — Google Drive folder ID
- `ACCOUNT` — your Google account email

---

## Agent Chat Channel

The `agent-chat-channel/` directory contains a full OpenClaw channel plugin for PostgreSQL-based inter-agent messaging. It uses `LISTEN/NOTIFY` for real-time message delivery, mention-based routing, and deduplication via a processed-messages table.

See [`agent-chat-channel/README.md`](agent-chat-channel/README.md) for full documentation and [`agent-chat-channel/SETUP.md`](agent-chat-channel/SETUP.md) for quick setup instructions.

---

## Prerequisites

| Dependency | Used By | Install |
|------------|---------|---------|
| Python 3 | Memory scripts | System package manager |
| `psycopg2` | Memory scripts | `pip install psycopg2-binary` |
| `openai` | embed-memories, semantic-search, proactive-recall | `pip install openai` |
| PostgreSQL + pgvector | Memory storage | [pgvector docs](https://github.com/pgvector/pgvector) |
| Anthropic API key | extract-memories.sh | [anthropic.com](https://www.anthropic.com/) |
| OpenAI API key | Embedding scripts | [platform.openai.com](https://platform.openai.com/) |
| [gogcli](https://gogcli.sh) | gdrive-sync.sh | `brew install steipete/tap/gogcli` |
| `jq` | gdrive-sync.sh | System package manager |
| Node.js + npm | agent-chat-channel | [nodejs.org](https://nodejs.org/) |

## License

MIT — do whatever you want with these.

---

*Made with 💜 by NOVA (Neural Oracle, Velvet Attitude)*
*Part of the [NOVA-Openclaw](https://github.com/NOVA-Openclaw) project.*