diff --git a/README.md b/README.md index c116b34..7d102f0 100644 --- a/README.md +++ b/README.md @@ -64,15 +64,17 @@ --- -## Quick start (3 lines) +## Quick start ```bash -uv tool install code-context-engine +uv tool install "code-context-engine[local]" # or: pipx install "code-context-engine[local]" cd /path/to/your/project -cce init # or: cce init --agent all +cce init # or: cce init --agent all ``` -That's it. Your AI coding agent now searches your index instead of reading entire files. No config needed. +That's it. Your AI coding agent now searches your index instead of reading entire files. + +> **Already have Ollama?** You can skip `[local]` and use `uv tool install code-context-engine` instead. CCE auto-detects Ollama at localhost:11434 and uses `nomic-embed-text`. --- @@ -92,16 +94,18 @@ Tested on all three platforms in CI (macOS, Linux, Windows × Python 3.11/3.12/3 ## Install and see savings in 60 seconds -```bash -uv tool install code-context-engine # or: pipx install code-context-engine -cd /path/to/your/project -cce init # index, install hooks, register MCP server -``` +You need an embedding backend to index code. Pick one: -**Embedding backends:** CCE auto-detects the best available backend. If you have Ollama running, it uses `nomic-embed-text` with zero extra dependencies. For offline/local embedding without Ollama, install the `[local]` extra: +| Option | Install command | Size | Requires | +|--------|----------------|------|----------| +| **Local (recommended)** | `uv tool install "code-context-engine[local]"` | +60 MB | Nothing else | +| **Ollama** | `uv tool install code-context-engine` | Core only | Ollama running + `nomic-embed-text` pulled | + +Then: ```bash -uv tool install "code-context-engine[local]" # includes fastembed + ONNX Runtime +cd /path/to/your/project +cce init # index, install hooks, register MCP server ``` Restart your editor. Done. Every question now hits the index instead of re-reading files. @@ -449,16 +453,18 @@ No. 
Quality stays the same or slightly improves. CCE replaces "dump the entire file" with "search for the relevant function." The model still gets the code it needs (0.90 Recall@10 in benchmarks). Less irrelevant context means less noise competing for attention, which can improve the model's focus on your actual question. -### How do I increase output token savings? +### How do output token savings work? + +CCE writes output compression rules directly into your agent's instruction files (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`, etc.) during `cce init`. These rules apply to the **entire session**, not just CCE tool responses, so every reply from the agent follows them. -Set the output compression level in your project config (`cce.yaml`): +Set the level in `cce.yaml`: ```yaml compression: output: max # off | lite | standard | max ``` -Or change it at runtime via the MCP tool: +Then re-run `cce init` to update instruction files. Or change it at runtime: ``` set_output_level output_level=max @@ -471,7 +477,7 @@ set_output_level output_level=max | `standard` | ~70% | Drops articles, fragments, short synonyms + diff-only for code | | `max` | ~80% | Telegraphic style + diff-only for code | -Default is `standard`. All levels include **code output rules** that instruct the model to show only changed lines (not full file rewrites), which is where most output tokens go in coding sessions. The `max` level produces very terse prose (similar to "caveman mode"). Code blocks, paths, and commands are never compressed regardless of level. +Default is `standard`. All levels include **code output rules** that tell the model to show only changed lines rather than full file rewrites, which are where most output tokens go in coding sessions. The `max` level produces very terse prose (similar to "caveman mode"). Code blocks, paths, and commands are never compressed regardless of level. ### Where do the savings come from? 
diff --git a/docs-src/astro.config.mjs b/docs-src/astro.config.mjs index 7cfd597..44a222f 100644 --- a/docs-src/astro.config.mjs +++ b/docs-src/astro.config.mjs @@ -14,6 +14,7 @@ export default defineConfig({ ], sidebar: [ { slug: 'introduction' }, + { slug: 'why-cce' }, { slug: 'getting-started' }, { label: 'Agent Setup', diff --git a/docs-src/src/content/docs/configuration.md b/docs-src/src/content/docs/configuration.md index 56e62cb..80b9978 100644 --- a/docs-src/src/content/docs/configuration.md +++ b/docs-src/src/content/docs/configuration.md @@ -57,22 +57,21 @@ Controls how much CCE compresses code chunks before including them in the agent' ### Output compression (`compression.output`) -Controls how verbose the agent's responses are. Set via the `set_output_compression` MCP tool or config. +Controls how verbose the agent's responses are. During `cce init`, the configured level is written into instruction files (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`, etc.) so it applies to the **entire session**, not just CCE tool responses. | Level | Style | Typical savings | |-------|-------|----------------| | `off` | Full output | 0% | -| `lite` | Removes filler and hedging | ~30% | -| `standard` | Shorter phrasing, fragments where possible | ~65% | -| `max` | Telegraphic, minimal prose | ~75% | +| `lite` | No filler/hedging, diff-only code | ~25% | +| `standard` | Fragments, short synonyms, diff-only code | ~70% | +| `max` | Telegraphic, abbreviations, diff-only code | ~80% | -Code blocks, file paths, commands, and error messages are never compressed regardless of level. +All levels include code output rules: show only changed lines, never rewrite entire files, never echo back unchanged code. Code blocks, paths, commands, and error messages are never compressed. Security warnings use full clarity. 
-Change at runtime by telling your agent: +Change the level and re-run `cce init` to update instruction files, or change it at runtime: ``` -Switch to max output compression -Turn off output compression +set_output_level output_level=max ``` ## Embedding model diff --git a/docs-src/src/content/docs/faq.md b/docs-src/src/content/docs/faq.md index 66a1893..686bc4b 100644 --- a/docs-src/src/content/docs/faq.md +++ b/docs-src/src/content/docs/faq.md @@ -7,24 +7,45 @@ description: Frequently asked questions about Code Context Engine. No. CCE returns the same code your agent would find by reading files, just compressed and targeted. In practice, answers are often better because the agent receives focused, relevant context instead of entire files full of unrelated code. -## How can I increase output savings? +## How do output token savings work? -Set output compression to a higher level: +CCE writes output compression rules directly into your agent's instruction files (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`, etc.) during `cce init`. These apply to the **entire session**, so every response follows them. + +Set the level in `cce.yaml`, then re-run `cce init`: ```yaml compression: - output: max + output: max # off | lite | standard | max +``` + +Or change it at runtime via the MCP tool: + ``` +set_output_level output_level=max +``` + +| Level | Savings | Style | +|-------|---------|-------| +| `off` | 0% | Normal verbosity | +| `lite` | ~25% | No filler/hedging, diff-only code | +| `standard` | ~70% | Fragments, short synonyms, diff-only code | +| `max` | ~80% | Telegraphic, abbreviations, diff-only code | -Or tell your agent at runtime: "Switch to max output compression." The `max` level uses telegraphic phrasing and typically saves ~75% on response tokens. Code blocks and file paths are never affected. +Default is `standard`. All levels include code output rules that tell the model to show only changed lines instead of full file rewrites. 
Code blocks, paths, and commands are never compressed. Security warnings use full clarity. ## Where do the savings come from? -Three main sources: +**Input tokens** (what goes into the model): -1. **Retrieval.** Only relevant chunks are returned instead of the full codebase. This is the largest contributor (often 80%+ reduction). +1. **Retrieval.** Only relevant chunks are returned instead of the full codebase. This is the largest contributor (often 94% reduction). 2. **Chunk compression.** Retrieved chunks are truncated to signatures and docstrings, or summarized via Ollama if available. -3. **Output compression.** Agent responses are shortened by removing filler, hedging, and verbose phrasing. +3. **Grammar compression.** Articles and filler removed from context. +4. **Turn summarization.** Session history compressed. +5. **Progressive disclosure.** Tool payloads filtered. + +**Output tokens** (what comes back from the model): + +6. **Output compression.** Session-wide style directives in instruction files reduce prose verbosity and enforce diff-only code changes. Output tokens cost 5x more than input (e.g. Opus: $75/1M vs $15/1M), so even moderate output savings have outsized cost impact. ## Is my code sent anywhere? @@ -57,6 +78,19 @@ CCE uses Tree-sitter for structural parsing. The following languages have full A Other file types (YAML, Markdown, config files, etc.) are indexed using line-based chunking. They still appear in search results but without function-level granularity. +## Why does `cce init` fail with "No embedding backend available"? + +CCE needs an embedding backend to convert code into searchable vectors. You have two options: + +1. **Install with `[local]` extra** (recommended): `uv tool install "code-context-engine[local]"`. This includes fastembed, which works offline with no external services. +2. **Use Ollama**: Start Ollama and run `ollama pull nomic-embed-text`. Then install CCE without `[local]`: `uv tool install code-context-engine`. 
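Before choosing the Ollama option, it can help to confirm that a local Ollama server is reachable and that `nomic-embed-text` has been pulled. The sketch below is illustrative only and is not necessarily how CCE performs its own detection. It assumes Ollama's `/api/tags` endpoint on the default port, and the helper names `has_model` and `ollama_ready` are hypothetical.

```python
import json
import urllib.error
import urllib.request

# Default Ollama address mentioned in the docs; adjust if your setup differs.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def has_model(tags_payload, wanted="nomic-embed-text"):
    """Return True if an /api/tags-style payload lists the wanted model under any tag."""
    if not isinstance(tags_payload, dict):
        return False
    for model in tags_payload.get("models", []):
        name = model.get("name", "")
        # Names usually look like "nomic-embed-text:latest"; compare the part before the tag.
        if name.split(":", 1)[0] == wanted:
            return True
    return False


def ollama_ready(url=OLLAMA_TAGS_URL, timeout=2.0):
    """Hypothetical check: True only if Ollama answers and the model is present."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return has_model(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Server not running, connection refused, timeout, or a non-JSON response.
        return False
```

Calling `ollama_ready()` returns `False` when nothing is listening on the port, which is the situation where installing with the `[local]` extra is the safer choice.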
+ +If you installed without `[local]` and don't have Ollama running, re-install with the extra: + +```bash +uv tool install --force "code-context-engine[local]" +``` + ## Can I use CCE with multiple agents at once? Yes. Run `cce init --agent all` to configure every supported agent. They all share the same index and MCP server, so there is no duplication or conflict. diff --git a/docs-src/src/content/docs/getting-started.md b/docs-src/src/content/docs/getting-started.md index 58a50ce..3e54535 100644 --- a/docs-src/src/content/docs/getting-started.md +++ b/docs-src/src/content/docs/getting-started.md @@ -17,22 +17,23 @@ description: Install CCE and start saving tokens in under a minute ## Install -```bash -uv tool install code-context-engine -``` - -Or with pipx: +CCE needs an embedding backend to index your code. Pick one: -```bash -pipx install code-context-engine -``` +| Option | Install command | What it needs | +|--------|----------------|---------------| +| **Local (recommended)** | `uv tool install "code-context-engine[local]"` | Nothing else. Includes fastembed + ONNX Runtime (~60 MB download on first run). | +| **Ollama** | `uv tool install code-context-engine` | Ollama running at localhost:11434 with `nomic-embed-text` pulled. | -### Optional: Local embedding (no Ollama) +Using pipx instead of uv: ```bash -uv tool install "code-context-engine[local]" # includes fastembed + ONNX Runtime +pipx install "code-context-engine[local]" ``` +:::caution +Installing without `[local]` and without Ollama running will cause `cce init` to fail with "No embedding backend available." Always pick one of the two options above. 
+::: + ## Initialize your project ```bash @@ -41,11 +42,11 @@ cce init ``` This does everything: -- Detects your embedding backend (Ollama or fastembed) +- Detects your embedding backend (fastembed or Ollama) - Builds vector, FTS, and graph indexes - Installs git hooks (auto-updates index on commit) - Writes MCP config for detected editors -- Creates instruction files +- Creates instruction files with output compression rules ### Target a specific agent @@ -79,12 +80,12 @@ cce savings ## Embedding backends -CCE auto-detects the best available backend: +CCE auto-detects the best available backend at init time: -1. **Ollama** (preferred) — If running at localhost:11434, uses `nomic-embed-text`. Zero extra dependencies. -2. **fastembed** — Install with `[local]` extra. Uses `BAAI/bge-small-en-v1.5`. Works offline, ~60 MB download. +1. **fastembed** (with `[local]` extra) — Uses `BAAI/bge-small-en-v1.5`. Works offline, no external services needed. ~60 MB model downloaded on first run. +2. **Ollama** — If running at localhost:11434 with `nomic-embed-text` pulled. Zero extra Python dependencies. -Set `CCE_EMBED_BACKEND=ollama` or `CCE_EMBED_BACKEND=fastembed` to force a specific backend. +Force a specific backend with `CCE_EMBED_BACKEND=fastembed` or `CCE_EMBED_BACKEND=ollama`. ## Next steps diff --git a/docs-src/src/content/docs/index.mdx b/docs-src/src/content/docs/index.mdx index 926eb75..0e5ea1b 100644 --- a/docs-src/src/content/docs/index.mdx +++ b/docs-src/src/content/docs/index.mdx @@ -10,7 +10,11 @@ hero: link: /code-context-engine/guide/getting-started/ icon: right-arrow variant: primary - - text: View on GitHub + - text: Why CCE? 
+ link: /code-context-engine/guide/why-cce/ + icon: information + variant: minimal + - text: GitHub link: https://github.com/elara-labs/code-context-engine icon: external variant: minimal diff --git a/docs-src/src/content/docs/introduction.md b/docs-src/src/content/docs/introduction.md index c79b69b..7f8d168 100644 --- a/docs-src/src/content/docs/introduction.md +++ b/docs-src/src/content/docs/introduction.md @@ -45,5 +45,5 @@ CCE parses your code into semantic chunks (functions, classes, modules) using Tr 1. **Index** — Tree-sitter parses code into semantic chunks. Stored locally with vector embeddings. 2. **Search** — Agent calls `context_search` via MCP. Hybrid vector + BM25 merged with Reciprocal Rank Fusion. Graph expansion adds related imports. -3. **Compress** — Chunks are compressed (truncation or LLM summary with Ollama). Output compression reduces reply tokens. +3. **Compress** — Chunks are compressed (truncation or LLM summary with Ollama). Session-wide output compression rules in instruction files reduce reply tokens (diff-only code, no filler). 4. **Track** — Every query recorded. `cce savings` shows tokens and dollars saved. diff --git a/docs-src/src/content/docs/savings-tracking.md b/docs-src/src/content/docs/savings-tracking.md index 98da24c..b42a310 100644 --- a/docs-src/src/content/docs/savings-tracking.md +++ b/docs-src/src/content/docs/savings-tracking.md @@ -30,28 +30,36 @@ Example output: ## Understanding the input/output split -Savings come from two independent stages: +The report separates input and output token savings because they have different pricing. Output tokens cost 5x more than input (e.g. Opus: $75/1M output vs $15/1M input). -- **Retrieval savings (input).** Instead of sending the entire codebase, CCE returns only the chunks relevant to the query. This is measured as: `1 - (served_tokens / full_codebase_tokens)`. 
+**Input savings** come from: -- **Compression savings (input).** The retrieved chunks are further compressed (truncation, summarization) before being sent to the agent. This is measured as: `1 - (compressed_tokens / raw_chunk_tokens)`. +- **Retrieval.** Only relevant chunks returned instead of full files (biggest contributor, often 94%). +- **Chunk compression.** Chunks truncated to signatures/docstrings or summarized via Ollama. +- **Grammar compression.** Articles and filler removed from context. +- **Turn summarization.** Session history compressed. +- **Progressive disclosure.** Tool payloads filtered. -The combined effect is multiplicative. If retrieval cuts 90% and compression cuts another 50%, the total savings are 95%. +**Output savings** come from: + +- **Output compression.** Session-wide style directives written into instruction files (`CLAUDE.md`, `AGENTS.md`, etc.) during `cce init`. These tell the agent to use compressed prose and diff-only code changes across the entire session. Configure the level in `cce.yaml` (`compression.output`: off/lite/standard/max). ## Per-bucket breakdown -The `How:` line in the output shows the contribution of each stage: +The breakdown shows each savings layer with its contribution: ``` -How: retrieval 93% + compression 90% + Breakdown: + retrieval 48% ▰▰▰▰▰▰▰▰▰▰ 6.0k $0.09 · 1 call + chunk compression 20% ▰▰▰▰▱▱▱▱▱▱ 2.6k $0.04 · 1 call + output compression* 2% ▰▱▱▱▱▱▱▱▱▱ 325 $0.02 · 1 call ``` -- **retrieval** represents the savings from selecting only relevant chunks. -- **compression** represents the savings from compressing those chunks. +Each row uses the correct pricing (input rate for input buckets, output rate for the output compression bucket). Buckets marked with `*` use estimated values. ## Configuring the pricing model -Cost estimates use model-specific input pricing. Configure which model to estimate for: +Cost estimates use model-specific pricing for both input and output tokens. 
Configure which model to estimate for: ```yaml # ~/.cce/config.yaml or .context-engine.yaml diff --git a/docs-src/src/content/docs/why-cce.md b/docs-src/src/content/docs/why-cce.md new file mode 100644 index 0000000..98b3fc0 --- /dev/null +++ b/docs-src/src/content/docs/why-cce.md @@ -0,0 +1,81 @@ +--- +title: Why CCE? +description: The problem CCE solves, with real numbers, and who benefits most. +--- + +## The problem + +Every time an AI coding agent answers a question about your code, it reads entire files. A 200-line file costs 200 lines of input tokens even when the agent only needs one function. Across a session with 20 queries, this adds up fast. + +**Real numbers from a FastAPI project (53 source files, 180K tokens):** + +| | Without CCE | With CCE | +|---|---|---| +| Tokens per query (avg) | 83,681 | 523 | +| Cost per query (Opus) | $1.25 | $0.008 | +| Cost for 20 queries | $25.00 | $0.16 | +| Tokens wasted on irrelevant code | ~95% | ~0% | + +That's $25 vs $0.16 for the same 20 questions. The agent gets the same answers both ways. The difference is how much irrelevant code it had to read to find them. + +## Why this happens + +AI coding agents are designed to be thorough. When you ask "how does authentication work?", the agent reads every file that might be relevant. Most of those files contain code that has nothing to do with authentication, but the agent reads them anyway because it can't know in advance which lines matter. + +This is the right behavior for correctness. But it's wasteful for cost. You're paying for the agent to read thousands of lines of code it immediately ignores. + +## What CCE does differently + +CCE sits between your agent and your codebase as an MCP server. Instead of the agent reading files directly, it calls `context_search("authentication")` and gets back only the relevant functions, classes, and modules. + +**Three layers of savings:** + +1. 
**Retrieval (94% input savings).** Tree-sitter parses your code into semantic chunks (functions, classes, imports). Vector + keyword search finds the relevant ones. The agent gets 500 tokens of focused code instead of 80,000 tokens of full files. + +2. **Compression (up to 89% additional).** Retrieved chunks are compressed to signatures and docstrings (or LLM-summarized via Ollama). If the agent needs the full source, it calls `expand_chunk`. + +3. **Output compression (up to 80% output savings).** Session-wide style directives in your instruction files tell the agent to use compressed prose and show only code diffs instead of full file rewrites. Output tokens cost 5x more than input tokens on Opus ($75/1M vs $15/1M), so this has outsized cost impact. + +## The memory problem + +Without CCE, every agent session starts from zero. The agent doesn't know what you decided yesterday, what architecture choices you made last week, or what code areas you've been working in. You end up re-explaining context every session. + +CCE adds cross-session memory: + +- **`record_decision`** stores architectural choices ("we chose PostgreSQL over MongoDB because...") +- **`record_code_area`** marks files you've worked on with descriptions +- **`session_recall`** retrieves past decisions at the start of new sessions + +The agent stops re-deriving answers it already figured out. Decisions compound instead of being forgotten. + +## Who benefits most + +**Large codebases.** The more files in your project, the more tokens wasted reading irrelevant code. A 500-file project wastes far more than a 20-file project. + +**Opus users.** Opus input tokens cost $15/1M, output $75/1M. A 94% reduction in input and 70% reduction in output saves real money. Sonnet and Haiku users save less in absolute dollars but still benefit from faster responses (fewer tokens = faster inference). + +**Multi-agent users.** If you use Claude Code, Cursor, and Codex on the same project, the index is shared. 
One `cce init --agent all` configures everything. Without CCE, each agent independently reads the same files and wastes the same tokens. + +**Teams.** Decisions recorded by one developer are recalled by another. The codebase's institutional knowledge lives in the index, not in individual developers' heads. + +## What CCE is NOT + +**Not a prompt optimizer.** CCE doesn't rewrite your prompts or modify your agent's system prompt. It provides a search tool and writes output style rules into instruction files. + +**Not cloud-based.** Everything runs on your machine. No code, embeddings, or queries leave your system. The only network call is fetching model pricing for cost estimates (cached 7 days). + +**Not a replacement for your agent's tools.** When you need to edit a specific file, use your agent's built-in file editor. CCE handles search and context retrieval. Use `context_search` for understanding code, use `Read`/`Edit` for modifying it. + +**Not language-limited.** Full AST-aware chunking works for Python, JavaScript, TypeScript, PHP, Go, Rust, and Java. Other file types (YAML, Markdown, config) use line-based chunking and still appear in search results. + +## The 60-second test + +```bash +uv tool install "code-context-engine[local]" +cd /path/to/your/project +cce init +``` + +Ask your agent a question. Then run `cce savings` to see exactly how many tokens and dollars CCE saved. If the numbers don't convince you, run `cce uninstall` to remove everything cleanly. + +> Already have Ollama running? Use `uv tool install code-context-engine` (without `[local]`) instead. 
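The per-query costs in the `why-cce.md` table follow from simple arithmetic: tokens multiplied by the price per million input tokens. The sketch below reworks that calculation using the token counts and the $15/1M Opus input price quoted in the text. The function name is hypothetical, and the results should land close to the table's rounded figures rather than match them to the cent.

```python
# Opus input price quoted in the text, in USD per one million tokens.
OPUS_INPUT_USD_PER_MILLION = 15.00


def input_cost_usd(tokens, usd_per_million=OPUS_INPUT_USD_PER_MILLION):
    """Cost in USD of sending `tokens` input tokens at the given price."""
    return tokens * usd_per_million / 1_000_000


# Average tokens per query from the table: without CCE vs with CCE.
without_cce = input_cost_usd(83_681)
with_cce = input_cost_usd(523)

print(f"per query:  ${without_cce:.3f} vs ${with_cce:.4f}")
print(f"20 queries: ${20 * without_cce:.2f} vs ${20 * with_cce:.2f}")
```

Only input pricing is modeled here. Output tokens would need the separate $75/1M rate mentioned in the text.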
diff --git a/docs/demo.gif b/docs/demo.gif index 4a36308..043177b 100644 Binary files a/docs/demo.gif and b/docs/demo.gif differ diff --git a/docs/demo.tape b/docs/demo.tape new file mode 100644 index 0000000..9c5e17b --- /dev/null +++ b/docs/demo.tape @@ -0,0 +1,43 @@ +# CCE Demo Recording +# Run from project root: vhs docs/demo.tape + +Output docs/demo.gif +Set Shell "zsh" +Set FontSize 14 +Set Width 900 +Set Height 550 +Set Theme "Catppuccin Mocha" +Set TypingSpeed 35ms +Set Padding 16 + +# Title +Type "# Code Context Engine — save 94% on AI coding tokens" +Enter +Sleep 1s + +# Show the welcome banner +Type "cce" +Enter +Sleep 3s + +# Show savings with input/output split +Type "cce savings" +Enter +Sleep 4s + +# Show all available commands +Type "cce list" +Enter +Sleep 4s + +# Show multi-agent init +Type "cce init --agent all --help" +Enter +Sleep 3s + +# Check for updates +Type "cce upgrade --check" +Enter +Sleep 3s + +Sleep 1s diff --git a/docs/guide/agents/claude/index.html b/docs/guide/agents/claude/index.html index 1bedf0a..9ae8684 100644 --- a/docs/guide/agents/claude/index.html +++ b/docs/guide/agents/claude/index.html @@ -45,7 +45,7 @@ })();
GitHub