Merged
36 changes: 21 additions & 15 deletions README.md
@@ -64,15 +64,17 @@

---

## Quick start (3 lines)
## Quick start

```bash
uv tool install code-context-engine
uv tool install "code-context-engine[local]" # or: pipx install "code-context-engine[local]"
cd /path/to/your/project
cce init # or: cce init --agent all
cce init # or: cce init --agent all
```

That's it. Your AI coding agent now searches your index instead of reading entire files. No config needed.
That's it. Your AI coding agent now searches your index instead of reading entire files.

> **Already have Ollama?** You can skip `[local]` and use `uv tool install code-context-engine` instead. CCE auto-detects Ollama at localhost:11434 and uses `nomic-embed-text`.

---

@@ -92,16 +94,18 @@ Tested on all three platforms in CI (macOS, Linux, Windows × Python 3.11/3.12/3

## Install and see savings in 60 seconds

```bash
uv tool install code-context-engine # or: pipx install code-context-engine
cd /path/to/your/project
cce init # index, install hooks, register MCP server
```
You need an embedding backend to index code. Pick one:

**Embedding backends:** CCE auto-detects the best available backend. If you have Ollama running, it uses `nomic-embed-text` with zero extra dependencies. For offline/local embedding without Ollama, install the `[local]` extra:
| Option | Install command | Size | Requires |
|--------|----------------|------|----------|
| **Local (recommended)** | `uv tool install "code-context-engine[local]"` | +60 MB | Nothing else |
| **Ollama** | `uv tool install code-context-engine` | Core only | Ollama running + `nomic-embed-text` pulled |

Then:

```bash
uv tool install "code-context-engine[local]" # includes fastembed + ONNX Runtime
cd /path/to/your/project
cce init # index, install hooks, register MCP server
```

Restart your editor. Done. Every question now hits the index instead of re-reading files.
@@ -449,16 +453,18 @@ No. Quality stays the same or slightly improves.

CCE replaces "dump the entire file" with "search for the relevant function." The model still gets the code it needs (0.90 Recall@10 in benchmarks). Less irrelevant context means less noise competing for attention, which can improve the model's focus on your actual question.

### How do I increase output token savings?
### How do output token savings work?

CCE writes output compression rules directly into your agent's instruction files (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`, etc.) during `cce init`. These rules apply to the **entire session**, not just CCE tool responses, so every reply from the agent follows them.

Set the output compression level in your project config (`cce.yaml`):
Set the level in `cce.yaml`:

```yaml
compression:
  output: max  # off | lite | standard | max
```

Or change it at runtime via the MCP tool:
Then re-run `cce init` to update instruction files. Or change at runtime:

```
set_output_level output_level=max
@@ -471,7 +477,7 @@ set_output_level output_level=max
| `standard` | ~70% | Drops articles, fragments, short synonyms + diff-only for code |
| `max` | ~80% | Telegraphic style + diff-only for code |

Default is `standard`. All levels include **code output rules** that instruct the model to show only changed lines (not full file rewrites), which is where most output tokens go in coding sessions. The `max` level produces very terse prose (similar to "caveman mode"). Code blocks, paths, and commands are never compressed regardless of level.
Default is `standard`. All levels include **code output rules** that tell the model to show only changed lines (not full file rewrites), which is where most output tokens go in coding sessions. The `max` level produces very terse prose (similar to "caveman mode"). Code blocks, paths, and commands are never compressed regardless of level.

### Where do the savings come from?

1 change: 1 addition & 0 deletions docs-src/astro.config.mjs
@@ -14,6 +14,7 @@ export default defineConfig({
],
sidebar: [
{ slug: 'introduction' },
{ slug: 'why-cce' },
{ slug: 'getting-started' },
{
label: 'Agent Setup',
15 changes: 7 additions & 8 deletions docs-src/src/content/docs/configuration.md
@@ -57,22 +57,21 @@ Controls how much CCE compresses code chunks before including them in the agent'

### Output compression (`compression.output`)

Controls how verbose the agent's responses are. Set via the `set_output_compression` MCP tool or config.
Controls how verbose the agent's responses are. During `cce init`, the configured level is written into instruction files (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`, etc.) so it applies to the **entire session**, not just CCE tool responses.

| Level | Style | Typical savings |
|-------|-------|----------------|
| `off` | Full output | 0% |
| `lite` | Removes filler and hedging | ~30% |
| `standard` | Shorter phrasing, fragments where possible | ~65% |
| `max` | Telegraphic, minimal prose | ~75% |
| `lite` | No filler/hedging, diff-only code | ~25% |
| `standard` | Fragments, short synonyms, diff-only code | ~70% |
| `max` | Telegraphic, abbreviations, diff-only code | ~80% |

Code blocks, file paths, commands, and error messages are never compressed regardless of level.
All levels include code output rules: show only changed lines, never rewrite entire files, never echo back unchanged code. Code blocks, paths, commands, and error messages are never compressed. Security warnings are always written in full, uncompressed prose.

Change at runtime by telling your agent:
Change the level and re-run `cce init` to update instruction files, or change at runtime:

```
Switch to max output compression
Turn off output compression
set_output_level output_level=max
```

## Embedding model
48 changes: 41 additions & 7 deletions docs-src/src/content/docs/faq.md
@@ -7,24 +7,45 @@ description: Frequently asked questions about Code Context Engine.

No. CCE returns the same code your agent would find by reading files, just compressed and targeted. In practice, answers are often better because the agent receives focused, relevant context instead of entire files full of unrelated code.

## How can I increase output savings?
## How do output token savings work?

Set output compression to a higher level:
CCE writes output compression rules directly into your agent's instruction files (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`, etc.) during `cce init`. These apply to the **entire session**, so every response follows them.

Set the level in `cce.yaml`, then re-run `cce init`:

```yaml
compression:
  output: max
  output: max  # off | lite | standard | max
```

Or change at runtime via the MCP tool:

```
set_output_level output_level=max
```

| Level | Savings | Style |
|-------|---------|-------|
| `off` | 0% | Normal verbosity |
| `lite` | ~25% | No filler/hedging, diff-only code |
| `standard` | ~70% | Fragments, short synonyms, diff-only code |
| `max` | ~80% | Telegraphic, abbreviations, diff-only code |

Or tell your agent at runtime: "Switch to max output compression." The `max` level uses telegraphic phrasing and typically saves ~75% on response tokens. Code blocks and file paths are never affected.
Default is `standard`. All levels include code output rules that tell the model to show only changed lines instead of full file rewrites. Code blocks, paths, and commands are never compressed. Security warnings are always written in full, uncompressed prose.
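
As a rough illustration of what the levels mean for a single reply, the sketch below applies the typical savings from the table to a hypothetical 2,000-token response. The function name and the dictionary are invented for this example and are not part of CCE. The percentages are the approximate figures from the table, so the results are estimates only.

```python
# Approximate savings per level, copied from the table above (estimates).
TYPICAL_OUTPUT_SAVINGS = {"off": 0.0, "lite": 0.25, "standard": 0.70, "max": 0.80}


def estimated_output_tokens(uncompressed_tokens, level="standard"):
    """Estimate reply size after output compression (hypothetical helper)."""
    if level not in TYPICAL_OUTPUT_SAVINGS:
        raise ValueError(f"unknown level: {level!r}")
    remaining_fraction = 1 - TYPICAL_OUTPUT_SAVINGS[level]
    return round(uncompressed_tokens * remaining_fraction)


for level in TYPICAL_OUTPUT_SAVINGS:
    print(level, estimated_output_tokens(2_000, level))
```

Actual savings depend on how much of a reply is prose versus code, since code blocks are never compressed.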

## Where do the savings come from?

Three main sources:
**Input tokens** (what goes into the model):

1. **Retrieval.** Only relevant chunks are returned instead of the full codebase. This is the largest contributor (often 80%+ reduction).
1. **Retrieval.** Only relevant chunks are returned instead of the full codebase. This is the largest contributor (often around a 94% reduction).
2. **Chunk compression.** Retrieved chunks are truncated to signatures and docstrings, or summarized via Ollama if available.
3. **Output compression.** Agent responses are shortened by removing filler, hedging, and verbose phrasing.
3. **Grammar compression.** Articles and filler removed from context.
4. **Turn summarization.** Session history compressed.
5. **Progressive disclosure.** Tool payloads filtered.

**Output tokens** (what comes back from the model):

6. **Output compression.** Session-wide style directives in instruction files reduce prose verbosity and enforce diff-only code changes. Output tokens cost 5x as much as input tokens (e.g. Opus: $75/1M vs $15/1M), so even moderate output savings have an outsized cost impact.

## Is my code sent anywhere?

@@ -57,6 +78,19 @@ CCE uses Tree-sitter for structural parsing. The following languages have full A

Other file types (YAML, Markdown, config files, etc.) are indexed using line-based chunking. They still appear in search results but without function-level granularity.

## Why does `cce init` fail with "No embedding backend available"?

CCE needs an embedding backend to convert code into searchable vectors. You have two options:

1. **Install with `[local]` extra** (recommended): `uv tool install "code-context-engine[local]"`. This includes fastembed, which works offline with no external services.
2. **Use Ollama**: Start Ollama and run `ollama pull nomic-embed-text`. Then install CCE without `[local]`: `uv tool install code-context-engine`.

If you installed without `[local]` and don't have Ollama running, re-install with the extra:

```bash
uv tool install --force "code-context-engine[local]"
```

## Can I use CCE with multiple agents at once?

Yes. Run `cce init --agent all` to configure every supported agent. They all share the same index and MCP server, so there is no duplication or conflict.
33 changes: 17 additions & 16 deletions docs-src/src/content/docs/getting-started.md
@@ -17,22 +17,23 @@ description: Install CCE and start saving tokens in under a minute

## Install

```bash
uv tool install code-context-engine
```

Or with pipx:
CCE needs an embedding backend to index your code. Pick one:

```bash
pipx install code-context-engine
```
| Option | Install command | What it needs |
|--------|----------------|---------------|
| **Local (recommended)** | `uv tool install "code-context-engine[local]"` | Nothing else. Includes fastembed + ONNX Runtime (~60 MB download on first run). |
| **Ollama** | `uv tool install code-context-engine` | Ollama running at localhost:11434 with `nomic-embed-text` pulled. |

### Optional: Local embedding (no Ollama)
Using pipx instead of uv:

```bash
uv tool install "code-context-engine[local]" # includes fastembed + ONNX Runtime
pipx install "code-context-engine[local]"
```

:::caution
Installing without `[local]` and without Ollama running will cause `cce init` to fail with "No embedding backend available." Always pick one of the two options above.
:::

## Initialize your project

```bash
@@ -41,11 +42,11 @@ cce init
```

This does everything:
- Detects your embedding backend (Ollama or fastembed)
- Detects your embedding backend (fastembed or Ollama)
- Builds vector, FTS, and graph indexes
- Installs git hooks (auto-updates index on commit)
- Writes MCP config for detected editors
- Creates instruction files
- Creates instruction files with output compression rules

### Target a specific agent

@@ -79,12 +80,12 @@ cce savings

## Embedding backends

CCE auto-detects the best available backend:
CCE auto-detects the best available backend at init time:

1. **Ollama** (preferred) — If running at localhost:11434, uses `nomic-embed-text`. Zero extra dependencies.
2. **fastembed** — Install with `[local]` extra. Uses `BAAI/bge-small-en-v1.5`. Works offline, ~60 MB download.
1. **fastembed** (with `[local]` extra) — Uses `BAAI/bge-small-en-v1.5`. Works offline, no external services needed. ~60 MB model downloaded on first run.
2. **Ollama** — If running at localhost:11434 with `nomic-embed-text` pulled. Zero extra Python dependencies.

Set `CCE_EMBED_BACKEND=ollama` or `CCE_EMBED_BACKEND=fastembed` to force a specific backend.
Force a specific backend with `CCE_EMBED_BACKEND=fastembed` or `CCE_EMBED_BACKEND=ollama`.
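
The selection rules described here can be sketched in Python. This is an illustration only, not CCE's actual code: the function names are hypothetical, the priority order (forced value, then fastembed, then Ollama) is an assumption read off the list above, and the Ollama probe assumes the standard `/api/tags` endpoint of a local Ollama server.

```python
import importlib.util
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # address named in the docs above


def fastembed_is_installed():
    """True if the fastembed package (from the [local] extra) is importable."""
    return importlib.util.find_spec("fastembed") is not None


def ollama_is_running(base_url=OLLAMA_URL, timeout=0.5):
    """True if an Ollama server answers at base_url (assumes /api/tags)."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, refused connections, and timeouts
        return False


def choose_backend(env, has_fastembed, has_ollama):
    """Pick a backend name; hypothetical logic mirroring the docs above."""
    forced = env.get("CCE_EMBED_BACKEND")
    if forced in ("fastembed", "ollama"):
        return forced
    if has_fastembed:
        return "fastembed"
    if has_ollama:
        return "ollama"
    raise RuntimeError("No embedding backend available")


# Forcing Ollama wins even when fastembed is installed.
print(choose_backend({"CCE_EMBED_BACKEND": "ollama"}, has_fastembed=True, has_ollama=False))
```

In a real check you would pass `os.environ`, `fastembed_is_installed()`, and `ollama_is_running()` as the three arguments. Keeping the decision in a pure function makes it easy to test without a network.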

## Next steps

6 changes: 5 additions & 1 deletion docs-src/src/content/docs/index.mdx
@@ -10,7 +10,11 @@ hero:
link: /code-context-engine/guide/getting-started/
icon: right-arrow
variant: primary
- text: View on GitHub
- text: Why CCE?
link: /code-context-engine/guide/why-cce/
icon: information
variant: minimal
- text: GitHub
link: https://github.com/elara-labs/code-context-engine
icon: external
variant: minimal
2 changes: 1 addition & 1 deletion docs-src/src/content/docs/introduction.md
@@ -45,5 +45,5 @@ CCE parses your code into semantic chunks (functions, classes, modules) using Tr

1. **Index** — Tree-sitter parses code into semantic chunks. Stored locally with vector embeddings.
2. **Search** — Agent calls `context_search` via MCP. Hybrid vector + BM25 merged with Reciprocal Rank Fusion. Graph expansion adds related imports.
3. **Compress** — Chunks are compressed (truncation or LLM summary with Ollama). Output compression reduces reply tokens.
3. **Compress** — Chunks are compressed (truncation or LLM summary with Ollama). Session-wide output compression rules in instruction files reduce reply tokens (diff-only code, no filler).
4. **Track** — Every query recorded. `cce savings` shows tokens and dollars saved.
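
The Reciprocal Rank Fusion mentioned in the Search step can be sketched in a few lines. This is the textbook formulation with the commonly used constant `k = 60`, not necessarily how CCE implements it. The function name and chunk ids are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each item scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results from a vector search and a BM25 search.
vector_hits = ["auth.login", "auth.logout", "db.connect"]
bm25_hits = ["auth.login", "db.connect", "utils.hash"]

print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# ['auth.login', 'db.connect', 'auth.logout', 'utils.hash']
```

Chunks that rank well in both lists rise to the top, which is why `db.connect` overtakes `auth.logout` even though it is last in the vector results.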
26 changes: 17 additions & 9 deletions docs-src/src/content/docs/savings-tracking.md
@@ -30,28 +30,36 @@ Example output:

## Understanding the input/output split

Savings come from two independent stages:
The report separates input and output token savings because they are priced differently. Output tokens cost 5x as much as input tokens (e.g. Opus: $75/1M output vs $15/1M input).

- **Retrieval savings (input).** Instead of sending the entire codebase, CCE returns only the chunks relevant to the query. This is measured as: `1 - (served_tokens / full_codebase_tokens)`.
**Input savings** come from:

- **Compression savings (input).** The retrieved chunks are further compressed (truncation, summarization) before being sent to the agent. This is measured as: `1 - (compressed_tokens / raw_chunk_tokens)`.
- **Retrieval.** Only relevant chunks returned instead of full files (biggest contributor, often around 94%).
- **Chunk compression.** Chunks truncated to signatures/docstrings or summarized via Ollama.
- **Grammar compression.** Articles and filler removed from context.
- **Turn summarization.** Session history compressed.
- **Progressive disclosure.** Tool payloads filtered.

The combined effect is multiplicative. If retrieval cuts 90% and compression cuts another 50%, the total savings are 95%.
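
The arithmetic behind that example, as a small sketch. The token counts are made up to reproduce the 90% and 50% figures, and the helper names are illustrative rather than part of CCE.

```python
def savings(after_tokens, before_tokens):
    """Fraction of tokens removed by one stage: 1 - (after / before)."""
    return 1 - after_tokens / before_tokens


def combined_savings(*stage_savings):
    """Stages multiply: each one keeps (1 - s) of what the previous stage left."""
    remaining = 1.0
    for s in stage_savings:
        remaining *= 1 - s
    return 1 - remaining


# Hypothetical numbers: 100k-token codebase, 10k tokens retrieved, 5k after compression.
retrieval = savings(10_000, 100_000)   # 0.90
compression = savings(5_000, 10_000)   # 0.50
total = combined_savings(retrieval, compression)

print(f"{total:.0%}")  # 95%
```

The same total falls out of comparing the final 5k tokens directly against the 100k-token codebase: `1 - 5_000 / 100_000` is also 95%.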
**Output savings** come from:

- **Output compression.** Session-wide style directives written into instruction files (`CLAUDE.md`, `AGENTS.md`, etc.) during `cce init`. These tell the agent to use compressed prose and diff-only code changes across the entire session. Configure the level in `cce.yaml` (`compression.output`: off/lite/standard/max).

## Per-bucket breakdown

The `How:` line in the output shows the contribution of each stage:
The breakdown shows each savings layer with its contribution:

```
How: retrieval 93% + compression 90%
Breakdown:
retrieval 48% ▰▰▰▰▰▰▰▰▰▰ 6.0k $0.09 · 1 call
chunk compression 20% ▰▰▰▰▱▱▱▱▱▱ 2.6k $0.04 · 1 call
output compression* 2% ▰▱▱▱▱▱▱▱▱▱ 325 $0.02 · 1 call
```

- **retrieval** represents the savings from selecting only relevant chunks.
- **compression** represents the savings from compressing those chunks.
Each row uses the correct pricing (input rate for input buckets, output rate for the output compression bucket). Buckets marked with `*` use estimated values.
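
The dollar column can be reproduced with a short sketch, assuming the Opus example prices quoted earlier ($15 per 1M input tokens, $75 per 1M output tokens) and treating the rounded token counts in the sample as exact. The names below are illustrative, not CCE internals.

```python
# Example prices in dollars per million tokens (Opus figures quoted above).
PRICE_PER_MILLION = {"input": 15.00, "output": 75.00}


def dollars_saved(tokens_saved, kind):
    """Convert saved tokens to dollars using the input or output rate."""
    return tokens_saved * PRICE_PER_MILLION[kind] / 1_000_000


# (bucket name, tokens saved, which rate applies), taken from the sample breakdown.
buckets = [
    ("retrieval", 6_000, "input"),
    ("chunk compression", 2_600, "input"),
    ("output compression", 325, "output"),
]

for name, tokens, kind in buckets:
    print(f"{name:<20} {tokens:>6} tokens  ${dollars_saved(tokens, kind):.2f}")
```

With these assumptions the three rows come out to $0.09, $0.04, and $0.02, matching the sample. The output bucket saves far fewer tokens than chunk compression but still contributes about half as many dollars because of the higher rate.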

## Configuring the pricing model

Cost estimates use model-specific input pricing. Configure which model to estimate for:
Cost estimates use model-specific pricing for both input and output tokens. Configure which model to estimate for:

```yaml
# ~/.cce/config.yaml or .context-engine.yaml