elara-labs · rajkumarsakthivel · May 19, 2026
@@ -64,15 +64,17 @@
 
 ---
 
-## Quick start (3 lines)
+## Quick start
 
 ```bash
-uv tool install code-context-engine
+uv tool install "code-context-engine[local]"    # or: pipx install "code-context-engine[local]"
 cd /path/to/your/project
-cce init                              # or: cce init --agent all
+cce init                                        # or: cce init --agent all
 ```
 
-That's it. Your AI coding agent now searches your index instead of reading entire files. No config needed.
+That's it. Your AI coding agent now searches your index instead of reading entire files.
+
+> **Already have Ollama?** You can skip `[local]` and use `uv tool install code-context-engine` instead. CCE auto-detects Ollama at localhost:11434 and uses `nomic-embed-text`.
 
 ---
 
@@ -92,16 +94,18 @@ Tested on all three platforms in CI (macOS, Linux, Windows × Python 3.11/3.12/3
 
 ## Install and see savings in 60 seconds
 
-```bash
-uv tool install code-context-engine   # or: pipx install code-context-engine
-cd /path/to/your/project
-cce init                              # index, install hooks, register MCP server
-```
+You need an embedding backend to index code. Pick one:
 
-**Embedding backends:** CCE auto-detects the best available backend. If you have Ollama running, it uses `nomic-embed-text` with zero extra dependencies. For offline/local embedding without Ollama, install the `[local]` extra:
+| Option | Install command | Size | Requires |
+|--------|----------------|------|----------|
+| **Local (recommended)** | `uv tool install "code-context-engine[local]"` | +60 MB | Nothing else |
+| **Ollama** | `uv tool install code-context-engine` | Core only | Ollama running + `nomic-embed-text` pulled |
+
+Then:
 
 ```bash
-uv tool install "code-context-engine[local]"   # includes fastembed + ONNX Runtime
+cd /path/to/your/project
+cce init                              # index, install hooks, register MCP server
 ```
 
 Restart your editor. Done. Every question now hits the index instead of re-reading files.
@@ -449,16 +453,18 @@ No. Quality stays the same or slightly improves.
 
 CCE replaces "dump the entire file" with "search for the relevant function." The model still gets the code it needs (0.90 Recall@10 in benchmarks). Less irrelevant context means less noise competing for attention, which can improve the model's focus on your actual question.
 
-### How do I increase output token savings?
+### How do output token savings work?
+
+CCE writes output compression rules directly into your agent's instruction files (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`, etc.) during `cce init`. These rules apply to the **entire session**, not just CCE tool responses, so every reply from the agent follows them.
 
-Set the output compression level in your project config (`cce.yaml`):
+Set the level in `cce.yaml`:
 
 ```yaml
 compression:
   output: max       # off | lite | standard | max
 ```
 
-Or change it at runtime via the MCP tool:
+Then re-run `cce init` to update the instruction files. Or change it at runtime via the MCP tool:
 
 ```
 set_output_level output_level=max
@@ -471,7 +477,7 @@ set_output_level output_level=max
 | `standard` | ~70% | Drops articles, fragments, short synonyms + diff-only for code |
 | `max` | ~80% | Telegraphic style + diff-only for code |
 
-Default is `standard`. All levels include **code output rules** that instruct the model to show only changed lines (not full file rewrites), which is where most output tokens go in coding sessions. The `max` level produces very terse prose (similar to "caveman mode"). Code blocks, paths, and commands are never compressed regardless of level.
+Default is `standard`. All levels include **code output rules** that tell the model to show only changed lines (not full file rewrites), which is where most output tokens go in coding sessions. The `max` level produces very terse prose (similar to "caveman mode"). Code blocks, paths, and commands are never compressed regardless of level.
 
 ### Where do the savings come from?
 

@@ -14,6 +14,7 @@ export default defineConfig({
       ],
       sidebar: [
         { slug: 'introduction' },
+        { slug: 'why-cce' },
         { slug: 'getting-started' },
         {
           label: 'Agent Setup',

@@ -57,22 +57,21 @@ Controls how much CCE compresses code chunks before including them in the agent'
 
 ### Output compression (`compression.output`)
 
-Controls how verbose the agent's responses are. Set via the `set_output_compression` MCP tool or config.
+Controls how verbose the agent's responses are. During `cce init`, the configured level is written into instruction files (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`, etc.) so it applies to the **entire session**, not just CCE tool responses.
 
 | Level | Style | Typical savings |
 |-------|-------|----------------|
 | `off` | Full output | 0% |
-| `lite` | Removes filler and hedging | ~30% |
-| `standard` | Shorter phrasing, fragments where possible | ~65% |
-| `max` | Telegraphic, minimal prose | ~75% |
+| `lite` | No filler/hedging, diff-only code | ~25% |
+| `standard` | Fragments, short synonyms, diff-only code | ~70% |
+| `max` | Telegraphic, abbreviations, diff-only code | ~80% |
 
-Code blocks, file paths, commands, and error messages are never compressed regardless of level.
+All levels include code output rules: show only changed lines, never rewrite entire files, never echo back unchanged code. Code blocks, paths, commands, and error messages are never compressed. Security warnings are always written out in full.
 
-Change at runtime by telling your agent:
+Change the level and re-run `cce init` to update the instruction files, or change it at runtime:
 
 ```
-Switch to max output compression
-Turn off output compression
+set_output_level output_level=max
 ```
 
 ## Embedding model

@@ -7,24 +7,45 @@ description: Frequently asked questions about Code Context Engine.
 
 No. CCE returns the same code your agent would find by reading files, just compressed and targeted. In practice, answers are often better because the agent receives focused, relevant context instead of entire files full of unrelated code.
 
-## How can I increase output savings?
+## How do output token savings work?
 
-Set output compression to a higher level:
+CCE writes output compression rules directly into your agent's instruction files (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`, etc.) during `cce init`. These apply to the **entire session**, so every response follows them.
+
+Set the level in `cce.yaml`, then re-run `cce init`:
 
 ```yaml
 compression:
-  output: max
+  output: max       # off | lite | standard | max
+```
+
+Or change it at runtime via the MCP tool:
+
 ```
+set_output_level output_level=max
+```
+
+| Level | Savings | Style |
+|-------|---------|-------|
+| `off` | 0% | Normal verbosity |
+| `lite` | ~25% | No filler/hedging, diff-only code |
+| `standard` | ~70% | Fragments, short synonyms, diff-only code |
+| `max` | ~80% | Telegraphic, abbreviations, diff-only code |
 
-Or tell your agent at runtime: "Switch to max output compression." The `max` level uses telegraphic phrasing and typically saves ~75% on response tokens. Code blocks and file paths are never affected.
+Default is `standard`. All levels include code output rules that tell the model to show only changed lines instead of full file rewrites. Code blocks, paths, and commands are never compressed. Security warnings are always written out in full.
 
 ## Where do the savings come from?
 
-Three main sources:
+**Input tokens** (what goes into the model):
 
-1. **Retrieval.** Only relevant chunks are returned instead of the full codebase. This is the largest contributor (often 80%+ reduction).
+1. **Retrieval.** Only relevant chunks are returned instead of the full codebase. This is the largest contributor (often 94% reduction).
 2. **Chunk compression.** Retrieved chunks are truncated to signatures and docstrings, or summarized via Ollama if available.
-3. **Output compression.** Agent responses are shortened by removing filler, hedging, and verbose phrasing.
+3. **Grammar compression.** Articles and filler removed from context.
+4. **Turn summarization.** Session history compressed.
+5. **Progressive disclosure.** Tool payloads filtered.
+
+**Output tokens** (what comes back from the model):
+
+6. **Output compression.** Session-wide style directives in instruction files reduce prose verbosity and enforce diff-only code changes. Output tokens cost 5x as much as input tokens (e.g. Opus: $75/1M vs $15/1M), so even moderate output savings have an outsized cost impact.
 
 ## Is my code sent anywhere?
 
@@ -57,6 +78,19 @@ CCE uses Tree-sitter for structural parsing. The following languages have full A
 
 Other file types (YAML, Markdown, config files, etc.) are indexed using line-based chunking. They still appear in search results but without function-level granularity.
 
+## Why does `cce init` fail with "No embedding backend available"?
+
+CCE needs an embedding backend to convert code into searchable vectors. You have two options:
+
+1. **Install with `[local]` extra** (recommended): `uv tool install "code-context-engine[local]"`. This includes fastembed, which works offline with no external services.
+2. **Use Ollama**: Start Ollama and run `ollama pull nomic-embed-text`. Then install CCE without `[local]`: `uv tool install code-context-engine`.
+
+If you installed without `[local]` and don't have Ollama running, re-install with the extra:
+
+```bash
+uv tool install --force "code-context-engine[local]"
+```
+
 ## Can I use CCE with multiple agents at once?
 
 Yes. Run `cce init --agent all` to configure every supported agent. They all share the same index and MCP server, so there is no duplication or conflict.

@@ -17,22 +17,23 @@ description: Install CCE and start saving tokens in under a minute
 
 ## Install
 
-```bash
-uv tool install code-context-engine
-```
-
-Or with pipx:
+CCE needs an embedding backend to index your code. Pick one:
 
-```bash
-pipx install code-context-engine
-```
+| Option | Install command | What it needs |
+|--------|----------------|---------------|
+| **Local (recommended)** | `uv tool install "code-context-engine[local]"` | Nothing else. Includes fastembed + ONNX Runtime (~60 MB download on first run). |
+| **Ollama** | `uv tool install code-context-engine` | Ollama running at localhost:11434 with `nomic-embed-text` pulled. |
 
-### Optional: Local embedding (no Ollama)
+Using pipx instead of uv:
 
 ```bash
-uv tool install "code-context-engine[local]"   # includes fastembed + ONNX Runtime
+pipx install "code-context-engine[local]"
 ```
 
+:::caution
+Installing without `[local]` and without Ollama running will cause `cce init` to fail with "No embedding backend available." Always pick one of the two options above.
+:::
+
 ## Initialize your project
 
 ```bash
@@ -41,11 +42,11 @@ cce init
 ```
 
 This does everything:
-- Detects your embedding backend (Ollama or fastembed)
+- Detects your embedding backend (fastembed or Ollama)
 - Builds vector, FTS, and graph indexes
 - Installs git hooks (auto-updates index on commit)
 - Writes MCP config for detected editors
-- Creates instruction files
+- Creates instruction files with output compression rules
 
 ### Target a specific agent
 
@@ -79,12 +80,12 @@ cce savings
 
 ## Embedding backends
 
-CCE auto-detects the best available backend:
+CCE auto-detects the best available backend at init time:
 
-1. **Ollama** (preferred) — If running at localhost:11434, uses `nomic-embed-text`. Zero extra dependencies.
-2. **fastembed** — Install with `[local]` extra. Uses `BAAI/bge-small-en-v1.5`. Works offline, ~60 MB download.
+1. **fastembed** (with `[local]` extra) — Uses `BAAI/bge-small-en-v1.5`. Works offline, no external services needed. ~60 MB model downloaded on first run.
+2. **Ollama** — If running at localhost:11434 with `nomic-embed-text` pulled. Zero extra Python dependencies.
 
-Set `CCE_EMBED_BACKEND=ollama` or `CCE_EMBED_BACKEND=fastembed` to force a specific backend.
+Force a specific backend with `CCE_EMBED_BACKEND=fastembed` or `CCE_EMBED_BACKEND=ollama`.
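
Reviewer sketch (not part of the patch): the selection rules above could be modelled roughly as follows. This is a hypothetical illustration, not CCE's actual implementation. The function names `pick_backend` and `ollama_reachable` are invented here, the fastembed-before-Ollama priority is inferred from the list order, and the probe only checks that the port accepts connections, not that `nomic-embed-text` has been pulled.

```python
import importlib.util
import os
import socket


def ollama_reachable(host="localhost", port=11434, timeout=0.5):
    """Return True if something accepts TCP connections on the Ollama port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_backend(env=None, has_fastembed=None, has_ollama=None):
    """Pick an embedding backend name (hypothetical model of the documented rules)."""
    env = os.environ if env is None else env

    # An explicit override wins over auto-detection.
    forced = env.get("CCE_EMBED_BACKEND")
    if forced in ("fastembed", "ollama"):
        return forced

    # Assumed priority: fastembed (the [local] extra) first, then Ollama.
    if has_fastembed is None:
        has_fastembed = importlib.util.find_spec("fastembed") is not None
    if has_fastembed:
        return "fastembed"

    # Probe the network only when fastembed is not installed.
    if has_ollama is None:
        has_ollama = ollama_reachable()
    if has_ollama:
        return "ollama"

    raise RuntimeError("No embedding backend available")
```

The `has_fastembed` and `has_ollama` parameters exist only so the decision logic can be exercised without touching the local environment, for example `pick_backend(env={}, has_fastembed=False, has_ollama=True)`.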
 
 ## Next steps
 

@@ -10,7 +10,11 @@ hero:
       link: /code-context-engine/guide/getting-started/
       icon: right-arrow
       variant: primary
-    - text: View on GitHub
+    - text: Why CCE?
+      link: /code-context-engine/guide/why-cce/
+      icon: information
+      variant: minimal
+    - text: GitHub
       link: https://github.com/elara-labs/code-context-engine
       icon: external
       variant: minimal

@@ -45,5 +45,5 @@ CCE parses your code into semantic chunks (functions, classes, modules) using Tr
 
 1. **Index** — Tree-sitter parses code into semantic chunks. Stored locally with vector embeddings.
 2. **Search** — Agent calls `context_search` via MCP. Hybrid vector + BM25 merged with Reciprocal Rank Fusion. Graph expansion adds related imports.
-3. **Compress** — Chunks are compressed (truncation or LLM summary with Ollama). Output compression reduces reply tokens.
+3. **Compress** — Chunks are compressed (truncation or LLM summary with Ollama). Session-wide output compression rules in instruction files reduce reply tokens (diff-only code, no filler).
 4. **Track** — Every query recorded. `cce savings` shows tokens and dollars saved.
@@ -30,28 +30,36 @@ Example output:
 
 ## Understanding the input/output split
 
-Savings come from two independent stages:
+The report separates input and output token savings because they are priced differently. Output tokens cost 5x as much as input tokens (e.g. Opus: $75/1M output vs $15/1M input).
 
-- **Retrieval savings (input).** Instead of sending the entire codebase, CCE returns only the chunks relevant to the query. This is measured as: `1 - (served_tokens / full_codebase_tokens)`.
+**Input savings** come from:
 
-- **Compression savings (input).** The retrieved chunks are further compressed (truncation, summarization) before being sent to the agent. This is measured as: `1 - (compressed_tokens / raw_chunk_tokens)`.
+- **Retrieval.** Only relevant chunks returned instead of full files (biggest contributor, often 94%).
+- **Chunk compression.** Chunks truncated to signatures/docstrings or summarized via Ollama.
+- **Grammar compression.** Articles and filler removed from context.
+- **Turn summarization.** Session history compressed.
+- **Progressive disclosure.** Tool payloads filtered.
 
-The combined effect is multiplicative. If retrieval cuts 90% and compression cuts another 50%, the total savings are 95%.
+**Output savings** come from:
+
+- **Output compression.** Session-wide style directives written into instruction files (`CLAUDE.md`, `AGENTS.md`, etc.) during `cce init`. These tell the agent to use compressed prose and diff-only code changes across the entire session. Configure the level in `cce.yaml` (`compression.output`: off/lite/standard/max).
 
 ## Per-bucket breakdown
 
-The `How:` line in the output shows the contribution of each stage:
+The breakdown shows each savings layer with its contribution:
 
 ```
-How:  retrieval 93%  +  compression 90%
+  Breakdown:
+    retrieval              48%  ▰▰▰▰▰▰▰▰▰▰    6.0k    $0.09 · 1 call
+    chunk compression      20%  ▰▰▰▰▱▱▱▱▱▱    2.6k    $0.04 · 1 call
+    output compression*     2%  ▰▱▱▱▱▱▱▱▱▱     325    $0.02 · 1 call
 ```
 
-- **retrieval** represents the savings from selecting only relevant chunks.
-- **compression** represents the savings from compressing those chunks.
+Each row uses the correct pricing (input rate for input buckets, output rate for the output compression bucket). Buckets marked with `*` use estimated values.
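
Reviewer sketch (not part of the patch): the dollar figures in the sample breakdown can be checked by hand. The snippet below assumes the sample was generated with the Opus rates quoted earlier ($15/1M input, $75/1M output), which the docs do not state explicitly, and treats the rounded `6.0k` and `2.6k` token counts as exact. The helper name `dollars_saved` is made up for this example.

```python
# Opus prices quoted in the docs, in dollars per 1M tokens (assumed to apply here).
INPUT_RATE = 15.0
OUTPUT_RATE = 75.0


def dollars_saved(tokens_saved, rate_per_million):
    """Convert a count of saved tokens into dollars at a per-million-token rate."""
    return tokens_saved * rate_per_million / 1_000_000


# (bucket, tokens saved, rate): input buckets use the input rate,
# the output compression bucket uses the output rate.
rows = [
    ("retrieval", 6000, INPUT_RATE),
    ("chunk compression", 2600, INPUT_RATE),
    ("output compression", 325, OUTPUT_RATE),
]

for name, tokens, rate in rows:
    print(f"{name:<20} {tokens:>6} tokens  ${dollars_saved(tokens, rate):.2f}")
```

Under those assumptions the three rows come out to $0.09, $0.04, and $0.02, matching the sample output. The 325 output tokens are worth about half as much as the 2,600 chunk-compression tokens despite being an eighth of the volume.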
 
 ## Configuring the pricing model
 
-Cost estimates use model-specific input pricing. Configure which model to estimate for:
+Cost estimates use model-specific pricing for both input and output tokens. Configure which model to estimate for:
 
 ```yaml
 # ~/.cce/config.yaml or .context-engine.yaml