How Canopy Works

Your repo (files on disk)
▼ canopy index .
┌───────────────────────────────────┐
│ Indexing pipeline                 │
│                                   │
│ 1. File scanner                   │
│    Walk repo tree, apply          │
│    .gitignore + language filters  │
│                                   │
│ 2. tree-sitter AST parser         │
│    Parse each source file into    │
│    a concrete syntax tree         │
│                                   │
│ 3. Symbol extractor               │
│    Functions, classes, types,     │
│    interfaces, imports, exports   │
│                                   │
│ 4. Dependency edge builder        │
│    Resolve imports → file paths   │
│    Build directed graph in SQLite │
│                                   │
│ 5. (optional) Tantivy indexer     │
│    Full-text search over code     │
│    chunks with camelCase-aware    │
│    tokenization                   │
│                                   │
│ 6. (optional) Git ingester        │
│    Commit history + blame data    │
│    into SQLite                    │
└───────────────────────────────────┘
▼ stored locally
┌───────────────────┐
│ ~/.canopy/<id>/   │
│                   │
│ forge.db          │ ← AST graph, symbols, imports, health cache
│ search/           │ ← Tantivy full-text index
│ git.db            │ ← Commit history (when --with-git)
└───────────────────┘
▼ canopy serve .
┌───────────────────────────────────┐
│ MCP stdio server                  │
│                                   │
│ JSON-RPC 2.0 over stdin/stdout    │
│                                   │
│ On connect: inject server         │
│ instructions into agent context   │
│                                   │
│ On tool call: query index,        │
│ return structured JSON            │
└───────────────────────────────────┘
▼ MCP tools
┌────────────────────────────────────────┐
│ AI agent                               │
│ (Claude Code, Cursor, Windsurf, ...)   │
│                                        │
│ canopy_prepare → plan refactors        │
│ canopy_search → find code by concept   │
│ canopy_trace_dependents → find callers │
│ canopy_health_check → find problems    │
│ ... (18 more tools)                    │
└────────────────────────────────────────┘

canopy index . runs the indexing pipeline. It’s designed to be run incrementally — on first run it processes every file; on subsequent runs it checks modification times and only re-processes changed files.

Canopy uses tree-sitter grammars to parse source files into concrete syntax trees (CSTs). From each CST, Canopy extracts:

  • Functions and methods — name, line range, parameter list, return type (where inferrable)
  • Classes and structs — name, implemented interfaces, parent class
  • TypeScript/Rust types and interfaces — name, field list
  • Import statements — the import path, what’s being imported (named vs default vs namespace)
  • Export statements — what’s being exported, whether it’s a re-export

Every extracted symbol is stored in SQLite with its file path, line number, and language. Every import edge is stored as a directed edge: (source_file, target_file, specifier).
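Storing imports as directed edges makes "who imports this file?" a reverse lookup. A minimal in-memory sketch of that lookup (file names are hypothetical, and this is illustrative rather than Canopy's actual code):

```python
# Each edge is (source_file, target_file, specifier), as described above.
edges = [
    ("src/app.ts", "src/payments.ts", "./payments"),
    ("src/api.ts", "src/payments.ts", "./payments"),
    ("src/payments.ts", "src/db.ts", "./db"),
]

def dependents(target: str) -> list[str]:
    """Files that import `target` directly (reverse edge lookup)."""
    return sorted(src for src, _dst, _spec in edges if _dst == target)

print(dependents("src/payments.ts"))  # ['src/api.ts', 'src/app.ts']
```

In the real index the same lookup is a SQL query over the imports table rather than a list scan, but the edge shape is the same.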

Files in languages Canopy doesn’t have tree-sitter grammars for — Ruby, Java, C, C++, Swift, Kotlin, etc. — are still indexed by the full-text search layer (if --with-search was used) and appear in file-level search results. They just don’t contribute to the dependency graph or symbol table.

canopy index . is incremental by default. It reads the last-modified timestamp of each file and skips unchanged files. This makes daily-use re-indexing fast (typically seconds for a repo with <10 changed files).
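The skip logic amounts to comparing a file's on-disk mtime against the one recorded at the last index run. A hedged sketch, with illustrative table and column names rather than Canopy's actual schema:

```python
import os
import sqlite3
import tempfile

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (path TEXT PRIMARY KEY, mtime REAL)")

def needs_reindex(path: str) -> bool:
    """True if the file is new or modified since the last index run."""
    row = db.execute("SELECT mtime FROM files WHERE path = ?", (path,)).fetchone()
    return row is None or os.path.getmtime(path) > row[0]

def record_indexed(path: str) -> None:
    """Remember the mtime we indexed at."""
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)",
               (path, os.path.getmtime(path)))

# Demo with a throwaway file:
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
assert needs_reindex(path)      # never seen before
record_indexed(path)
assert not needs_reindex(path)  # unchanged since last run
```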

canopy index . --full wipes the SQLite database and re-indexes from scratch. Use this when you’ve changed Canopy’s config, added a new language, or suspect the index has drifted from the actual files.

Canopy uses three storage layers:

forge.db is the primary store, a single SQLite database. It contains:

  • files — every indexed file path, language, last-modified time
  • symbols — every extracted function/class/type with file + line
  • imports — dependency edges: source_file → target_file
  • exports — named exports per file
  • health_findings — cached results from the last canopy health run
  • heartbeat — license validation cache (added v1.3.0)

SQLite was chosen because it’s zero-dependency, handles concurrent readers, and gives Canopy’s dependency graph queries predictable performance.
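As a sketch of how the core tables might fit together (the table names come from the list above; the column layouts are illustrative, not Canopy's actual schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE files   (path TEXT PRIMARY KEY, language TEXT, mtime REAL);
CREATE TABLE symbols (name TEXT, kind TEXT, file TEXT, line INTEGER);
CREATE TABLE imports (source_file TEXT, target_file TEXT, specifier TEXT);
CREATE TABLE exports (file TEXT, name TEXT, is_reexport INTEGER);
""")

# A symbol row: name, kind, file path, line number.
db.execute("INSERT INTO symbols VALUES (?, ?, ?, ?)",
           ("processPayment", "function", "src/payments.ts", 42))

row = db.execute("SELECT file, line FROM symbols WHERE name = ?",
                 ("processPayment",)).fetchone()
print(row)  # ('src/payments.ts', 42)
```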

Tantivy is a Rust-native full-text search library (similar to Lucene). When you run --with-search, Canopy chunks each source file into function/class/block segments and indexes them. The tokenizer is camelCase-aware: searching for payment matches processPayment, PaymentService, and payment_handler without wildcards.

The search index is built only on request because it adds ~10–30 seconds to the initial indexing run and costs extra disk space (the search index is roughly 20% of repo size).
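The camelCase-aware matching can be approximated with a small identifier splitter. This is a sketch of the idea only; Tantivy's actual tokenizer differs in the details:

```python
import re

def camel_tokens(identifier: str) -> list[str]:
    """Split an identifier on underscores and camelCase humps."""
    words = []
    for part in re.split(r"_+", identifier):
        # lowercase runs, capitalized words, acronyms, and digit runs
        words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [w.lower() for w in words]

print(camel_tokens("processPayment"))   # ['process', 'payment']
print(camel_tokens("PaymentService"))   # ['payment', 'service']
print(camel_tokens("payment_handler"))  # ['payment', 'handler']
```

Indexing these lowercase word tokens is what lets a query for payment hit all three identifiers without wildcards.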

When --with-git is used, Canopy reads the git object database with the gix crate and stores commit metadata and blame records in SQLite. This powers:

  • canopy_git_history — last N commits touching a file
  • canopy_git_blame — per-line author and commit for a file range
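The kind of query canopy_git_history implies can be sketched against a hypothetical commits table (the real git.db schema may differ):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE commits (sha TEXT, author TEXT, ts INTEGER, file TEXT)")
db.executemany("INSERT INTO commits VALUES (?, ?, ?, ?)", [
    ("a1b2c3", "alice", 1700000000, "src/payments.ts"),
    ("d4e5f6", "bob",   1700090000, "src/payments.ts"),
    ("0a1b2c", "alice", 1700050000, "src/db.ts"),
])

# "Last N commits touching a file", newest first.
rows = db.execute(
    "SELECT sha, author FROM commits WHERE file = ? ORDER BY ts DESC LIMIT ?",
    ("src/payments.ts", 2)).fetchall()
print(rows)  # [('d4e5f6', 'bob'), ('a1b2c3', 'alice')]
```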

canopy serve . starts the MCP server. It listens on stdin and writes to stdout using JSON-RPC 2.0. The MCP client (Claude Code, etc.) launches Canopy as a subprocess via the config in .mcp.json.
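A minimal illustration of the JSON-RPC 2.0 request/response shapes involved (MCP layers its own methods, such as initialize and tools/call, on top of this; the dispatch here is a stand-in):

```python
import json

def handle(raw: str) -> str:
    """Answer one JSON-RPC 2.0 request with a matching-id response."""
    req = json.loads(raw)
    result = {"echo": req.get("params")}  # stand-in for real tool dispatch
    resp = {"jsonrpc": "2.0", "id": req["id"], "result": result}
    return json.dumps(resp)

request = json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "canopy_search", "arguments": {"query": "payment"}},
})
print(handle(request))
```

In the stdio transport, requests like this arrive on stdin and responses are written to stdout, which is why the client can simply launch the server as a subprocess.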

At MCP connection time (the initialize handshake), Canopy injects a block of behavioral instructions into the agent’s system prompt. These instructions teach the agent:

  • Call canopy_prepare before modifying any file
  • Call canopy_validate after edits are complete
  • Use canopy_understand when encountering unfamiliar code
  • Prefer specific tools for targeted lookups and workflow composites for typical multi-step tasks

This is why Canopy’s behavior in Claude Code is automatic without user prompting. The agent learns the correct usage pattern from Canopy’s own server instructions at the start of every session.

When the agent calls a tool, Canopy:

  1. Validates the input against the tool’s JSON Schema
  2. Runs the appropriate query against SQLite (and/or Tantivy)
  3. Returns a structured JSON result
  4. Logs the call to ~/.canopy/stats.json (local only, never transmitted)

All queries run in-process — there are no sub-processes or HTTP calls during normal tool operation.
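Steps 1–3 of the tool-call flow can be sketched as follows (the tool name comes from the docs; the schema check and result fields are hypothetical simplifications, and real input validation uses full JSON Schema):

```python
def call_tool(name: str, args: dict) -> dict:
    """Validate input, run the query, return a structured result."""
    schema = {"query": str}  # stand-in for the tool's JSON Schema
    for key, typ in schema.items():
        if key not in args or not isinstance(args[key], typ):
            return {"error": f"invalid argument: {key}"}
    # Stand-in for the in-process SQLite/Tantivy query:
    hits = [{"file": "src/payments.ts", "line": 42}]
    return {"tool": name, "results": hits}

print(call_tool("canopy_search", {"query": "payment"}))
print(call_tool("canopy_search", {}))  # fails validation
```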

Every architecture decision in Canopy optimizes for local-first operation:

  • SQLite over a network database — no latency, no authentication, no network dependency for queries
  • Static binary with bundled runtime — no Node.js, no Python, no JVM to install or version-manage
  • Incremental index — re-indexing fits into a normal development loop without blocking the agent

The tradeoff is that Canopy doesn’t sync across machines by default. Each machine has its own index. Team tier adds CI cache support for sharing indexes across CI runs.