How Canopy Works

Your repo (files on disk)
▼ canopy index .
┌───────────────────────────────────┐
│ Indexing pipeline                 │
│                                   │
│ 1. File scanner                   │
│    Walk repo tree, apply          │
│    .gitignore + language filters  │
│                                   │
│ 2. tree-sitter AST parser         │
│    Parse each source file into    │
│    a concrete syntax tree         │
│                                   │
│ 3. Symbol extractor               │
│    Functions, classes, types,     │
│    interfaces, imports, exports   │
│                                   │
│ 4. Dependency edge builder        │
│    Resolve imports → file paths   │
│    Build directed graph in SQLite │
│                                   │
│ 5. (optional) Tantivy indexer     │
│    Full-text search over code     │
│    chunks with camelCase-aware    │
│    tokenization                   │
│                                   │
│ 6. (optional) Git ingester        │
│    Commit history + blame data    │
│    into SQLite                    │
└───────────────────────────────────┘
▼ stored locally
┌───────────────────┐
│ ~/.canopy/<id>/   │
│                   │
│ forge.db          │ ← AST graph, symbols, imports, health cache
│ search/           │ ← Tantivy full-text index
│ git.db            │ ← Commit history (when --with-git)
└───────────────────┘
▼ canopy serve .
┌───────────────────────────────────┐
│ MCP stdio server                  │
│                                   │
│ JSON-RPC 2.0 over stdin/stdout    │
│                                   │
│ On connect: inject server         │
│ instructions into agent context   │
│                                   │
│ On tool call: query index,        │
│ return structured JSON            │
└───────────────────────────────────┘
▼ MCP tools
┌────────────────────────────────────────┐
│ AI agent                               │
│ (Claude Code, Cursor, Windsurf, ...)   │
│                                        │
│ canopy_prepare → plan refactors        │
│ canopy_search → find code by concept   │
│ canopy_trace_dependents → find callers │
│ canopy_health_check → find problems    │
│ ... (18 more tools)                    │
└────────────────────────────────────────┘

canopy index . runs the indexing pipeline. It’s designed to be run incrementally — on first run it processes every file; on subsequent runs it checks modification times and only re-processes changed files.

Canopy uses tree-sitter grammars to parse source files into concrete syntax trees (CSTs). From each CST, Canopy extracts:

  • Functions and methods — name, line range, parameter list, return type (where inferrable)
  • Classes and structs — name, implemented interfaces, parent class
  • TypeScript/Rust types and interfaces — name, field list
  • Import statements — the import path, what’s being imported (named vs default vs namespace)
  • Export statements — what’s being exported, whether it’s a re-export

Every extracted symbol is stored in SQLite with its file path, line number, and language. Every import edge is stored as a directed edge: (source_file, target_file, specifier).
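Storing imports as directed edges makes "who imports this file?" a reverse lookup. A minimal in-memory sketch of that lookup (file names are hypothetical, and this is illustrative rather than Canopy's actual code):

```python
# Each edge is (source_file, target_file, specifier), as described above.
edges = [
    ("src/app.ts", "src/payments.ts", "./payments"),
    ("src/api.ts", "src/payments.ts", "./payments"),
    ("src/payments.ts", "src/db.ts", "./db"),
]

def dependents(target: str) -> list[str]:
    """Files that import `target` directly (reverse edge lookup)."""
    return sorted(src for src, _dst, _spec in edges if _dst == target)

print(dependents("src/payments.ts"))  # ['src/api.ts', 'src/app.ts']
```

In the real index the same lookup is a SQL query over the imports table rather than a list scan, but the edge shape is the same.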

Files in languages Canopy doesn’t have tree-sitter grammars for — Ruby, Java, C, C++, Swift, Kotlin, etc. — are still indexed by the full-text search layer (if --with-search was used) and appear in file-level search results. They just don’t contribute to the dependency graph or symbol table.

canopy index . is incremental by default. It reads the last-modified timestamp of each file and skips unchanged files. This makes daily-use re-indexing fast (typically seconds for a repo with <10 changed files).
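The skip logic amounts to comparing a file's on-disk mtime against the one recorded at the last index run. A hedged sketch, with illustrative table and column names rather than Canopy's actual schema:

```python
import os
import sqlite3
import tempfile

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (path TEXT PRIMARY KEY, mtime REAL)")

def needs_reindex(path: str) -> bool:
    """True if the file is new or modified since the last index run."""
    row = db.execute("SELECT mtime FROM files WHERE path = ?", (path,)).fetchone()
    return row is None or os.path.getmtime(path) > row[0]

def record_indexed(path: str) -> None:
    """Remember the mtime we indexed at."""
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)",
               (path, os.path.getmtime(path)))

# Demo with a throwaway file:
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
assert needs_reindex(path)      # never seen before
record_indexed(path)
assert not needs_reindex(path)  # unchanged since last run
```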

canopy index . --full wipes the SQLite database and re-indexes from scratch. Use this when you’ve changed Canopy’s config, added a new language, or suspect the index has drifted from the actual files.

Canopy uses three storage layers:

forge.db is the primary store, a single SQLite database. It contains:

  • files — every indexed file path, language, last-modified time
  • symbols — every extracted function/class/type with file + line
  • imports — dependency edges: source_file → target_file
  • exports — named exports per file
  • health_findings — cached results from the last canopy health run
  • heartbeat — license validation cache (added v1.3.0)

SQLite was chosen because it’s zero-dependency, handles concurrent readers, and gives Canopy’s dependency graph queries predictable performance.
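As a sketch of how the core tables might fit together (the table names come from the list above; the column layouts are illustrative, not Canopy's actual schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE files   (path TEXT PRIMARY KEY, language TEXT, mtime REAL);
CREATE TABLE symbols (name TEXT, kind TEXT, file TEXT, line INTEGER);
CREATE TABLE imports (source_file TEXT, target_file TEXT, specifier TEXT);
CREATE TABLE exports (file TEXT, name TEXT, is_reexport INTEGER);
""")

# A symbol row: name, kind, file path, line number.
db.execute("INSERT INTO symbols VALUES (?, ?, ?, ?)",
           ("processPayment", "function", "src/payments.ts", 42))

row = db.execute("SELECT file, line FROM symbols WHERE name = ?",
                 ("processPayment",)).fetchone()
print(row)  # ('src/payments.ts', 42)
```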

Tantivy is a Rust-native full-text search library (similar to Lucene). When you run --with-search, Canopy chunks each source file into function/class/block segments and indexes them. The tokenizer is camelCase-aware: searching for payment matches processPayment, PaymentService, and payment_handler without wildcards.

The search index is built only on request because it adds ~10–30 seconds to the initial indexing run and costs extra disk space (the search index is roughly 20% of repo size).
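The camelCase-aware matching can be approximated with a small identifier splitter. This is a sketch of the idea only; Tantivy's actual tokenizer differs in the details:

```python
import re

def camel_tokens(identifier: str) -> list[str]:
    """Split an identifier on underscores and camelCase humps."""
    words = []
    for part in re.split(r"_+", identifier):
        # lowercase runs, capitalized words, acronyms, and digit runs
        words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
    return [w.lower() for w in words]

print(camel_tokens("processPayment"))   # ['process', 'payment']
print(camel_tokens("PaymentService"))   # ['payment', 'service']
print(camel_tokens("payment_handler"))  # ['payment', 'handler']
```

Indexing these lowercase word tokens is what lets a query for payment hit all three identifiers without wildcards.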

When --with-git is used, Canopy reads the git object database with the gix crate and stores commit metadata and blame records in SQLite. This powers:

  • canopy_git_history — last N commits touching a file
  • canopy_git_blame — per-line author and commit for a file range
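The kind of query canopy_git_history implies can be sketched against a hypothetical commits table (the real git.db schema may differ):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE commits (sha TEXT, author TEXT, ts INTEGER, file TEXT)")
db.executemany("INSERT INTO commits VALUES (?, ?, ?, ?)", [
    ("a1b2c3", "alice", 1700000000, "src/payments.ts"),
    ("d4e5f6", "bob",   1700090000, "src/payments.ts"),
    ("0a1b2c", "alice", 1700050000, "src/db.ts"),
])

# "Last N commits touching a file", newest first.
rows = db.execute(
    "SELECT sha, author FROM commits WHERE file = ? ORDER BY ts DESC LIMIT ?",
    ("src/payments.ts", 2)).fetchall()
print(rows)  # [('d4e5f6', 'bob'), ('a1b2c3', 'alice')]
```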

canopy serve . starts the MCP server. It listens on stdin and writes to stdout using JSON-RPC 2.0. The MCP client (Claude Code, etc.) launches Canopy as a subprocess via the config in .mcp.json.
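A minimal illustration of the JSON-RPC 2.0 request/response shapes involved (MCP layers its own methods, such as initialize and tools/call, on top of this; the dispatch here is a stand-in):

```python
import json

def handle(raw: str) -> str:
    """Answer one JSON-RPC 2.0 request with a matching-id response."""
    req = json.loads(raw)
    result = {"echo": req.get("params")}  # stand-in for real tool dispatch
    resp = {"jsonrpc": "2.0", "id": req["id"], "result": result}
    return json.dumps(resp)

request = json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "canopy_search", "arguments": {"query": "payment"}},
})
print(handle(request))
```

In the stdio transport, requests like this arrive on stdin and responses are written to stdout, which is why the client can simply launch the server as a subprocess.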

At MCP connection time (the initialize handshake), Canopy injects a block of behavioral instructions into the agent’s system prompt. These instructions teach the agent:

  • Call canopy_prepare before modifying any file
  • Call canopy_validate after edits are complete
  • Use canopy_understand when encountering unfamiliar code
  • Prefer specific tools for targeted lookups and workflow composites for typical multi-step tasks

This is why Canopy’s behavior in Claude Code is automatic without user prompting. The agent learns the correct usage pattern from Canopy’s own server instructions at the start of every session.

When the agent calls a tool, Canopy:

  1. Validates the input against the tool’s JSON Schema
  2. Runs the appropriate query against SQLite (and/or Tantivy)
  3. Returns a structured JSON result
  4. Logs the call to ~/.canopy/stats.json (local only, never transmitted)

All queries run in-process — there are no sub-processes or HTTP calls during normal tool operation.
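Steps 1–3 of the tool-call flow can be sketched as follows (the tool name comes from the docs; the schema check and result fields are hypothetical simplifications, and real input validation uses full JSON Schema):

```python
def call_tool(name: str, args: dict) -> dict:
    """Validate input, run the query, return a structured result."""
    schema = {"query": str}  # stand-in for the tool's JSON Schema
    for key, typ in schema.items():
        if key not in args or not isinstance(args[key], typ):
            return {"error": f"invalid argument: {key}"}
    # Stand-in for the in-process SQLite/Tantivy query:
    hits = [{"file": "src/payments.ts", "line": 42}]
    return {"tool": name, "results": hits}

print(call_tool("canopy_search", {"query": "payment"}))
print(call_tool("canopy_search", {}))  # fails validation
```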

Every architecture decision in Canopy optimizes for local-first operation:

  • SQLite over a network database — no latency, no authentication, no network dependency for queries
  • Static binary with bundled runtime — no Node.js, no Python, no JVM to install or version-manage
  • Incremental index — re-indexing fits into a normal development loop without blocking the agent

The tradeoff is that Canopy doesn’t sync across machines by default. Each machine has its own index. Team tier adds CI cache support for sharing indexes across CI runs.