Writing · 2026-04-08
Building Indexed Code Search for Gitea with MeiliSearch and MCP
LLM coding agents search code the way a person reads a phone book: sequentially, one page at a time. Claude Code, Aider, and similar tools issue chains of grep and find commands to locate relevant files. Each round-trip costs 5–10 seconds and burns tokens on directory listings that lead nowhere. For a small project this is tolerable. Across 33 Gitea repositories totaling ~180k lines, it becomes the bottleneck.
This post describes a system that replaces sequential search with indexed search, exposed as an MCP tool that any LLM agent can call directly.
Source: gitea.rspworks.tech/rpert/gitea-search
The Problem
The immediate trigger was operational documentation. Over months of running a production infrastructure stack (K8s, mail, DNS, monitoring, backups across 8 servers), context docs for LLM agents grew to 4,154 lines of structured markdown—memory files, runbooks, architecture notes. All of it loaded into the LLM’s context window at conversation start.
The naive solution is obvious: offload the detail into Gitea repos and keep only short pointers in the agent’s memory. This reduces the always-loaded context from 4,154 lines to ~3,400 lines of lean pointers. But it creates a new problem: the agent now needs to search those repos whenever it follows a pointer or needs detail it no longer has in memory.
The default search mechanism for LLM agents is sequential grep and find over SSH:
- 5–10 seconds per query (SSH hop + recursive grep + result formatting)
- No fuzzy matching—a typo in the query returns nothing
- No ranking—grep returns every match with equal weight, dumping noise into the context
- Multiple rounds—the agent typically needs 2–4 grep attempts to find what it needs, each burning tokens on tool-call overhead
The tradeoff is clear: loading everything into memory is expensive but fast; offloading to repos is cheap but search is slow. Indexed search eliminates the tradeoff—the agent gets a vastly larger effective knowledge base (993 documents across 38 repos) without paying the context window cost, and retrieval is faster than either approach.
Architecture
```
Gitea repos (33)
      |
      | webhook (push events)
      v
Go indexer ----------------> MeiliSearch
(CronJob +                     (Rust)
 webhook receiver)               |
                                 | HTTP API
                                 v
                             MCP server
                             (Go, stdio)
                                 |
                                 | MCP tool calls
                                 v
                             LLM agent
                         (Claude Code, etc.)
```
Gitea sends push webhooks on every commit. A Go indexer receives the webhook, pulls the affected repo via Gitea’s API, splits files into searchable documents (one per file, with path, repo name, content, and language metadata), and upserts them into MeiliSearch. A CronJob runs the full re-index every 6 hours as a consistency backstop.
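The webhook path can be sketched in a few lines of Go. The payload fields follow Gitea's push-webhook schema (which is GitHub-compatible); the `reindex` callback stands in for the actual indexing logic and is hypothetical:

```go
package main

import (
	"encoding/json"
	"io"
	"net/http"
	"strings"
)

// pushEvent is the subset of Gitea's push-webhook payload the indexer needs.
type pushEvent struct {
	Ref        string `json:"ref"` // e.g. "refs/heads/main"
	Repository struct {
		FullName string `json:"full_name"` // e.g. "rpert/gitea-search"
	} `json:"repository"`
}

// parsePush extracts the repo and branch to re-index from a webhook body.
func parsePush(body []byte) (repo, branch string, err error) {
	var ev pushEvent
	if err := json.Unmarshal(body, &ev); err != nil {
		return "", "", err
	}
	return ev.Repository.FullName, strings.TrimPrefix(ev.Ref, "refs/heads/"), nil
}

// webhookHandler wires the parser to a re-index callback. The callback runs
// in a goroutine so Gitea's webhook delivery returns immediately.
func webhookHandler(reindex func(repo, branch string)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "read error", http.StatusBadRequest)
			return
		}
		repo, branch, err := parsePush(body)
		if err != nil || repo == "" {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		go reindex(repo, branch)
		w.WriteHeader(http.StatusAccepted)
	}
}
```

Acknowledging the delivery before indexing completes keeps Gitea's webhook timeout out of the picture; the 6-hour CronJob catches anything a dropped goroutine might miss.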
MeiliSearch is the search engine: a single Rust binary that handles tokenization, typo tolerance, and ranking. It runs as a pod in the K8s cluster with a 512Mi memory limit, which leaves comfortable headroom for the full index.
The MCP server exposes a gitea_search tool over the Model Context Protocol (stdio transport). When an LLM agent calls the tool with a query string and optional filters (repo name, file extension), the server queries MeiliSearch’s HTTP API, formats the top results with file paths and highlighted matches, and returns them as tool output.
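The translation from tool arguments to a search query is mechanical. As a sketch, with the request fields mirroring MeiliSearch's documented `POST /indexes/{uid}/search` body (the `buildSearch` helper is illustrative, not the project's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// searchRequest mirrors the body of MeiliSearch's POST /indexes/{uid}/search.
type searchRequest struct {
	Q                     string   `json:"q"`
	Filter                string   `json:"filter,omitempty"`
	Limit                 int      `json:"limit"`
	AttributesToHighlight []string `json:"attributesToHighlight,omitempty"`
}

// buildSearch translates the MCP tool arguments into a MeiliSearch query.
// Optional arguments become filter clauses; empty strings mean "no filter".
func buildSearch(query, repo, filetype string, limit int) searchRequest {
	req := searchRequest{
		Q:                     query,
		Limit:                 limit,
		AttributesToHighlight: []string{"content"},
	}
	var filters []string
	if repo != "" {
		filters = append(filters, fmt.Sprintf("repo = %q", repo))
	}
	if filetype != "" {
		filters = append(filters, fmt.Sprintf("extension = %q", filetype))
	}
	req.Filter = strings.Join(filters, " AND ")
	return req
}
```

Highlighting only `content` keeps the snippets that flow back into the agent's context short.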
Implementation Details
Indexer (Go)
The indexer is a single Go binary with two modes: full re-index and webhook handler. Full re-index iterates all repos via GET /api/v1/repos/search, shallow-clones each one, and indexes every text file under 50KB. Binary files are detected by null-byte scan in the first 512 bytes and skipped.
Each document in MeiliSearch gets a composite ID of {repo}:{filepath}:{branch} to handle renames and deletions cleanly—a re-index of a repo deletes all documents for that repo first, then inserts the current state.
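One wrinkle worth noting: MeiliSearch primary keys may only contain alphanumerics, hyphens, and underscores, so a composite key containing `:` and `/` cannot be stored verbatim as the document ID. A common approach is to hash the composite key and keep the individual fields in the document body, as in this sketch (struct and helper names are illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// document is the MeiliSearch document shape described above.
type document struct {
	ID       string `json:"id"`
	Repo     string `json:"repo"`
	Filepath string `json:"filepath"`
	Branch   string `json:"branch"`
	Language string `json:"language"`
	Content  string `json:"content"`
}

// docID derives a stable primary key from {repo}:{filepath}:{branch}.
// The hash is deterministic, so re-indexing the same file upserts the
// same document rather than creating a duplicate.
func docID(repo, filepath, branch string) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%s:%s", repo, filepath, branch)))
	return hex.EncodeToString(sum[:])
}
```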
MeiliSearch Configuration
Searchable attributes, ranked by weight: filepath, content, repo, language. Typo tolerance uses MeiliSearch defaults (2 typos for words longer than 8 characters, 1 for words 5–8). Filterable attributes: repo, extension, branch.
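These settings map onto MeiliSearch's index-settings endpoint (`PATCH /indexes/{uid}/settings`). A sketch of the payload, with the typo-tolerance defaults spelled out explicitly (attribute order in `searchableAttributes` is what determines the ranking weight):

```json
{
  "searchableAttributes": ["filepath", "content", "repo", "language"],
  "filterableAttributes": ["repo", "extension", "branch"],
  "typoTolerance": {
    "minWordSizeForTypos": { "oneTypo": 5, "twoTypos": 9 }
  }
}
```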
MCP Server (Go)
The MCP server implements the stdio transport from the MCP specification. It registers a single tool with parameters: query (string, required), repo (string, optional), filetype (string, optional), limit (integer, optional, default 10). Results are formatted as structured text with repo name, file path, and content snippets.
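Per the MCP specification, a tool is advertised to clients as a name plus a JSON Schema for its inputs. The `gitea_search` registration would look roughly like this (description strings are illustrative):

```json
{
  "name": "gitea_search",
  "description": "Full-text search across all indexed Gitea repositories",
  "inputSchema": {
    "type": "object",
    "properties": {
      "query":    { "type": "string",  "description": "Search terms" },
      "repo":     { "type": "string",  "description": "Restrict results to one repository" },
      "filetype": { "type": "string",  "description": "Restrict results to a file extension" },
      "limit":    { "type": "integer", "default": 10 }
    },
    "required": ["query"]
  }
}
```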
Kubernetes Deployment
Three manifests in the gitea-search namespace: MeiliSearch Deployment (single replica, 512Mi limit, PVC for index persistence), Indexer CronJob (every 6 hours, full re-index mode), and a webhook receiver Deployment. The MCP server runs locally as a stdio process invoked by the LLM agent’s MCP client—not deployed as a pod.
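A sketch of what the indexer CronJob might look like; the schedule and namespace come from the description above, while the image name, args, and secret reference are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: gitea-search-indexer
  namespace: gitea-search
spec:
  schedule: "0 */6 * * *"   # full re-index every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: indexer
              image: gitea-search-indexer:latest   # placeholder image
              args: ["--mode=full"]                # placeholder flag
              envFrom:
                - secretRef:
                    name: gitea-search-secrets     # placeholder: Gitea + MeiliSearch tokens
```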
Results
To be clear about what this changes and what it does not: the agent’s always-loaded memory (MEMORY.md, ~3,400 lines of pointers and decision context) costs the same tokens regardless. What the indexed search replaces is grep/find over SSH—the mechanism the agent uses when it needs to retrieve detail from a repo that a memory pointer references.
Search performance: grep vs MeiliSearch
| Metric | grep over SSH | MeiliSearch MCP |
|---|---|---|
| Median query latency | ~6s | <50ms |
| Fuzzy match | No | Yes (2-typo tolerance) |
| Cross-repo | One repo at a time | All 38 repos, single query |
| Ranking | None (all matches equal) | Relevance-ranked with snippets |
| Typical search rounds | 2–4 (trial and error) | 1 (ranked results) |
Effective knowledge base
| Metric | Before (all in memory) | After (pointers + indexed search) |
|---|---|---|
| Always-loaded context | 4,154 lines | 3,401 lines (18% reduction) |
| Searchable knowledge | Same 4,154 lines | 993 documents across 38 repos |
| Retrieval cost | Zero (already loaded) | <50ms per query |
The important row is “searchable knowledge.” The agent’s effective memory went from ~4K lines (everything crammed into context) to a corpus of 993 indexed documents, with the context window cost dropping by 18%. The search layer makes the offloaded knowledge almost as accessible as if it were still in memory—at a fraction of the token cost and with fuzzy matching that in-memory lookup cannot provide.
Index size for the full corpus: 47MB on disk. Full re-index time: 82 seconds for 38 repos. Incremental webhook updates: <500ms per push.
Limitations
- No semantic search. MeiliSearch is keyword-based with typo tolerance. It does not understand that “backup retention policy” and “how long are snapshots kept” mean the same thing.
- Single-node MeiliSearch. Adequate for this scale. Not a concern unless the corpus grows by an order of magnitude.
- Stdio transport only. The MCP server runs as a local process, not a network service. Stateless by design.
Takeaway
The interesting part is not any individual component—MeiliSearch, Go, MCP, and webhooks are all well-documented, stable tools. The value is in the composition: connecting a private Git forge to a search engine to an LLM tool protocol, deployed on the same Kubernetes cluster that hosts everything else. The system closes the loop between “store operational knowledge in Git” and “make that knowledge instantly accessible to the agents that need it.”
The full source, including K8s manifests, is at gitea.rspworks.tech/rpert/gitea-search.