Codebase Search

I recently came across an interesting article: Why Cline Doesn’t Index Your Codebase (And Why That’s a Good Thing) - Cline Blog

One vivid example from the article made me rethink how code search should work:

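What does this mean?
Take a concrete example: you ask Cline to add error handling to a payment processing function.
A RAG-based approach would:
Search the vector space for "payment" and "error"
Retrieve code snippets that happen to contain those words
Possibly miss the custom error-handling framework your team built
Suggest generic try-catch blocks that don't match your code patterns
Cline's approach:
Locate the payment processing function
Trace its imports to find your error-handling utilities
Examine similar functions to understand your patterns
Check the calling functions to understand the error contract
Suggest error handling that fits your architecture exactly
What's the difference? A coherent understanding of your codebase, rather than a summary-level understanding of all your files.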

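I think what Cline describes is the ideal state: a system that truly understands code semantics and call relationships instead of fishing for keywords in fragments. In practice, that goal is still some way off. RAG may not "understand the codebase" as deeply, but for quickly pulling up information, understanding a local module, or getting familiar with a project alongside its documentation, it is more direct and takes less effort.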

1. Vector database: Qdrant

Start the Qdrant container

Pull the image

docker pull qdrant/qdrant

Start the Qdrant container service

docker run -d \
    --name qdrant_server \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    -p 6333:6333 \
    qdrant/qdrant

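This creates a container named qdrant_server from the qdrant/qdrant image.
-v mounts the host directory $(pwd)/qdrant_storage at /qdrant/storage in the container so that data persists.
-p maps host port 6333 to container port 6333 so Qdrant is reachable from the host.
-d runs the container in the background so it does not tie up the current terminal.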

docker logs qdrant_server

You should see log output like this:

yang@Yangless:~/OpenManus$ sudo docker logs qdrant_server
           _                 _
  __ _  __| |_ __ __ _ _ __ | |_
 / _` |/ _` | '__/ _` | '_ \| __|
| (_| | (_| | | | (_| | | | | |_
 \__, |\__,_|_|  \__,_|_| |_|\__|
    |_|

Version: 1.14.1, build: 530430fa
Access web UI at http://localhost:6333/dashboard

2025-06-18T02:04:25.098594Z  INFO storage::content_manager::consensus::persistent: Initializing new raft state at ./storage/raft_state.json
2025-06-18T02:04:25.123907Z  INFO qdrant: Distributed mode disabled
2025-06-18T02:04:25.124458Z  INFO qdrant: Telemetry reporting enabled, id: f32569f3-098a-4496-ae66-0b0eba3bdee6
2025-06-18T02:04:25.125168Z  INFO qdrant: Inference service is not configured.
2025-06-18T02:04:25.128123Z  INFO qdrant::actix: TLS disabled for REST API
2025-06-18T02:04:25.128199Z  INFO qdrant::actix: Qdrant HTTP listening on 6333
2025-06-18T02:04:25.128353Z  INFO actix_server::builder: starting 7 workers
2025-06-18T02:04:25.128378Z  INFO actix_server::server: Actix runtime found; starting in Actix runtime
2025-06-18T02:04:25.128381Z  INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:6333", workers: 7, listening on: 0.0.0.0:6333
2025-06-18T02:04:25.130477Z  INFO qdrant::tonic: Qdrant gRPC listening on 6334
2025-06-18T02:04:25.130493Z  INFO qdrant::tonic: TLS disabled for gRPC API

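2. Embedding model: nomic-embed-text

I use nomic-embed-text as the embedding model here. It is published with Sentence Transformers support and is suited to feature extraction and sentence-similarity computation. In my testing it did well on classification, retrieval, and clustering tasks, and the sentence embeddings it produces were good enough for this purpose.

Download the nomic-embed-text model

Downloading it from the command line is simple: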

ollama pull nomic-embed-text

Confirm that the model is available locally:

ollama list

yang@Yangless:~/OpenManus$ ollama list
NAME                         ID              SIZE      MODIFIED
nomic-embed-text:latest      0a109f422b47    274 MB    19 minutes ago
Qwen3-8B-BF16.gguf:latest    32620bdfde2a    16 GB     5 weeks ago
llama3.2:latest              a80c4f17acd5    2.0 GB    5 weeks ago

Under WSL, look up the IP address first so the service can be called from Windows:

yang@Yangless:~/OpenManus$ hostname -I
172.26.20.47 172.17.0.1

Test the embedding API:

yang@Yangless:~/Roo-Code$ curl http://172.26.20.47:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Why is the sky blue?"
}'
{"model":"nomic-embed-text","embeddings":[[0.009776355,0.044323925,-0.14051996,0.0012110417,0.032160897,0.107437715,-0.008488253,0.010181047,0.0007287834,-0.035362013,0.033811368,0.062149946,0.102554426,0.08564908,0.02366101,0.033607,-0.03356383,-0.018563574,0.048045073,-0.026967347,-0.056341264,-0.04372835,0.016524935,-0.034954622,0.06359335,0.04324542,0.03344377,-0.0003335339,0.000013826987,-0.018919408,0.0580005,0.002397802,0.01843542,-0.037297793,0.032865062,-0.059681322,0.066892944,0.026862126,0.0063907662,-0.016804932,0.0021118193,-0.035640974,-0.010764647,0.008815547,0.022905584,-0.049736686,0.01517613,0.050247945,-0.022723474,-0.050869823,-0.040053748,0.05884125,0.0022002193,-0.07127981,0.029803565,0.03862788,0.0640547,-0.030312466,-0.022575086,0.026461482,0.04690114,0.08042046,0.048403896,0.08324791,0.03957621,-0.049871583,-0.045406476,-0.005782252,0.02759462,-0.0075440546,0.045668177,-0.06911605,0.018673178,0.045374103,-0.042050228,-0.037026647,-0.053222436,0.020874536,0.026776804,0.051039793,0.02964641,0.026097259,0.0006414953,0.0048449677,0.017336344,0.0285408,0.026794678,-0.0024107858,-0.022615556,-0.009006933,-0.0032530518,-0.033786412,0.047818445,-0.013370022,-0.022947403,0.010468716,-0.010219198,0.065975636,-0.055626307,0.03517316,-0.08174245,-0.05712367,-0.036576092,0.041095156,0.05164144,0.039824422,-0.0045150174,-0.009581259,-0.06830667,-0.030202862,0.0035076917,0.040202133,0.0028079743,0.020055877,0.00459663,-0.023843797,0.008705585,-0.05718572,0.0130439075,0.069874175,0.008697175,-0.009527806,-0.0009517014,-0.04519064,-0.016905937,0.030071337,-0.00024635645,-0.00685279,-0.01280379,-0.060193934,-0.027039854,-0.021235554,-0.0040082242,0.010798034,0.009165438,0.01550646,0.024715487,-0.045948766,0.018416198,0.04708325,0.033242777,0.007809381,0.05183029,-0.039924245,-0.014552399,-0.046967242,0.0035317836,0.0063597397,-0.0040324214,-0.04196457,0.0011257192,-0.010744918,0.03089556,0.021573136,0.0018400011,-0.044057503,0.012425403,0.00294616,0.03491685,
0.048474044,0.08998199,0.00377599,-0.020732388,0.002766641,0.01025524,-0.027951697,0.027317068,0.013810083,0.035107058,0.035069287,-0.046491053,-0.038915213,-0.04343474,-0.03672313,0.023469456,0.019677488,0.053751234,0.016813617,0.045575097,-0.029662598,-0.011231897,-0.047562137,0.014721779,-0.0015712179,0.017200097,-0.0026507347,-0.05977575,-0.075682856,-0.024388783,0.0020140188,0.023157168,-0.010981663,-0.0703706,-0.038175974,-0.000820345,-0.0645781,0.030315502,0.06214725,0.0074186847,0.008182204,0.0046615493,-0.00065258745,0.016729895,0.00082525617,-0.017683705,0.053627126,0.0016933001,0.0635947,-0.03783148,0.008455034,0.05539968,-0.008906097,0.010003235,0.00542861,0.04302149,0.0017897517,-0.055445544,-0.038278498,0.000032586162,0.047680847,-0.020339161,-0.0019236375,0.034467645,-0.010021531,0.0338343,0.04207721,-0.055952758,0.015465064,0.027366644,0.0034861714,-0.01226066,-0.012835491,0.034091953,0.061497044,0.013582823,0.036172748,0.05555616,0.061696693,0.011708423,-0.0007737211,0.0185774,-0.012757757,0.020119278,0.024277491,-0.034151983,-0.009396955,-0.04150086,0.02026834]],"total_duration":1573287934,"load_duration":1385129951,"prompt_eval_count":6}
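
Vectors like the one returned above are compared by similarity rather than read directly. As a minimal sketch, assuming two embeddings of the same length, cosine similarity can be computed as follows. The function name cosineSimilarity and the zero-vector convention are choices made for this illustration, not part of Ollama or Qdrant.

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Returns a value in [-1, 1]; higher means more similar in direction.
function cosineSimilarity(a: number[], b: number[]): number {
	if (a.length !== b.length) {
		throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`)
	}
	let dot = 0
	let normA = 0
	let normB = 0
	for (let i = 0; i < a.length; i++) {
		const x = a[i] ?? 0
		const y = b[i] ?? 0
		dot += x * y
		normA += x * x
		normB += y * y
	}
	if (normA === 0 || normB === 0) {
		// Convention for this sketch: a zero vector is similar to nothing.
		return 0
	}
	return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```

In practice the two inputs would be the embeddings arrays returned by two /api/embed calls, for example one for a query and one for a code snippet.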

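3. Code indexing system

The indexing system I designed works as follows:

Parse the code with Tree-sitter to identify semantic blocks (such as functions, classes, and methods)
Use an AI model to generate an embedding for each code block
Store the vectors in the Qdrant database for fast similarity search
Use the codebase_search tool for intelligent code discovery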
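
The middle of this workflow (embed each block, then prepare it for storage) can be sketched roughly as below. This is only an illustration under stated assumptions: the CodeBlock and Point shapes, the buildPoints function, and the toyEmbed stand-in are all hypothetical. A real setup would call the embedding model asynchronously and would parse blocks with Tree-sitter, which is not shown here.

```typescript
// Hypothetical shape of a parsed code block (assumption for this sketch).
interface CodeBlock {
	filePath: string
	codeChunk: string
	startLine: number
	endLine: number
}

// Hypothetical shape of a record ready to be stored in a vector database.
// Qdrant accepts unsigned integers or UUIDs as point ids.
interface Point {
	id: number
	vector: number[]
	payload: CodeBlock
}

// Stand-in for the embedding step; a real embedder would call a model.
type Embedder = (text: string) => number[]

// Turn parsed blocks into points: one vector plus its payload per block.
function buildPoints(blocks: CodeBlock[], embed: Embedder): Point[] {
	return blocks.map((block, index) => ({
		id: index,
		vector: embed(block.codeChunk),
		payload: block,
	}))
}

// Toy embedder for demonstration only: [character count, line count].
const toyEmbed: Embedder = (text) => [text.length, text.split("\n").length]

const demoPoints = buildPoints(
	[
		{ filePath: "src/a.ts", codeChunk: "function a() {}", startLine: 1, endLine: 1 },
		{ filePath: "src/b.ts", codeChunk: "a\nb", startLine: 3, endLine: 4 },
	],
	toyEmbed,
)
```

Keeping the embedder as a parameter makes the storage step independent of which model produces the vectors.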

4. Using the index

Open the console at http://localhost:6333/dashboard#/console

Inspect the index:

// List all collections
GET collections

// Get collection info
GET collections/ws-906c3fe5023f64b5

The returned configuration:

{
  "result": {
    "status": "green",
    "optimizer_status": "ok",
    "indexed_vectors_count": 0,
    "points_count": 0,
    "segments_count": 8,
    "config": {
      "params": {
        "vectors": {
          "size": 768,
          "distance": "Cosine"
        },
        "shard_number": 1,
        "replication_factor": 1,
        "write_consistency_factor": 1,
        "on_disk_payload": true
      },
      "hnsw_config": {
        "m": 16,
        "ef_construct": 100,
        "full_scan_threshold": 10000,
        "max_indexing_threads": 0,
        "on_disk": false
      },
      "optimizer_config": {
        "deleted_threshold": 0.2,
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 0,
        "max_segment_size": null,
        "memmap_threshold": null,
        "indexing_threshold": 20000,
        "flush_interval_sec": 5,
        "max_optimization_threads": null
      },
      "wal_config": {
        "wal_capacity_mb": 32,
        "wal_segments_ahead": 0
      },
      "quantization_config": null,
      "strict_mode_config": {
        "enabled": false
      }
    },
    "payload_schema": {
      "pathSegments.2": {
        "data_type": "keyword",
        "points": 0
      },
      "pathSegments.1": {
        "data_type": "keyword",
        "points": 0
      },
      "pathSegments.0": {
        "data_type": "keyword",
        "points": 0
      },
      "pathSegments.4": {
        "data_type": "keyword",
        "points": 0
      },
      "pathSegments.3": {
        "data_type": "keyword",
        "points": 0
      }
    }
  },
  "status": "ok",
  "time": 0.000075857
}

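This is where things went wrong: the code was not being indexed (points_count is 0), so I had to dig into the source.

Source implementation: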

src/core/prompts/tools/codebase-search.ts

It provides the Markdown description, parameter documentation, and usage examples for the codebase_search tool, which is what the AI sees in the tool definition.

export function getCodebaseSearchDescription(): string {
	return `## codebase_search
Description: Find files most relevant to the search query.\nThis is a semantic search tool, so the query should ask for something semantically matching what is needed.\nIf it makes sense to only search in a particular directory, please specify it in the path parameter.\nUnless there is a clear reason to use your own search query, please just reuse the user's exact query with their wording.\nTheir exact wording/phrasing can often be helpful for the semantic search query. Keeping the same exact question format can also be helpful.\nIMPORTANT: Queries MUST be in English. Translate non-English queries before searching.
Parameters:
- query: (required) The search query to find relevant code. You should reuse the user's exact query/most recent message with their wording unless there is a clear reason not to.
- path: (optional) The path to the directory to search in relative to the current working directory. This parameter should only be a directory path, file paths are not supported. Defaults to the current working directory.
Usage:
<codebase_search>
<query>Your natural language query here</query>
<path>Path to the directory to search in (optional)</path>
</codebase_search>

Example: Searching for functions related to user authentication
<codebase_search>
<query>User login and password hashing</query>
<path>/path/to/directory</path>
</codebase_search>
`
}

src/core/tools/codebaseSearchTool.ts

	// --- Core Logic ---
	try {
		const context = cline.providerRef.deref()?.context
		if (!context) {
			throw new Error("Extension context is not available.")
		}

		const manager = CodeIndexManager.getInstance(context)

		if (!manager) {
			throw new Error("CodeIndexManager is not available.")
		}

		if (!manager.isFeatureEnabled) {
			throw new Error("Code Indexing is disabled in the settings.")
		}
		if (!manager.isFeatureConfigured) {
			throw new Error("Code Indexing is not configured (Missing OpenAI Key or Qdrant URL).")
		}

		const searchResults: VectorStoreSearchResult[] = await manager.searchIndex(query, directoryPrefix)

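manager.searchIndex(query, directoryPrefix): converts the query into a vector, then looks up the semantically closest code snippets in the vector database. directoryPrefix restricts the search scope.

directoryPrefix comes from the tool's optional path parameter: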

	// --- Parameter Extraction and Validation ---
	let query: string | undefined = block.params.query
	let directoryPrefix: string | undefined = block.params.path
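
To make the role of directoryPrefix concrete, here is a minimal sketch of how a request body with a directory filter could be built for Qdrant's REST search endpoint (POST /collections/{collection_name}/points/search). The function name buildSearchBody, the default limit of 10, and the mapping of each path segment to a pathSegments.N condition are assumptions for illustration, based on the pathSegments keys visible in the collection's payload_schema. The extension's actual implementation may differ.

```typescript
// One exact-match condition on a payload field.
interface FieldCondition {
	key: string
	match: { value: string }
}

// Subset of the Qdrant search request body used in this sketch.
interface SearchBody {
	vector: number[]
	limit: number
	with_payload: boolean
	filter?: { must: FieldCondition[] }
}

// Build a search body; if a directory prefix is given, require every
// path segment of the prefix to match the point's pathSegments payload.
function buildSearchBody(vector: number[], directoryPrefix?: string, limit = 10): SearchBody {
	const body: SearchBody = { vector, limit, with_payload: true }
	const segments = (directoryPrefix ?? "")
		.split(/[\\/]+/)
		.filter((segment) => segment !== "" && segment !== ".")
	if (segments.length > 0) {
		body.filter = {
			must: segments.map((segment, index) => ({
				key: `pathSegments.${index}`,
				match: { value: segment },
			})),
		}
	}
	return body
}
```

With no prefix (or just "./") the body carries no filter, so the whole collection is searched.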

Qdrant indexing

	/**
	 * Updates the status of a file in the state manager.
	 */

	/**
	 * Initiates the indexing process (initial scan and starts watcher).
	 */
	public async startIndexing(): Promise<void> {
		if (!this.configManager.isFeatureConfigured) {
			this.stateManager.setSystemState("Standby", "Missing configuration. Save your settings to start indexing.")
			console.warn("[CodeIndexOrchestrator] Start rejected: Missing configuration.")
			return
		}

		if (
			this._isProcessing ||
			(this.stateManager.state !== "Standby" &&
				this.stateManager.state !== "Error" &&
				this.stateManager.state !== "Indexed")
		) {
			console.warn(
				`[CodeIndexOrchestrator] Start rejected: Already processing or in state ${this.stateManager.state}.`,
			)
			return
		}

		this._isProcessing = true
		this.stateManager.setSystemState("Indexing", "Initializing services...")

		try {
			const collectionCreated = await this.vectorStore.initialize()

			if (collectionCreated) {
				await this.cacheManager.clearCacheFile()
			}

			this.stateManager.setSystemState("Indexing", "Services ready. Starting workspace scan...")

			let cumulativeBlocksIndexed = 0
			let cumulativeBlocksFoundSoFar = 0

			const handleFileParsed = (fileBlockCount: number) => {
				cumulativeBlocksFoundSoFar += fileBlockCount
				this.stateManager.reportBlockIndexingProgress(cumulativeBlocksIndexed, cumulativeBlocksFoundSoFar)
			}

			const handleBlocksIndexed = (indexedCount: number) => {
				cumulativeBlocksIndexed += indexedCount
				this.stateManager.reportBlockIndexingProgress(cumulativeBlocksIndexed, cumulativeBlocksFoundSoFar)
			}

			const result = await this.scanner.scanDirectory(
				this.workspacePath,
				(batchError: Error) => {
					console.error(
						`[CodeIndexOrchestrator] Error during initial scan batch: ${batchError.message}`,
						batchError,
					)
				},
				handleBlocksIndexed,
				handleFileParsed,
			)

			if (!result) {
				throw new Error("Scan failed, is scanner initialized?")
			}

			const { stats } = result

			await this._startWatcher()

			this.stateManager.setSystemState("Indexed", "File watcher started.")
		} catch (error: any) {
			console.error("[CodeIndexOrchestrator] Error during indexing:", error)
			try {
				await this.vectorStore.clearCollection()
			} catch (cleanupError) {
				console.error("[CodeIndexOrchestrator] Failed to clean up after error:", cleanupError)
			}

			await this.cacheManager.clearCacheFile()

			this.stateManager.setSystemState("Error", `Failed during initial scan: ${error.message || "Unknown error"}`)
			this.stopWatcher()
		} finally {
			this._isProcessing = false
		}
	}

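configManager: manages configuration and checks whether the feature has been configured.
stateManager: tracks the indexing state ("Standby", "Indexing", "Indexed", "Error") and reports progress.
vectorStore: stores the vector representations (embeddings) of code blocks used for search.
cacheManager: manages the cache file so that index data stays consistent.
scanner: scans the workspace directory, parses files, and extracts code blocks.
_startWatcher and stopWatcher: manage real-time monitoring of file changes.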
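
The start conditions at the top of startIndexing are easy to misread because of the nested negations. As a hedged restatement, the same guard can be written as a small pure function. The name canStartIndexing and its parameters are hypothetical and exist only for this sketch; the state names are the ones used in the source above.

```typescript
// The four states used by the state manager in the excerpt above.
type IndexingState = "Standby" | "Indexing" | "Indexed" | "Error"

// Returns true when a new indexing run would be accepted:
// the feature is configured, nothing is currently processing,
// and the state is one of Standby, Error, or Indexed.
function canStartIndexing(isConfigured: boolean, isProcessing: boolean, state: IndexingState): boolean {
	if (!isConfigured) {
		return false
	}
	if (isProcessing) {
		return false
	}
	return state === "Standby" || state === "Error" || state === "Indexed"
}
```

Read this way, the only state that blocks a start on its own is "Indexing"; a run can be restarted from "Indexed" or "Error".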

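The status showed "File watcher started", but the code still was not indexed and nothing relevant appeared in the VSCode Output panel, so I switched to debugging a local development build.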

Local development and debugging

Clone the repository:

git clone https://github.com/RooCodeInc/Roo-Code.git

Install dependencies:

npm run install:all

Start the webview (a Vite/React app with hot module replacement):

npm run dev

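Debug: press F5 in VSCode (or Run → Start Debugging) to open a new session with Roo Code loaded.

Changes to the webview appear immediately. Changes to the core extension require restarting the extension host.

Alternatively, build a .vsix file and install it directly in VSCode: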

npm run build

A .vsix file will appear in the bin/ directory; install it with:

code --install-extension bin/roo-cline-<version>.vsix

Release v3.21.1

[3.21.1] - 2025-06-19

Fix tree-sitter issues that were preventing codebase indexing from working correctly
Improve error handling for codebase search embeddings
Resolve MCP server execution on Windows with node version managers
Default ‘Enable MCP Server Creation’ to false
Rate limit correctly when starting a subtask (thanks @olweraltuve!)
```
Commit 9b18b14
```

Indexing

Indexing - Indexed 10200 / 20049 blocks found

{
  "result": {
    "status": "green",
    "optimizer_status": "ok",
    "indexed_vectors_count": 0,
    "points_count": 8898,
    "segments_count": 8,
    "config": {
      "params": {
        "vectors": {
          "size": 768,
          "distance": "Cosine"
        },
        "shard_number": 1,
        "replication_factor": 1,
        "write_consistency_factor": 1,
        "on_disk_payload": true
      },
      "hnsw_config": {
        "m": 16,
        "ef_construct": 100,
        "full_scan_threshold": 10000,
        "max_indexing_threads": 0,
        "on_disk": false
      },
      "optimizer_config": {
        "deleted_threshold": 0.2,
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 0,
        "max_segment_size": null,
        "memmap_threshold": null,
        "indexing_threshold": 20000,
        "flush_interval_sec": 5,
        "max_optimization_threads": null
      },
      "wal_config": {
        "wal_capacity_mb": 32,
        "wal_segments_ahead": 0
      },
      "quantization_config": null,
      "strict_mode_config": {
        "enabled": false
      }
    },
    "payload_schema": {
      "pathSegments.3": {
        "data_type": "keyword",
        "points": 8476
      },
      "pathSegments.2": {
        "data_type": "keyword",
        "points": 8829
      },
      "pathSegments.1": {
        "data_type": "keyword",
        "points": 8897
      },
      "pathSegments.4": {
        "data_type": "keyword",
        "points": 5569
      },
      "pathSegments.0": {
        "data_type": "keyword",
        "points": 8898
      }
    }
  },
  "status": "ok",
  "time": 0.000070973
}

codeChunk:

{"filePath":"webview-ui/src/i18n/locales/pl/chat.json","codeChunk":"\t},\n\t\"contextCondense\": {\n\t\t\"title\": \"Kontekst skondensowany\",\n\t\t\"condensing\": \"Kondensowanie kontekstu...\",\n\t\t\"errorHeader\": \"Nie udało się skondensować kontekstu\",\n\t\t\"tokens\": \"tokeny\"\n\t},\n\t\"followUpSuggest\": {\n\t\t\"copyToInput\": \"Kopiuj do pola wprowadzania (lub Shift + kliknięcie)\"\n\t},\n\t\"announcement\": {\n\t\t\"title\": \"🎉 Roo Code {{version}} wydany\",\n\t\t\"description\": \"Roo Code {{version}} przynosi potężne nowe funkcje i ulepszenia na podstawie Twoich opinii.\",\n\t\t\"whatsNew\": \"Co nowego\",\n\t\t\"feature1\": \"<bold>Uruchomienie Roo Marketplace</bold> - Marketplace jest już dostępny! Odkrywaj i instaluj tryby oraz MCP łatwiej niż kiedykolwiek wcześniej.\",\n\t\t\"feature2\": \"<bold>Modele Gemini 2.5</bold> - Dodano wsparcie dla nowych modeli Gemini 2.5 Pro, Flash i Flash Lite.\",\n\t\t\"feature3\": \"<bold>Wsparcie dla plików Excel i więcej</bold> - Dodano wsparcie dla plików Excel (.xlsx) oraz liczne poprawki błędów i ulepszenia!\",\n\t\t\"hideButton\": \"Ukryj ogłoszenie\",\n\t\t\"detailsDiscussLinks\": \"Uzyskaj więcej szczegółów i dołącz do dyskusji na <discordLink>Discord</discordLink> i <redditLink>Reddit</redditLink> 🚀\"\n\t},\n\t\"browser\": {","startLine":214,"endLine":234,"pathSegments":{"0":"webview-ui","1":"src","2":"i18n","3":"locales","4":"pl","5":"chat.json"}}

pathSegments:

{
  "0": "webview-ui",
  "1": "src",
  "2": "i18n",
  "3": "locales",
  "4": "pl",
  "5": "chat.json"
}
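
The pathSegments payload is simply the file path split into its components and keyed by position. A minimal sketch of that mapping, assuming forward or backward slashes as separators (the function name toPathSegments is hypothetical):

```typescript
// Split a relative file path into an object keyed by segment position,
// matching the pathSegments shape shown in the stored payload.
function toPathSegments(filePath: string): Record<string, string> {
	const segments: Record<string, string> = {}
	filePath
		.split(/[\\/]+/)
		.filter((part) => part !== "")
		.forEach((part, index) => {
			segments[String(index)] = part
		})
	return segments
}

const exampleSegments = toPathSegments("webview-ui/src/i18n/locales/pl/chat.json")
```

Storing each position under its own key is what allows keyword conditions such as pathSegments.0 or pathSegments.1 to narrow a search to a directory.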
