Exploring Codebase Search Techniques

I recently came across a very interesting article: Why Cline Doesn’t Index Your Codebase (And Why That’s a Good Thing) - Cline Blog

One vivid example in the article made me rethink the problem of code search:

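What does this mean?

A concrete example: you ask Cline to add error handling to a payment processing function.

A RAG-based approach:

  1. Searches the vector space for "payment" and "error"
  2. Retrieves code snippets that happen to contain those words
  3. May miss the custom error handling framework your team built
  4. Suggests generic try-catch blocks that do not match your code patterns

Cline's approach:

  1. Locates the payment processing function
  2. Traces its imports to find your error handling utilities
  3. Examines similar functions to understand your patterns
  4. Checks the calling functions to understand the error contract
  5. Suggests error handling that fits your architecture exactly

The difference? A coherent understanding of your codebase, rather than a summary-level understanding of all your files.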

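I think what Cline describes really is an ideal state: a system with deep semantic understanding and analysis of code. In practice there is still a long way to go. RAG, on the other hand, is more direct and efficient for quickly retrieving information, for understanding the local code of a specific feature or module, and for learning a project alongside its documentation.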

1. Vector database: Qdrant

Starting the Qdrant container

1. Pull the image

docker pull qdrant/qdrant

2. Start the Qdrant container service

docker run -d \
    --name qdrant_server \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    -p 6333:6333 \
    qdrant/qdrant
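  • Creates a container named qdrant_server from the qdrant/qdrant image.
  • Mounts the host directory $(pwd)/qdrant_storage at /qdrant/storage in the container so that data persists.
  • Maps host port 6333 to container port 6333 so that the Qdrant service is reachable from the host.
  • Runs the container in the background (-d), so it does not occupy the current terminal.

Check the container logs:

docker logs qdrant_server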

You should see logs like the following:

yang@Yangless:~/OpenManus$ sudo docker logs qdrant_server
           _                 _
  __ _  __| |_ __ __ _ _ __ | |_
 / _` |/ _` | '__/ _` | '_ \| __|
| (_| | (_| | | | (_| | | | | |_
 \__, |\__,_|_|  \__,_|_| |_|\__|
    |_|

Version: 1.14.1, build: 530430fa
Access web UI at http://localhost:6333/dashboard

2025-06-18T02:04:25.098594Z  INFO storage::content_manager::consensus::persistent: Initializing new raft state at ./storage/raft_state.json
2025-06-18T02:04:25.123907Z  INFO qdrant: Distributed mode disabled
2025-06-18T02:04:25.124458Z  INFO qdrant: Telemetry reporting enabled, id: f32569f3-098a-4496-ae66-0b0eba3bdee6
2025-06-18T02:04:25.125168Z  INFO qdrant: Inference service is not configured.
2025-06-18T02:04:25.128123Z  INFO qdrant::actix: TLS disabled for REST API
2025-06-18T02:04:25.128199Z  INFO qdrant::actix: Qdrant HTTP listening on 6333
2025-06-18T02:04:25.128353Z  INFO actix_server::builder: starting 7 workers
2025-06-18T02:04:25.128378Z  INFO actix_server::server: Actix runtime found; starting in Actix runtime
2025-06-18T02:04:25.128381Z  INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:6333", workers: 7, listening on: 0.0.0.0:6333
2025-06-18T02:04:25.130477Z  INFO qdrant::tonic: Qdrant gRPC listening on 6334
2025-06-18T02:04:25.130493Z  INFO qdrant::tonic: TLS disabled for gRPC API

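2. Embedding model: nomic-embed-text

I chose nomic-embed-text as the embedding model. It is a sentence embedding model, usable through the Sentence Transformers library, intended for feature extraction and sentence similarity. In my experience it performs well on classification, retrieval, and clustering tasks, and the sentence embeddings it produces work well for semantic similarity.

Downloading the nomic-embed-text model

Downloading it from the command line is simple: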

ollama pull nomic-embed-text

Check that the model is installed locally:

ollama list
yang@Yangless:~/OpenManus$ ollama list
NAME                         ID              SIZE      MODIFIED
nomic-embed-text:latest      0a109f422b47    274 MB    19 minutes ago
Qwen3-8B-BF16.gguf:latest    32620bdfde2a    16 GB     5 weeks ago
llama3.2:latest              a80c4f17acd5    2.0 GB    5 weeks ago

Because I am running in WSL, I need the IP address so that I can call the API from Windows:

yang@Yangless:~/OpenManus$ hostname -I
172.26.20.47 172.17.0.1

Test the embedding API:

yang@Yangless:~/Roo-Code$ curl http://172.26.20.47:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Why is the sky blue?"
}'
{"model":"nomic-embed-text","embeddings":[[0.009776355,0.044323925,-0.14051996,0.0012110417,0.032160897,0.107437715,-0.008488253,0.010181047,0.0007287834,-0.035362013,0.033811368,0.062149946,0.102554426,0.08564908,0.02366101,0.033607,-0.03356383,-0.018563574,0.048045073,-0.026967347,-0.056341264,-0.04372835,0.016524935,-0.034954622,0.06359335,0.04324542,0.03344377,-0.0003335339,0.000013826987,-0.018919408,0.0580005,0.002397802,0.01843542,-0.037297793,0.032865062,-0.059681322,0.066892944,0.026862126,0.0063907662,-0.016804932,0.0021118193,-0.035640974,-0.010764647,0.008815547,0.022905584,-0.049736686,0.01517613,0.050247945,-0.022723474,-0.050869823,-0.040053748,0.05884125,0.0022002193,-0.07127981,0.029803565,0.03862788,0.0640547,-0.030312466,-0.022575086,0.026461482,0.04690114,0.08042046,0.048403896,0.08324791,0.03957621,-0.049871583,-0.045406476,-0.005782252,0.02759462,-0.0075440546,0.045668177,-0.06911605,0.018673178,0.045374103,-0.042050228,-0.037026647,-0.053222436,0.020874536,0.026776804,0.051039793,0.02964641,0.026097259,0.0006414953,0.0048449677,0.017336344,0.0285408,0.026794678,-0.0024107858,-0.022615556,-0.009006933,-0.0032530518,-0.033786412,0.047818445,-0.013370022,-0.022947403,0.010468716,-0.010219198,0.065975636,-0.055626307,0.03517316,-0.08174245,-0.05712367,-0.036576092,0.041095156,0.05164144,0.039824422,-0.0045150174,-0.009581259,-0.06830667,-0.030202862,0.0035076917,0.040202133,0.0028079743,0.020055877,0.00459663,-0.023843797,0.008705585,-0.05718572,0.0130439075,0.069874175,0.008697175,-0.009527806,-0.0009517014,-0.04519064,-0.016905937,0.030071337,-0.00024635645,-0.00685279,-0.01280379,-0.060193934,-0.027039854,-0.021235554,-0.0040082242,0.010798034,0.009165438,0.01550646,0.024715487,-0.045948766,0.018416198,0.04708325,0.033242777,0.007809381,0.05183029,-0.039924245,-0.014552399,-0.046967242,0.0035317836,0.0063597397,-0.0040324214,-0.04196457,0.0011257192,-0.010744918,0.03089556,0.021573136,0.0018400011,-0.044057503,0.012425403,0.00294616,0.03491685,
0.048474044,0.08998199,0.00377599,-0.020732388,0.002766641,0.01025524,-0.027951697,0.027317068,0.013810083,0.035107058,0.035069287,-0.046491053,-0.038915213,-0.04343474,-0.03672313,0.023469456,0.019677488,0.053751234,0.016813617,0.045575097,-0.029662598,-0.011231897,-0.047562137,0.014721779,-0.0015712179,0.017200097,-0.0026507347,-0.05977575,-0.075682856,-0.024388783,0.0020140188,0.023157168,-0.010981663,-0.0703706,-0.038175974,-0.000820345,-0.0645781,0.030315502,0.06214725,0.0074186847,0.008182204,0.0046615493,-0.00065258745,0.016729895,0.00082525617,-0.017683705,0.053627126,0.0016933001,0.0635947,-0.03783148,0.008455034,0.05539968,-0.008906097,0.010003235,0.00542861,0.04302149,0.0017897517,-0.055445544,-0.038278498,0.000032586162,0.047680847,-0.020339161,-0.0019236375,0.034467645,-0.010021531,0.0338343,0.04207721,-0.055952758,0.015465064,0.027366644,0.0034861714,-0.01226066,-0.012835491,0.034091953,0.061497044,0.013582823,0.036172748,0.05555616,0.061696693,0.011708423,-0.0007737211,0.0185774,-0.012757757,0.020119278,0.024277491,-0.034151983,-0.009396955,-0.04150086,0.02026834]],"total_duration":1573287934,"load_duration":1385129951,"prompt_eval_count":6}
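
The same request can be made from code. The following is a minimal sketch, not taken from any project source: the names embed, buildEmbedBody, and cosineSimilarity are my own, the default base URL assumes Ollama's standard port 11434, and the HTTP function is injectable so the helper can be tried without a running server. It also shows how two returned embeddings could be compared.

```typescript
// Sketch only: embed text with Ollama's /api/embed and compare two embeddings.
// All function names here are illustrative, not part of any library.

type HttpResponse = { ok: boolean; status: number; json(): Promise<unknown> };
type HttpPost = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<HttpResponse>;

// Body for POST /api/embed; "input" may be one string or a batch of strings.
function buildEmbedBody(input: string | string[], model = "nomic-embed-text"): string {
  return JSON.stringify({ model, input });
}

// Returns one embedding per input string.
async function embed(
  input: string | string[],
  baseUrl = "http://localhost:11434", // assumption: default Ollama address
  post: HttpPost = (globalThis as unknown as { fetch: HttpPost }).fetch,
): Promise<number[][]> {
  const res = await post(`${baseUrl}/api/embed`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: buildEmbedBody(input),
  });
  if (!res.ok) {
    throw new Error(`Ollama returned HTTP ${res.status}`);
  }
  const data = (await res.json()) as { embeddings: number[][] };
  return data.embeddings;
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("Vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("Cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

From WSL or Windows, something like `const [a, b] = await embed(["Why is the sky blue?", "What makes the sky blue?"], "http://172.26.20.47:11434")` followed by `cosineSimilarity(a, b)` would give a rough sense of how close two sentences are in embedding space.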

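3. Code indexing system

The indexing system I put together works like this:

  1. Parse the code with Tree-sitter to identify semantic blocks (functions, classes, methods)
  2. Use an AI model to generate an embedding for each code block
  3. Store the vectors in the Qdrant database for fast similarity search
  4. Use the codebase_search tool for intelligent code discovery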
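
To make the first three steps more concrete, here is a rough sketch of the data flow. It is a simplification under stated assumptions: a naive regular expression stands in for Tree-sitter (it only recognises top-level function and class declarations), the point IDs are plain sequential integers, and the names splitIntoBlocks and buildUpsertBody are hypothetical. The resulting object has the shape Qdrant expects for PUT /collections/{name}/points.

```typescript
// Sketch of the indexing data flow: source text -> code blocks -> Qdrant points.
// NOTE: the regex chunker is a stand-in for Tree-sitter, not a replacement for it.

interface CodeBlock {
  filePath: string;
  startLine: number; // 1-based, inclusive
  endLine: number; // 1-based, inclusive
  codeChunk: string;
}

// Split a file into blocks, each starting at a top-level function/class line
// and running until the line before the next such declaration.
function splitIntoBlocks(filePath: string, source: string): CodeBlock[] {
  const lines = source.split("\n");
  const blockStart = /^(export\s+)?(async\s+)?(function|class)\s+\w+/;
  const starts: number[] = [];
  lines.forEach((line, i) => {
    if (blockStart.test(line)) starts.push(i);
  });
  return starts.map((start, k) => {
    const end = k + 1 < starts.length ? starts[k + 1] - 1 : lines.length - 1;
    return {
      filePath,
      startLine: start + 1,
      endLine: end + 1,
      codeChunk: lines.slice(start, end + 1).join("\n"),
    };
  });
}

interface QdrantPoint {
  id: number; // assumption: sequential integer IDs, for illustration only
  vector: number[];
  payload: CodeBlock;
}

// Pair each block with its embedding and wrap them in an upsert request body.
function buildUpsertBody(blocks: CodeBlock[], vectors: number[][]): { points: QdrantPoint[] } {
  if (blocks.length !== vectors.length) {
    throw new Error("Expected exactly one vector per code block");
  }
  return {
    points: blocks.map((block, i) => ({ id: i + 1, vector: vectors[i], payload: block })),
  };
}
```

The vectors would come from the embedding model (one per block), and the body would be sent to Qdrant over its REST API on port 6333.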

4. Using the index

Open the dashboard console at http://localhost:6333/dashboard#/console

Inspect the index:

// List all collections
GET collections

// Get collection info
GET collections/ws-906c3fe5023f64b5

The returned collection info:

{
  "result": {
    "status": "green",
    "optimizer_status": "ok",
    "indexed_vectors_count": 0,
    "points_count": 0,
    "segments_count": 8,
    "config": {
      "params": {
        "vectors": {
          "size": 768,
          "distance": "Cosine"
        },
        "shard_number": 1,
        "replication_factor": 1,
        "write_consistency_factor": 1,
        "on_disk_payload": true
      },
      "hnsw_config": {
        "m": 16,
        "ef_construct": 100,
        "full_scan_threshold": 10000,
        "max_indexing_threads": 0,
        "on_disk": false
      },
      "optimizer_config": {
        "deleted_threshold": 0.2,
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 0,
        "max_segment_size": null,
        "memmap_threshold": null,
        "indexing_threshold": 20000,
        "flush_interval_sec": 5,
        "max_optimization_threads": null
      },
      "wal_config": {
        "wal_capacity_mb": 32,
        "wal_segments_ahead": 0
      },
      "quantization_config": null,
      "strict_mode_config": {
        "enabled": false
      }
    },
    "payload_schema": {
      "pathSegments.2": {
        "data_type": "keyword",
        "points": 0
      },
      "pathSegments.1": {
        "data_type": "keyword",
        "points": 0
      },
      "pathSegments.0": {
        "data_type": "keyword",
        "points": 0
      },
      "pathSegments.4": {
        "data_type": "keyword",
        "points": 0
      },
      "pathSegments.3": {
        "data_type": "keyword",
        "points": 0
      }
    }
  },
  "status": "ok",
  "time": 0.000075857
}
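
A response like this can also be checked from a script. The sketch below is only an illustration: diagnoseCollection is a hypothetical helper, and it looks at just three fields of the response (status, points_count, and the configured vector size) to flag the most obvious problems.

```typescript
// Sketch: flag obvious problems in a Qdrant "get collection info" response.
// Only the fields used below are typed; the real response contains many more.

interface CollectionInfo {
  result: {
    status: string;
    points_count: number;
    config: { params: { vectors: { size: number; distance: string } } };
  };
}

function diagnoseCollection(info: CollectionInfo, expectedDim: number): string[] {
  const problems: string[] = [];
  const { status, points_count, config } = info.result;
  if (status !== "green") {
    problems.push(`collection status is "${status}"`);
  }
  if (points_count === 0) {
    problems.push("points_count is 0: nothing has been indexed yet");
  }
  const size = config.params.vectors.size;
  if (size !== expectedDim) {
    problems.push(`vector size ${size} does not match embedding dimension ${expectedDim}`);
  }
  return problems;
}
```

For the response above, calling it with an expected dimension of 768 would report only the empty points_count, since the status is green and the vector size matches.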

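Something is wrong: the code is not being indexed (points_count is 0), so I looked at the source code.

Source implementation: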

src/core/prompts/tools/codebase-search.ts

It provides the Markdown description, parameter documentation, and usage example for the codebase_search tool. This is what the AI sees in its tool definitions.

export function getCodebaseSearchDescription(): string {
	return `## codebase_search
Description: Find files most relevant to the search query.\nThis is a semantic search tool, so the query should ask for something semantically matching what is needed.\nIf it makes sense to only search in a particular directory, please specify it in the path parameter.\nUnless there is a clear reason to use your own search query, please just reuse the user's exact query with their wording.\nTheir exact wording/phrasing can often be helpful for the semantic search query. Keeping the same exact question format can also be helpful.\nIMPORTANT: Queries MUST be in English. Translate non-English queries before searching.
Parameters:
- query: (required) The search query to find relevant code. You should reuse the user's exact query/most recent message with their wording unless there is a clear reason not to.
- path: (optional) The path to the directory to search in relative to the current working directory. This parameter should only be a directory path, file paths are not supported. Defaults to the current working directory.
Usage:
<codebase_search>
<query>Your natural language query here</query>
<path>Path to the directory to search in (optional)</path>
</codebase_search>

Example: Searching for functions related to user authentication
<codebase_search>
<query>User login and password hashing</query>
<path>/path/to/directory</path>
</codebase_search>
`
}

src/core/tools/codebaseSearchTool.ts

	// --- Core Logic ---
	try {
		const context = cline.providerRef.deref()?.context
		if (!context) {
			throw new Error("Extension context is not available.")
		}

		const manager = CodeIndexManager.getInstance(context)

		if (!manager) {
			throw new Error("CodeIndexManager is not available.")
		}

		if (!manager.isFeatureEnabled) {
			throw new Error("Code Indexing is disabled in the settings.")
		}
		if (!manager.isFeatureConfigured) {
			throw new Error("Code Indexing is not configured (Missing OpenAI Key or Qdrant URL).")
		}

		const searchResults: VectorStoreSearchResult[] = await manager.searchIndex(query, directoryPrefix)

manager.searchIndex(query, directoryPrefix): converts query into a vector, then looks up the semantically most similar code snippets in the vector database. directoryPrefix restricts the search scope.
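
As an illustration of how such a scoped search could be expressed against Qdrant, here is a hedged sketch. It is not the extension's actual code: buildSearchBody is a hypothetical helper, and the idea of matching the directory prefix segment by segment is an assumption based on the pathSegments.N keyword fields that appear in the collection's payload_schema. The body follows the shape of Qdrant's POST /collections/{name}/points/search request.

```typescript
// Sketch: build a Qdrant search request that is optionally limited to a directory.
// Assumption: each point's payload stores its path as pathSegments {"0": ..., "1": ...}.

interface FieldCondition {
  key: string;
  match: { value: string };
}

interface SearchBody {
  vector: number[];
  limit: number;
  with_payload: boolean;
  filter?: { must: FieldCondition[] };
}

function buildSearchBody(queryVector: number[], directoryPrefix?: string, limit = 10): SearchBody {
  const body: SearchBody = { vector: queryVector, limit, with_payload: true };
  // Turn "webview-ui/src" into conditions on pathSegments.0 and pathSegments.1.
  const segments = (directoryPrefix || "")
    .split(/[\\/]+/)
    .filter((segment) => segment !== "" && segment !== ".");
  if (segments.length > 0) {
    body.filter = {
      must: segments.map((segment, i) => ({
        key: `pathSegments.${i}`,
        match: { value: segment },
      })),
    };
  }
  return body;
}
```

The query vector would be the embedding of the search text; without a directory prefix no filter is added and the whole collection is searched.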

directoryPrefix

	// --- Parameter Extraction and Validation ---
	let query: string | undefined = block.params.query
	let directoryPrefix: string | undefined = block.params.path

Qdrant indexing

	/**
	 * Initiates the indexing process (initial scan and starts watcher).
	 */
	public async startIndexing(): Promise<void> {
		if (!this.configManager.isFeatureConfigured) {
			this.stateManager.setSystemState("Standby", "Missing configuration. Save your settings to start indexing.")
			console.warn("[CodeIndexOrchestrator] Start rejected: Missing configuration.")
			return
		}

		if (
			this._isProcessing ||
			(this.stateManager.state !== "Standby" &&
				this.stateManager.state !== "Error" &&
				this.stateManager.state !== "Indexed")
		) {
			console.warn(
				`[CodeIndexOrchestrator] Start rejected: Already processing or in state ${this.stateManager.state}.`,
			)
			return
		}

		this._isProcessing = true
		this.stateManager.setSystemState("Indexing", "Initializing services...")

		try {
			const collectionCreated = await this.vectorStore.initialize()

			if (collectionCreated) {
				await this.cacheManager.clearCacheFile()
			}

			this.stateManager.setSystemState("Indexing", "Services ready. Starting workspace scan...")

			let cumulativeBlocksIndexed = 0
			let cumulativeBlocksFoundSoFar = 0

			const handleFileParsed = (fileBlockCount: number) => {
				cumulativeBlocksFoundSoFar += fileBlockCount
				this.stateManager.reportBlockIndexingProgress(cumulativeBlocksIndexed, cumulativeBlocksFoundSoFar)
			}

			const handleBlocksIndexed = (indexedCount: number) => {
				cumulativeBlocksIndexed += indexedCount
				this.stateManager.reportBlockIndexingProgress(cumulativeBlocksIndexed, cumulativeBlocksFoundSoFar)
			}

			const result = await this.scanner.scanDirectory(
				this.workspacePath,
				(batchError: Error) => {
					console.error(
						`[CodeIndexOrchestrator] Error during initial scan batch: ${batchError.message}`,
						batchError,
					)
				},
				handleBlocksIndexed,
				handleFileParsed,
			)

			if (!result) {
				throw new Error("Scan failed, is scanner initialized?")
			}

			const { stats } = result

			await this._startWatcher()

			this.stateManager.setSystemState("Indexed", "File watcher started.")
		} catch (error: any) {
			console.error("[CodeIndexOrchestrator] Error during indexing:", error)
			try {
				await this.vectorStore.clearCollection()
			} catch (cleanupError) {
				console.error("[CodeIndexOrchestrator] Failed to clean up after error:", cleanupError)
			}

			await this.cacheManager.clearCacheFile()

			this.stateManager.setSystemState("Error", `Failed during initial scan: ${error.message || "Unknown error"}`)
			this.stopWatcher()
		} finally {
			this._isProcessing = false
		}
	}
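  • configManager: manages configuration and checks whether everything is configured correctly.
  • stateManager: manages the indexing state (such as "Standby", "Indexing", "Indexed", "Error") and reports progress.
  • vectorStore: stores the vector representations of code blocks (presumably the embedding vectors, used for search).
  • cacheManager: manages the cache file to keep the index data consistent.
  • scanner: scans the workspace directory, parses files, and extracts code blocks.
  • _startWatcher and stopWatcher: manage real-time monitoring of file changes.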
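
The guard at the top of startIndexing can be isolated as a small pure function, which makes the allowed transitions easier to see. This is a simplified restatement of the excerpt above, not additional project code; decideStart and StartDecision are names introduced here for illustration.

```typescript
// Sketch: the start-up guard from startIndexing(), restated as a pure function.

type IndexingState = "Standby" | "Indexing" | "Indexed" | "Error";

interface StartDecision {
  allowed: boolean;
  reason?: string;
}

function decideStart(
  state: IndexingState,
  isProcessing: boolean,
  isConfigured: boolean,
): StartDecision {
  // Without configuration the orchestrator stays in Standby.
  if (!isConfigured) {
    return { allowed: false, reason: "Missing configuration" };
  }
  // A new run may begin only from Standby, Error, or Indexed, and only if no run is active.
  const startable: IndexingState[] = ["Standby", "Error", "Indexed"];
  if (isProcessing || startable.indexOf(state) === -1) {
    return { allowed: false, reason: `Already processing or in state ${state}` };
  }
  return { allowed: true };
}
```

In other words, the only state that blocks a restart is "Indexing" itself; a finished or failed run can always be started again.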

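The status shows "File watcher started", but no code was actually indexed and the VS Code output panel has no related logs, so I set up a local development environment.

Local development and debugging

Clone the repository: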

git clone https://github.com/RooCodeInc/Roo-Code.git

Install dependencies:

npm run install:all

Start the webview (a Vite/React app with hot module replacement):

npm run dev

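Debug: press F5 in VSCode (or Run → Start Debugging) to open a new session with Roo Code loaded.

Changes to the webview appear immediately. Changes to the core extension require restarting the extension host.

Alternatively, build a .vsix file and install it directly in VSCode: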

npm run build

A .vsix file will appear in the bin/ directory; install it with:

code --install-extension bin/roo-cline-<version>.vsix

Release v3.21.1

[3.21.1] - 2025-06-19

  • Fix tree-sitter issues that were preventing codebase indexing from working correctly

  • Improve error handling for codebase search embeddings

  • Resolve MCP server execution on Windows with node version managers

  • Default ‘Enable MCP Server Creation’ to false

  • Rate limit correctly when starting a subtask (thanks @olweraltuve!)

    Commit 9b18b14
    

indexing

Indexing - Indexed 10200 / 20049 blocks found

{
  "result": {
    "status": "green",
    "optimizer_status": "ok",
    "indexed_vectors_count": 0,
    "points_count": 8898,
    "segments_count": 8,
    "config": {
      "params": {
        "vectors": {
          "size": 768,
          "distance": "Cosine"
        },
        "shard_number": 1,
        "replication_factor": 1,
        "write_consistency_factor": 1,
        "on_disk_payload": true
      },
      "hnsw_config": {
        "m": 16,
        "ef_construct": 100,
        "full_scan_threshold": 10000,
        "max_indexing_threads": 0,
        "on_disk": false
      },
      "optimizer_config": {
        "deleted_threshold": 0.2,
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 0,
        "max_segment_size": null,
        "memmap_threshold": null,
        "indexing_threshold": 20000,
        "flush_interval_sec": 5,
        "max_optimization_threads": null
      },
      "wal_config": {
        "wal_capacity_mb": 32,
        "wal_segments_ahead": 0
      },
      "quantization_config": null,
      "strict_mode_config": {
        "enabled": false
      }
    },
    "payload_schema": {
      "pathSegments.3": {
        "data_type": "keyword",
        "points": 8476
      },
      "pathSegments.2": {
        "data_type": "keyword",
        "points": 8829
      },
      "pathSegments.1": {
        "data_type": "keyword",
        "points": 8897
      },
      "pathSegments.4": {
        "data_type": "keyword",
        "points": 5569
      },
      "pathSegments.0": {
        "data_type": "keyword",
        "points": 8898
      }
    }
  },
  "status": "ok",
  "time": 0.000070973
}

codeChunk:

{"filePath":"webview-ui/src/i18n/locales/pl/chat.json","codeChunk":"\t},\n\t\"contextCondense\": {\n\t\t\"title\": \"Kontekst skondensowany\",\n\t\t\"condensing\": \"Kondensowanie kontekstu...\",\n\t\t\"errorHeader\": \"Nie udało się skondensować kontekstu\",\n\t\t\"tokens\": \"tokeny\"\n\t},\n\t\"followUpSuggest\": {\n\t\t\"copyToInput\": \"Kopiuj do pola wprowadzania (lub Shift + kliknięcie)\"\n\t},\n\t\"announcement\": {\n\t\t\"title\": \"🎉 Roo Code {{version}} wydany\",\n\t\t\"description\": \"Roo Code {{version}} przynosi potężne nowe funkcje i ulepszenia na podstawie Twoich opinii.\",\n\t\t\"whatsNew\": \"Co nowego\",\n\t\t\"feature1\": \"<bold>Uruchomienie Roo Marketplace</bold> - Marketplace jest już dostępny! Odkrywaj i instaluj tryby oraz MCP łatwiej niż kiedykolwiek wcześniej.\",\n\t\t\"feature2\": \"<bold>Modele Gemini 2.5</bold> - Dodano wsparcie dla nowych modeli Gemini 2.5 Pro, Flash i Flash Lite.\",\n\t\t\"feature3\": \"<bold>Wsparcie dla plików Excel i więcej</bold> - Dodano wsparcie dla plików Excel (.xlsx) oraz liczne poprawki błędów i ulepszenia!\",\n\t\t\"hideButton\": \"Ukryj ogłoszenie\",\n\t\t\"detailsDiscussLinks\": \"Uzyskaj więcej szczegółów i dołącz do dyskusji na <discordLink>Discord</discordLink> i <redditLink>Reddit</redditLink> 🚀\"\n\t},\n\t\"browser\": {","startLine":214,"endLine":234,"pathSegments":{"0":"webview-ui","1":"src","2":"i18n","3":"locales","4":"pl","5":"chat.json"}}

pathSegments:

{
  "0": "webview-ui",
  "1": "src",
  "2": "i18n",
  "3": "locales",
  "4": "pl",
  "5": "chat.json"
}
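
The pathSegments payload is just the file path split on its separators, with each segment stored under its position. A minimal sketch of that transformation follows; toPathSegments is a hypothetical name, and how the extension builds this field internally is not shown in the excerpts above.

```typescript
// Sketch: derive a pathSegments object from a relative file path.

function toPathSegments(filePath: string): Record<string, string> {
  // Split on forward or backward slashes and drop empty pieces.
  const segments = filePath.split(/[\\/]+/).filter((segment) => segment.length > 0);
  const result: Record<string, string> = {};
  segments.forEach((segment, i) => {
    result[String(i)] = segment;
  });
  return result;
}

console.log(JSON.stringify(toPathSegments("webview-ui/src/i18n/locales/pl/chat.json")));
// prints {"0":"webview-ui","1":"src","2":"i18n","3":"locales","4":"pl","5":"chat.json"}
```

Storing one segment per key is what allows Qdrant to keep keyword indexes such as pathSegments.0 through pathSegments.4, as listed in the collection's payload_schema.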