I recently came across an interesting article: Why Cline Doesn’t Index Your Codebase (And Why That’s a Good Thing) - Cline Blog
One vivid example in it made me rethink the problem of code search:
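What does this mean?
Take a concrete example: you ask Cline to add error handling to a payment processing function.
A RAG-based approach:
- Searches the vector space for "payment" and "error"
- Retrieves code snippets that happen to contain those words
- May miss the custom error handling framework your team built
- Suggests generic try-catch blocks that do not match your code patterns
Cline's approach:
- Locates the payment processing function
- Follows its imports to find your error handling utilities
- Examines similar functions to understand your patterns
- Checks the calling functions to understand the error contract
- Suggests error handling that fits your architecture exactly
The difference? A coherent understanding of your codebase, rather than a summary-level understanding of all the files.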
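I think what Cline describes is the ideal: a system that actually understands code semantics and call relationships instead of fishing for keywords in fragments. In practice we are still some way from that goal. RAG may not truly "understand the codebase", but for quickly pulling up information, understanding a local module, or getting familiar with a project alongside its docs, it is more direct and takes less effort.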
1. Vector database: Qdrant
Start the Qdrant container
- Pull the image
docker pull qdrant/qdrant
- Start the Qdrant container service
docker run -d \
--name qdrant_server \
-v $(pwd)/qdrant_storage:/qdrant/storage \
-p 6333:6333 \
qdrant/qdrant
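- Creates a container named qdrant_server from the qdrant/qdrant image.
- Mounts the host directory $(pwd)/qdrant_storage at /qdrant/storage inside the container so the data persists.
- Maps host port 6333 to container port 6333 so the Qdrant service is reachable from the host.
- Runs the container in the background (-d), so it does not tie up the current terminal.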
docker logs qdrant_server
You should see log output like this:
yang@Yangless:~/OpenManus$ sudo docker logs qdrant_server
_ _
__ _ __| |_ __ __ _ _ __ | |_
/ _` |/ _` | '__/ _` | '_ \| __|
| (_| | (_| | | | (_| | | | | |_
\__, |\__,_|_| \__,_|_| |_|\__|
|_|
Version: 1.14.1, build: 530430fa
Access web UI at http://localhost:6333/dashboard
2025-06-18T02:04:25.098594Z INFO storage::content_manager::consensus::persistent: Initializing new raft state at ./storage/raft_state.json
2025-06-18T02:04:25.123907Z INFO qdrant: Distributed mode disabled
2025-06-18T02:04:25.124458Z INFO qdrant: Telemetry reporting enabled, id: f32569f3-098a-4496-ae66-0b0eba3bdee6
2025-06-18T02:04:25.125168Z INFO qdrant: Inference service is not configured.
2025-06-18T02:04:25.128123Z INFO qdrant::actix: TLS disabled for REST API
2025-06-18T02:04:25.128199Z INFO qdrant::actix: Qdrant HTTP listening on 6333
2025-06-18T02:04:25.128353Z INFO actix_server::builder: starting 7 workers
2025-06-18T02:04:25.128378Z INFO actix_server::server: Actix runtime found; starting in Actix runtime
2025-06-18T02:04:25.128381Z INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:6333", workers: 7, listening on: 0.0.0.0:6333
2025-06-18T02:04:25.130477Z INFO qdrant::tonic: Qdrant gRPC listening on 6334
2025-06-18T02:04:25.130493Z INFO qdrant::tonic: TLS disabled for gRPC API
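2. Embedding model: nomic-embed-text
Here I use nomic-embed-text as the embedding model. It works with Sentence Transformers and targets feature extraction and sentence similarity. In my testing it did well on classification, retrieval, and clustering tasks, and the sentence embeddings it produces are good enough for this purpose.
Download the nomic-embed-text model
From the command line it is a single command: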
ollama pull nomic-embed-text
Confirm that the model was downloaded:
ollama list
yang@Yangless:~/OpenManus$ ollama list
NAME ID SIZE MODIFIED
nomic-embed-text:latest 0a109f422b47 274 MB 19 minutes ago
Qwen3-8B-BF16.gguf:latest 32620bdfde2a 16 GB 5 weeks ago
llama3.2:latest a80c4f17acd5 2.0 GB 5 weeks ago
Under WSL, check the IP address first so the service can be called from Windows:
yang@Yangless:~/OpenManus$ hostname -I
172.26.20.47 172.17.0.1
Test the embedding API:
yang@Yangless:~/Roo-Code$ curl http://172.26.20.47:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Why is the sky blue?"
}'
{"model":"nomic-embed-text","embeddings":[[0.009776355,0.044323925,-0.14051996,0.0012110417,0.032160897,0.107437715,-0.008488253,0.010181047,0.0007287834,-0.035362013,0.033811368,0.062149946,0.102554426,0.08564908,0.02366101,0.033607,-0.03356383,-0.018563574,0.048045073,-0.026967347,-0.056341264,-0.04372835,0.016524935,-0.034954622,0.06359335,0.04324542,0.03344377,-0.0003335339,0.000013826987,-0.018919408,0.0580005,0.002397802,0.01843542,-0.037297793,0.032865062,-0.059681322,0.066892944,0.026862126,0.0063907662,-0.016804932,0.0021118193,-0.035640974,-0.010764647,0.008815547,0.022905584,-0.049736686,0.01517613,0.050247945,-0.022723474,-0.050869823,-0.040053748,0.05884125,0.0022002193,-0.07127981,0.029803565,0.03862788,0.0640547,-0.030312466,-0.022575086,0.026461482,0.04690114,0.08042046,0.048403896,0.08324791,0.03957621,-0.049871583,-0.045406476,-0.005782252,0.02759462,-0.0075440546,0.045668177,-0.06911605,0.018673178,0.045374103,-0.042050228,-0.037026647,-0.053222436,0.020874536,0.026776804,0.051039793,0.02964641,0.026097259,0.0006414953,0.0048449677,0.017336344,0.0285408,0.026794678,-0.0024107858,-0.022615556,-0.009006933,-0.0032530518,-0.033786412,0.047818445,-0.013370022,-0.022947403,0.010468716,-0.010219198,0.065975636,-0.055626307,0.03517316,-0.08174245,-0.05712367,-0.036576092,0.041095156,0.05164144,0.039824422,-0.0045150174,-0.009581259,-0.06830667,-0.030202862,0.0035076917,0.040202133,0.0028079743,0.020055877,0.00459663,-0.023843797,0.008705585,-0.05718572,0.0130439075,0.069874175,0.008697175,-0.009527806,-0.0009517014,-0.04519064,-0.016905937,0.030071337,-0.00024635645,-0.00685279,-0.01280379,-0.060193934,-0.027039854,-0.021235554,-0.0040082242,0.010798034,0.009165438,0.01550646,0.024715487,-0.045948766,0.018416198,0.04708325,0.033242777,0.007809381,0.05183029,-0.039924245,-0.014552399,-0.046967242,0.0035317836,0.0063597397,-0.0040324214,-0.04196457,0.0011257192,-0.010744918,0.03089556,0.021573136,0.0018400011,-0.044057503,0.012425403,0.00294616,0.03491685,
0.048474044,0.08998199,0.00377599,-0.020732388,0.002766641,0.01025524,-0.027951697,0.027317068,0.013810083,0.035107058,0.035069287,-0.046491053,-0.038915213,-0.04343474,-0.03672313,0.023469456,0.019677488,0.053751234,0.016813617,0.045575097,-0.029662598,-0.011231897,-0.047562137,0.014721779,-0.0015712179,0.017200097,-0.0026507347,-0.05977575,-0.075682856,-0.024388783,0.0020140188,0.023157168,-0.010981663,-0.0703706,-0.038175974,-0.000820345,-0.0645781,0.030315502,0.06214725,0.0074186847,0.008182204,0.0046615493,-0.00065258745,0.016729895,0.00082525617,-0.017683705,0.053627126,0.0016933001,0.0635947,-0.03783148,0.008455034,0.05539968,-0.008906097,0.010003235,0.00542861,0.04302149,0.0017897517,-0.055445544,-0.038278498,0.000032586162,0.047680847,-0.020339161,-0.0019236375,0.034467645,-0.010021531,0.0338343,0.04207721,-0.055952758,0.015465064,0.027366644,0.0034861714,-0.01226066,-0.012835491,0.034091953,0.061497044,0.013582823,0.036172748,0.05555616,0.061696693,0.011708423,-0.0007737211,0.0185774,-0.012757757,0.020119278,0.024277491,-0.034151983,-0.009396955,-0.04150086,0.02026834]],"total_duration":1573287934,"load_duration":1385129951,"prompt_eval_count":6}
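The response carries one vector per input under embeddings. To compare two such vectors, cosine similarity is the natural choice here, since the Qdrant collection inspected later in this post is configured with "distance": "Cosine". Below is a minimal sketch; the function name is my own and is not part of Ollama or Qdrant.

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Returns a value in [-1, 1]; 1 means the vectors point in the same direction.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as unrelated to everything.
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In practice a and b would each be the first entry of embeddings from two separate /api/embed calls.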
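3. Code indexing system
The indexing system (Roo Code's codebase indexing) works as follows:
- Parse the code with Tree-sitter and identify semantic blocks (functions, classes, methods)
- Use an AI model to generate an embedding for each code block
- Store the vectors in the Qdrant database for fast similarity search
- Use the codebase_search tool for semantic code discovery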
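To make the four steps concrete, here is a toy end-to-end sketch that runs entirely in memory. It is not how Roo Code implements them: a regex splitter stands in for Tree-sitter, a hashed bag-of-words stands in for the embedding model, and a plain array stands in for Qdrant. Every name in it is hypothetical.

```typescript
interface CodeBlock {
  filePath: string;
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
  content: string;
}

interface IndexedBlock extends CodeBlock {
  vector: number[];
}

const DIMS = 64;

// Stand-in for Tree-sitter: start a new block at each top-level function or class line.
function splitIntoBlocks(filePath: string, source: string): CodeBlock[] {
  const lines = source.split("\n");
  const blocks: CodeBlock[] = [];
  let start = 0;
  const flush = (end: number): void => {
    const content = lines.slice(start, end).join("\n").trim();
    if (content) {
      blocks.push({ filePath, startLine: start + 1, endLine: end, content });
    }
  };
  lines.forEach((line, i) => {
    if (i > start && /^(export\s+)?(function|class)\s/.test(line)) {
      flush(i);
      start = i;
    }
  });
  flush(lines.length);
  return blocks;
}

// Stand-in for the embedding model: count words into hashed buckets.
function toyEmbed(text: string): number[] {
  const vector: number[] = new Array(DIMS).fill(0);
  const words =
    text.replace(/([a-z])([A-Z])/g, "$1 $2").toLowerCase().match(/[a-z0-9]+/g) ?? [];
  for (const word of words) {
    let bucket = 0;
    for (let i = 0; i < word.length; i++) {
      bucket = (bucket * 31 + word.charCodeAt(i)) % DIMS;
    }
    vector[bucket] += 1;
  }
  return vector;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Stand-in for the vector store: embed every block and keep the result in an array.
function buildIndex(filePath: string, source: string): IndexedBlock[] {
  return splitIntoBlocks(filePath, source).map((block) => ({
    ...block,
    vector: toyEmbed(block.content),
  }));
}

// Stand-in for codebase_search: rank blocks by similarity to the query.
function search(index: IndexedBlock[], query: string, limit = 3): IndexedBlock[] {
  const queryVector = toyEmbed(query);
  return index
    .map((block) => ({ block, score: cosine(queryVector, block.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit)
    .map((entry) => entry.block);
}
```

The point of the sketch is the shape of the pipeline: the query goes through the same embedding function as the code blocks, and search is just a nearest-neighbour ranking over the stored vectors.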
4. Using the index
Open the console at http://localhost:6333/dashboard#/console
Inspect the index:
// List all collections
GET collections
// Get collection info
GET collections/ws-906c3fe5023f64b5
The returned collection info:
{
"result": {
"status": "green",
"optimizer_status": "ok",
"indexed_vectors_count": 0,
"points_count": 0,
"segments_count": 8,
"config": {
"params": {
"vectors": {
"size": 768,
"distance": "Cosine"
},
"shard_number": 1,
"replication_factor": 1,
"write_consistency_factor": 1,
"on_disk_payload": true
},
"hnsw_config": {
"m": 16,
"ef_construct": 100,
"full_scan_threshold": 10000,
"max_indexing_threads": 0,
"on_disk": false
},
"optimizer_config": {
"deleted_threshold": 0.2,
"vacuum_min_vector_number": 1000,
"default_segment_number": 0,
"max_segment_size": null,
"memmap_threshold": null,
"indexing_threshold": 20000,
"flush_interval_sec": 5,
"max_optimization_threads": null
},
"wal_config": {
"wal_capacity_mb": 32,
"wal_segments_ahead": 0
},
"quantization_config": null,
"strict_mode_config": {
"enabled": false
}
},
"payload_schema": {
"pathSegments.2": {
"data_type": "keyword",
"points": 0
},
"pathSegments.1": {
"data_type": "keyword",
"points": 0
},
"pathSegments.0": {
"data_type": "keyword",
"points": 0
},
"pathSegments.4": {
"data_type": "keyword",
"points": 0
},
"pathSegments.3": {
"data_type": "keyword",
"points": 0
}
}
},
"status": "ok",
"time": 0.000075857
}
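This is where things go wrong: points_count is 0, so no code has been indexed, and the only option is to keep digging through the source.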
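The same check can be scripted instead of read off the dashboard. The sketch below looks only at the fields visible in the response above; the helper name and the wording of the findings are my own, and the expected vector size is passed in by the caller rather than assumed.

```typescript
// Subset of the fields found under "result" in the collection info response.
interface CollectionInfo {
  status: string;
  points_count: number;
  config: { params: { vectors: { size: number; distance: string } } };
}

// Hypothetical helper: returns readable findings; an empty array means nothing looked off.
function checkCollection(info: CollectionInfo, expectedVectorSize: number): string[] {
  const findings: string[] = [];
  if (info.status !== "green") {
    findings.push(`collection status is ${info.status}`);
  }
  if (info.points_count === 0) {
    findings.push("collection exists but holds no points: nothing was indexed");
  }
  const size = info.config.params.vectors.size;
  if (size !== expectedVectorSize) {
    findings.push(`vector size ${size} does not match the embedding size ${expectedVectorSize}`);
  }
  return findings;
}
```

For the response above, called with an expected size of 768, this reports exactly one finding: the collection was created with the right dimensions but never received any points.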
Source implementation:
src/core/prompts/tools/codebase-search.ts
It provides the Markdown description, parameter documentation, and usage examples for the codebase_search tool, which is what the AI sees in the tool definition.
export function getCodebaseSearchDescription(): string {
return `## codebase_search
Description: Find files most relevant to the search query.\nThis is a semantic search tool, so the query should ask for something semantically matching what is needed.\nIf it makes sense to only search in a particular directory, please specify it in the path parameter.\nUnless there is a clear reason to use your own search query, please just reuse the user's exact query with their wording.\nTheir exact wording/phrasing can often be helpful for the semantic search query. Keeping the same exact question format can also be helpful.\nIMPORTANT: Queries MUST be in English. Translate non-English queries before searching.
Parameters:
- query: (required) The search query to find relevant code. You should reuse the user's exact query/most recent message with their wording unless there is a clear reason not to.
- path: (optional) The path to the directory to search in relative to the current working directory. This parameter should only be a directory path, file paths are not supported. Defaults to the current working directory.
Usage:
<codebase_search>
<query>Your natural language query here</query>
<path>Path to the directory to search in (optional)</path>
</codebase_search>
Example: Searching for functions related to user authentication
<codebase_search>
<query>User login and password hashing</query>
<path>/path/to/directory</path>
</codebase_search>
`
}
src/core/tools/codebaseSearchTool.ts
// --- Core Logic ---
try {
const context = cline.providerRef.deref()?.context
if (!context) {
throw new Error("Extension context is not available.")
}
const manager = CodeIndexManager.getInstance(context)
if (!manager) {
throw new Error("CodeIndexManager is not available.")
}
if (!manager.isFeatureEnabled) {
throw new Error("Code Indexing is disabled in the settings.")
}
if (!manager.isFeatureConfigured) {
throw new Error("Code Indexing is not configured (Missing OpenAI Key or Qdrant URL).")
}
const searchResults: VectorStoreSearchResult[] = await manager.searchIndex(query, directoryPrefix)
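manager.searchIndex(query, directoryPrefix): converts query into a vector, then looks up the semantically closest code snippets in the vector database. directoryPrefix restricts the search scope.
directoryPrefix comes from the tool's path parameter: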
// --- Parameter Extraction and Validation ---
let query: string | undefined = block.params.query
let directoryPrefix: string | undefined = block.params.path
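How directoryPrefix ends up limiting the search is not shown in the excerpt. One plausible way, given the pathSegments.0, pathSegments.1, ... keyword fields in the collection's payload schema, is to turn each directory segment into a Qdrant match condition. The sketch below is my guess at that mapping, not code taken from Roo Code; only the filter shape (must, key, match.value) is Qdrant's own.

```typescript
// Shape of a Qdrant payload filter made of exact-match conditions.
interface MatchCondition {
  key: string;
  match: { value: string };
}

interface PayloadFilter {
  must: MatchCondition[];
}

// Hypothetical helper: "webview-ui/src" becomes
// pathSegments.0 == "webview-ui" AND pathSegments.1 == "src".
function buildDirectoryFilter(directoryPrefix?: string): PayloadFilter | undefined {
  if (!directoryPrefix) {
    return undefined;
  }
  const segments = directoryPrefix
    .split(/[\\/]+/)
    .filter((segment) => segment !== "" && segment !== ".");
  if (segments.length === 0) {
    return undefined; // No restriction: search the whole workspace.
  }
  return {
    must: segments.map((segment, i) => ({
      key: `pathSegments.${i}`,
      match: { value: segment },
    })),
  };
}
```

Matching segment by segment avoids false positives that a plain string prefix would allow, such as src matching src-old.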
Qdrant indexing
/**
* Initiates the indexing process (initial scan and starts watcher).
*/
public async startIndexing(): Promise<void> {
if (!this.configManager.isFeatureConfigured) {
this.stateManager.setSystemState("Standby", "Missing configuration. Save your settings to start indexing.")
console.warn("[CodeIndexOrchestrator] Start rejected: Missing configuration.")
return
}
if (
this._isProcessing ||
(this.stateManager.state !== "Standby" &&
this.stateManager.state !== "Error" &&
this.stateManager.state !== "Indexed")
) {
console.warn(
`[CodeIndexOrchestrator] Start rejected: Already processing or in state ${this.stateManager.state}.`,
)
return
}
this._isProcessing = true
this.stateManager.setSystemState("Indexing", "Initializing services...")
try {
const collectionCreated = await this.vectorStore.initialize()
if (collectionCreated) {
await this.cacheManager.clearCacheFile()
}
this.stateManager.setSystemState("Indexing", "Services ready. Starting workspace scan...")
let cumulativeBlocksIndexed = 0
let cumulativeBlocksFoundSoFar = 0
const handleFileParsed = (fileBlockCount: number) => {
cumulativeBlocksFoundSoFar += fileBlockCount
this.stateManager.reportBlockIndexingProgress(cumulativeBlocksIndexed, cumulativeBlocksFoundSoFar)
}
const handleBlocksIndexed = (indexedCount: number) => {
cumulativeBlocksIndexed += indexedCount
this.stateManager.reportBlockIndexingProgress(cumulativeBlocksIndexed, cumulativeBlocksFoundSoFar)
}
const result = await this.scanner.scanDirectory(
this.workspacePath,
(batchError: Error) => {
console.error(
`[CodeIndexOrchestrator] Error during initial scan batch: ${batchError.message}`,
batchError,
)
},
handleBlocksIndexed,
handleFileParsed,
)
if (!result) {
throw new Error("Scan failed, is scanner initialized?")
}
const { stats } = result
await this._startWatcher()
this.stateManager.setSystemState("Indexed", "File watcher started.")
} catch (error: any) {
console.error("[CodeIndexOrchestrator] Error during indexing:", error)
try {
await this.vectorStore.clearCollection()
} catch (cleanupError) {
console.error("[CodeIndexOrchestrator] Failed to clean up after error:", cleanupError)
}
await this.cacheManager.clearCacheFile()
this.stateManager.setSystemState("Error", `Failed during initial scan: ${error.message || "Unknown error"}`)
this.stopWatcher()
} finally {
this._isProcessing = false
}
}
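- configManager: manages configuration and checks whether the feature is fully configured.
- stateManager: tracks the indexing state ("Standby", "Indexing", "Indexed", "Error") and reports progress.
- vectorStore: stores the vector representation of each code block (the embeddings used for search).
- cacheManager: manages the cache file so that index data stays consistent.
- scanner: scans the workspace directory, parses files, and extracts code blocks.
- _startWatcher and stopWatcher: manage live monitoring of file changes.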
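The two guard clauses at the top of startIndexing amount to a small state machine. Pulled out as a pure function, the rule is easier to read and to test. This is a restatement of the logic in the excerpt, with a function name and return type of my own.

```typescript
type IndexState = "Standby" | "Indexing" | "Indexed" | "Error";

interface StartDecision {
  allowed: boolean;
  reason?: string;
}

// Mirrors the guards in startIndexing: configuration first, then state.
function decideStart(
  state: IndexState,
  isProcessing: boolean,
  isConfigured: boolean,
): StartDecision {
  if (!isConfigured) {
    return { allowed: false, reason: "Missing configuration" };
  }
  // A new run may start from Standby, Error, or Indexed, but never mid-run.
  const startable = state === "Standby" || state === "Error" || state === "Indexed";
  if (isProcessing || !startable) {
    return { allowed: false, reason: `Already processing or in state ${state}` };
  }
  return { allowed: true };
}
```

Note that Indexed is a valid starting state, so re-running indexing on an already indexed workspace is allowed by design.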
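The status reads "File watcher started", yet the code still is not indexed and the VSCode Output panel shows no related logs. In the excerpt above the state switches to Indexed as soon as the scan returns a result, so that message alone does not prove any blocks were written. I switched to local development and debugging.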
Local development and debugging
Clone the repository:
git clone https://github.com/RooCodeInc/Roo-Code.git
Install dependencies:
npm run install:all
Start the webview (a Vite/React app with hot module replacement):
npm run dev
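Debug: press F5 in VSCode (or Run → Start Debugging) to open a new session with Roo Code loaded.
Changes to the webview show up immediately. Changes to the core extension require restarting the extension host.
Alternatively, build a .vsix file and install it directly in VSCode: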
npm run build
A .vsix file will appear in the bin/ directory; install it with:
code --install-extension bin/roo-cline-<version>.vsix
Release v3.21.1
[3.21.1] - 2025-06-19
- Fix tree-sitter issues that were preventing codebase indexing from working correctly
- Improve error handling for codebase search embeddings
- Resolve MCP server execution on Windows with node version managers
- Default ‘Enable MCP Server Creation’ to false
- Rate limit correctly when starting a subtask (thanks @olweraltuve!)
Commit 9b18b14
Indexing
Indexing - Indexed 10200 / 20049 blocks found
{
"result": {
"status": "green",
"optimizer_status": "ok",
"indexed_vectors_count": 0,
"points_count": 8898,
"segments_count": 8,
"config": {
"params": {
"vectors": {
"size": 768,
"distance": "Cosine"
},
"shard_number": 1,
"replication_factor": 1,
"write_consistency_factor": 1,
"on_disk_payload": true
},
"hnsw_config": {
"m": 16,
"ef_construct": 100,
"full_scan_threshold": 10000,
"max_indexing_threads": 0,
"on_disk": false
},
"optimizer_config": {
"deleted_threshold": 0.2,
"vacuum_min_vector_number": 1000,
"default_segment_number": 0,
"max_segment_size": null,
"memmap_threshold": null,
"indexing_threshold": 20000,
"flush_interval_sec": 5,
"max_optimization_threads": null
},
"wal_config": {
"wal_capacity_mb": 32,
"wal_segments_ahead": 0
},
"quantization_config": null,
"strict_mode_config": {
"enabled": false
}
},
"payload_schema": {
"pathSegments.3": {
"data_type": "keyword",
"points": 8476
},
"pathSegments.2": {
"data_type": "keyword",
"points": 8829
},
"pathSegments.1": {
"data_type": "keyword",
"points": 8897
},
"pathSegments.4": {
"data_type": "keyword",
"points": 5569
},
"pathSegments.0": {
"data_type": "keyword",
"points": 8898
}
}
},
"status": "ok",
"time": 0.000070973
}
codeChunk:
{"filePath":"webview-ui/src/i18n/locales/pl/chat.json","codeChunk":"\t},\n\t\"contextCondense\": {\n\t\t\"title\": \"Kontekst skondensowany\",\n\t\t\"condensing\": \"Kondensowanie kontekstu...\",\n\t\t\"errorHeader\": \"Nie udało się skondensować kontekstu\",\n\t\t\"tokens\": \"tokeny\"\n\t},\n\t\"followUpSuggest\": {\n\t\t\"copyToInput\": \"Kopiuj do pola wprowadzania (lub Shift + kliknięcie)\"\n\t},\n\t\"announcement\": {\n\t\t\"title\": \"🎉 Roo Code {{version}} wydany\",\n\t\t\"description\": \"Roo Code {{version}} przynosi potężne nowe funkcje i ulepszenia na podstawie Twoich opinii.\",\n\t\t\"whatsNew\": \"Co nowego\",\n\t\t\"feature1\": \"<bold>Uruchomienie Roo Marketplace</bold> - Marketplace jest już dostępny! Odkrywaj i instaluj tryby oraz MCP łatwiej niż kiedykolwiek wcześniej.\",\n\t\t\"feature2\": \"<bold>Modele Gemini 2.5</bold> - Dodano wsparcie dla nowych modeli Gemini 2.5 Pro, Flash i Flash Lite.\",\n\t\t\"feature3\": \"<bold>Wsparcie dla plików Excel i więcej</bold> - Dodano wsparcie dla plików Excel (.xlsx) oraz liczne poprawki błędów i ulepszenia!\",\n\t\t\"hideButton\": \"Ukryj ogłoszenie\",\n\t\t\"detailsDiscussLinks\": \"Uzyskaj więcej szczegółów i dołącz do dyskusji na <discordLink>Discord</discordLink> i <redditLink>Reddit</redditLink> 🚀\"\n\t},\n\t\"browser\": {","startLine":214,"endLine":234,"pathSegments":{"0":"webview-ui","1":"src","2":"i18n","3":"locales","4":"pl","5":"chat.json"}}
pathSegments:
{
0:"webview-ui"
1:"src"
2:"i18n"
3:"locales"
4:"pl"
5:"chat.json"
}
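The pathSegments payload is simply the file path split on its separators and keyed by position. A sketch of how such a payload could be produced follows; the function name is hypothetical, and this is inferred from the stored data rather than copied from the Roo Code source.

```typescript
// Turns "webview-ui/src/i18n/locales/pl/chat.json" into
// { "0": "webview-ui", "1": "src", ..., "5": "chat.json" }.
function toPathSegments(filePath: string): Record<string, string> {
  const payload: Record<string, string> = {};
  filePath
    .split(/[\\/]+/) // Accept both POSIX and Windows separators.
    .filter((part) => part.length > 0)
    .forEach((part, i) => {
      payload[String(i)] = part;
    });
  return payload;
}
```

Storing each segment under its own key is what lets Qdrant index them as separate keyword fields, which matches the pathSegments.0 to pathSegments.4 entries in the payload schema.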