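Exploring Codebase Search Techniques
I recently came across an interesting article: Why Cline Doesn’t Index Your Codebase (And Why That’s a Good Thing) - Cline Blog
One vivid example in it made me rethink the problem of code search: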
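What does this mean?
A concrete example: you ask Cline to add error handling to a payment processing function.
The RAG-based approach:
- Searches the vector space for "payment" and "error"
- Retrieves code snippets that happen to contain those words
- May miss the custom error handling framework your team built
- Suggests generic try-catch blocks that do not match your code patterns
Cline's approach:
- Locates the payment processing function
- Traces its imports to find your error handling utilities
- Examines similar functions to understand your patterns
- Checks the calling functions to understand the error contracts
- Suggests error handling that fits your architecture exactly
The difference? A coherent understanding of your codebase rather than a summary-level understanding of all the files.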
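I think what Cline describes is an ideal state: a system with deep semantic understanding and analysis of code. In practice there is still a long way to go. RAG, on the other hand, is more direct and efficient for quickly retrieving information, for understanding the local code of a specific feature or module, and for learning a project together with its documentation.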
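To make the "trace its imports" step from the quoted example more concrete, here is a minimal sketch of how a tool might list the modules a file imports so that it can open them next. This is not how Cline is implemented; the function name listImports and the regex-based approach are assumptions for illustration, and a real tool would more likely use a parser.

```typescript
// Hypothetical helper: list the module specifiers that a TypeScript or
// JavaScript source file imports, so a tool could read those files next.
function listImports(source: string): string[] {
  const specifiers: string[] = [];
  // Matches `import ... from "x"`, `import "x"`, and `require("x")`.
  const pattern = /(?:import\s+(?:[^"']*?\s+from\s+)?|require\(\s*)["']([^"']+)["']/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    const specifier = match[1];
    if (specifier !== undefined) {
      specifiers.push(specifier);
    }
  }
  return specifiers;
}
```

Starting from the payment function's file, the returned specifiers (for example a relative path to an error handling module) point to the files worth reading before suggesting a change.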
1. Vector database: Qdrant
Start the Qdrant container
1. Pull the image
docker pull qdrant/qdrant
2. Start the Qdrant container service
docker run -d \
--name qdrant_server \
-v $(pwd)/qdrant_storage:/qdrant/storage \
-p 6333:6333 \
qdrant/qdrant
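- Creates a container named qdrant_server from the qdrant/qdrant image.
- Mounts the host directory $(pwd)/qdrant_storage at /qdrant/storage in the container so that data persists.
- Maps host port 6333 to container port 6333 so that the Qdrant service is reachable from the host.
- Runs the container in the background (-d), so it does not occupy the current terminal.
Check the container logs: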
docker logs qdrant_server
The output looks like this:
yang@Yangless:~/OpenManus$ sudo docker logs qdrant_server
_ _
__ _ __| |_ __ __ _ _ __ | |_
/ _` |/ _` | '__/ _` | '_ \| __|
| (_| | (_| | | | (_| | | | | |_
\__, |\__,_|_| \__,_|_| |_|\__|
|_|
Version: 1.14.1, build: 530430fa
Access web UI at http://localhost:6333/dashboard
2025-06-18T02:04:25.098594Z INFO storage::content_manager::consensus::persistent: Initializing new raft state at ./storage/raft_state.json
2025-06-18T02:04:25.123907Z INFO qdrant: Distributed mode disabled
2025-06-18T02:04:25.124458Z INFO qdrant: Telemetry reporting enabled, id: f32569f3-098a-4496-ae66-0b0eba3bdee6
2025-06-18T02:04:25.125168Z INFO qdrant: Inference service is not configured.
2025-06-18T02:04:25.128123Z INFO qdrant::actix: TLS disabled for REST API
2025-06-18T02:04:25.128199Z INFO qdrant::actix: Qdrant HTTP listening on 6333
2025-06-18T02:04:25.128353Z INFO actix_server::builder: starting 7 workers
2025-06-18T02:04:25.128378Z INFO actix_server::server: Actix runtime found; starting in Actix runtime
2025-06-18T02:04:25.128381Z INFO actix_server::server: starting service: "actix-web-service-0.0.0.0:6333", workers: 7, listening on: 0.0.0.0:6333
2025-06-18T02:04:25.130477Z INFO qdrant::tonic: Qdrant gRPC listening on 6334
2025-06-18T02:04:25.130493Z INFO qdrant::tonic: TLS disabled for gRPC API
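2. Embedding model: nomic-embed-text
I chose nomic-embed-text as the embedding model. It is a sentence embedding model that can be used through the Sentence Transformers library and is intended for feature extraction and sentence similarity. In my experience it performs well on classification, retrieval, and clustering tasks, and its sentence embeddings are particularly useful for semantic similarity.
Download the nomic-embed-text model
Downloading from the command line is simple: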
ollama pull nomic-embed-text
Confirm that the model was downloaded:
ollama list
yang@Yangless:~/OpenManus$ ollama list
NAME ID SIZE MODIFIED
nomic-embed-text:latest 0a109f422b47 274 MB 19 minutes ago
Qwen3-8B-BF16.gguf:latest 32620bdfde2a 16 GB 5 weeks ago
llama3.2:latest a80c4f17acd5 2.0 GB 5 weeks ago
Because this runs inside WSL, I need the IP address so that the service can be called from Windows:
yang@Yangless:~/OpenManus$ hostname -I
172.26.20.47 172.17.0.1
Test the embedding API:
yang@Yangless:~/Roo-Code$ curl http://172.26.20.47:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Why is the sky blue?"
}'
{"model":"nomic-embed-text","embeddings":[[0.009776355,0.044323925,-0.14051996,0.0012110417,0.032160897,0.107437715,-0.008488253,0.010181047,0.0007287834,-0.035362013,0.033811368,0.062149946,0.102554426,0.08564908,0.02366101,0.033607,-0.03356383,-0.018563574,0.048045073,-0.026967347,-0.056341264,-0.04372835,0.016524935,-0.034954622,0.06359335,0.04324542,0.03344377,-0.0003335339,0.000013826987,-0.018919408,0.0580005,0.002397802,0.01843542,-0.037297793,0.032865062,-0.059681322,0.066892944,0.026862126,0.0063907662,-0.016804932,0.0021118193,-0.035640974,-0.010764647,0.008815547,0.022905584,-0.049736686,0.01517613,0.050247945,-0.022723474,-0.050869823,-0.040053748,0.05884125,0.0022002193,-0.07127981,0.029803565,0.03862788,0.0640547,-0.030312466,-0.022575086,0.026461482,0.04690114,0.08042046,0.048403896,0.08324791,0.03957621,-0.049871583,-0.045406476,-0.005782252,0.02759462,-0.0075440546,0.045668177,-0.06911605,0.018673178,0.045374103,-0.042050228,-0.037026647,-0.053222436,0.020874536,0.026776804,0.051039793,0.02964641,0.026097259,0.0006414953,0.0048449677,0.017336344,0.0285408,0.026794678,-0.0024107858,-0.022615556,-0.009006933,-0.0032530518,-0.033786412,0.047818445,-0.013370022,-0.022947403,0.010468716,-0.010219198,0.065975636,-0.055626307,0.03517316,-0.08174245,-0.05712367,-0.036576092,0.041095156,0.05164144,0.039824422,-0.0045150174,-0.009581259,-0.06830667,-0.030202862,0.0035076917,0.040202133,0.0028079743,0.020055877,0.00459663,-0.023843797,0.008705585,-0.05718572,0.0130439075,0.069874175,0.008697175,-0.009527806,-0.0009517014,-0.04519064,-0.016905937,0.030071337,-0.00024635645,-0.00685279,-0.01280379,-0.060193934,-0.027039854,-0.021235554,-0.0040082242,0.010798034,0.009165438,0.01550646,0.024715487,-0.045948766,0.018416198,0.04708325,0.033242777,0.007809381,0.05183029,-0.039924245,-0.014552399,-0.046967242,0.0035317836,0.0063597397,-0.0040324214,-0.04196457,0.0011257192,-0.010744918,0.03089556,0.021573136,0.0018400011,-0.044057503,0.012425403,0.00294616,0.03491685,
0.048474044,0.08998199,0.00377599,-0.020732388,0.002766641,0.01025524,-0.027951697,0.027317068,0.013810083,0.035107058,0.035069287,-0.046491053,-0.038915213,-0.04343474,-0.03672313,0.023469456,0.019677488,0.053751234,0.016813617,0.045575097,-0.029662598,-0.011231897,-0.047562137,0.014721779,-0.0015712179,0.017200097,-0.0026507347,-0.05977575,-0.075682856,-0.024388783,0.0020140188,0.023157168,-0.010981663,-0.0703706,-0.038175974,-0.000820345,-0.0645781,0.030315502,0.06214725,0.0074186847,0.008182204,0.0046615493,-0.00065258745,0.016729895,0.00082525617,-0.017683705,0.053627126,0.0016933001,0.0635947,-0.03783148,0.008455034,0.05539968,-0.008906097,0.010003235,0.00542861,0.04302149,0.0017897517,-0.055445544,-0.038278498,0.000032586162,0.047680847,-0.020339161,-0.0019236375,0.034467645,-0.010021531,0.0338343,0.04207721,-0.055952758,0.015465064,0.027366644,0.0034861714,-0.01226066,-0.012835491,0.034091953,0.061497044,0.013582823,0.036172748,0.05555616,0.061696693,0.011708423,-0.0007737211,0.0185774,-0.012757757,0.020119278,0.024277491,-0.034151983,-0.009396955,-0.04150086,0.02026834]],"total_duration":1573287934,"load_duration":1385129951,"prompt_eval_count":6}
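The response carries the vector in an embeddings array of arrays. As a minimal sketch of what a client does with it, the code below extracts the first vector from a response body and compares two vectors with cosine similarity. The helper names firstEmbedding and cosineSimilarity are assumptions for illustration, and the HTTP call itself is left out.

```typescript
// Only the field of the /api/embed response that this sketch relies on.
interface EmbedResponse {
  embeddings: number[][];
}

// Pull the first embedding vector out of a raw JSON response body.
function firstEmbedding(responseBody: string): number[] {
  const parsed = JSON.parse(responseBody) as EmbedResponse;
  const vector = parsed.embeddings[0];
  if (vector === undefined) {
    throw new Error("response contained no embeddings");
  }
  return vector;
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) {
    throw new Error("cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Two texts with similar meaning should produce vectors whose cosine similarity is closer to 1 than that of unrelated texts, which is the property a semantic code search relies on.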
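3. Code indexing system
The indexing system I designed works as follows:
- Parse the code with Tree-sitter to identify semantic blocks (functions, classes, methods)
- Generate an embedding for each code block with an AI model
- Store the vectors in the Qdrant database for fast similarity search
- Use the codebase_search tool for semantic code discovery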
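The steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the Roo Code implementation: splitIntoBlocks is a line-based stand-in for Tree-sitter that only recognizes top-level function and class declarations, and the embed and save callbacks are hypothetical placeholders for the embedding model and the Qdrant client.

```typescript
interface CodeBlock {
  filePath: string;
  startLine: number; // 1-based, inclusive
  endLine: number; // 1-based, inclusive
  content: string;
}

// Stand-in for Tree-sitter: start a new block at each top-level
// `function` or `class` declaration. A real parser walks the syntax tree.
function splitIntoBlocks(filePath: string, source: string): CodeBlock[] {
  const lines = source.split("\n");
  const starts: number[] = [];
  lines.forEach((line, index) => {
    if (/^(export\s+)?(async\s+)?(function|class)\s/.test(line)) {
      starts.push(index);
    }
  });
  return starts.map((start, i) => {
    const next = starts[i + 1];
    const end = next === undefined ? lines.length : next;
    return {
      filePath,
      startLine: start + 1,
      endLine: end,
      content: lines.slice(start, end).join("\n"),
    };
  });
}

// Hypothetical callbacks: one turns texts into vectors, one persists them.
type Embedder = (texts: string[]) => Promise<number[][]>;
type VectorSink = (items: { block: CodeBlock; vector: number[] }[]) => Promise<void>;

// Parse one file, embed its blocks, and hand them to the vector store.
// Returns the number of blocks that were indexed.
async function indexFile(
  filePath: string,
  source: string,
  embed: Embedder,
  save: VectorSink,
): Promise<number> {
  const blocks = splitIntoBlocks(filePath, source);
  if (blocks.length === 0) {
    return 0;
  }
  const vectors = await embed(blocks.map((block) => block.content));
  if (vectors.length !== blocks.length) {
    throw new Error("embedder returned a different number of vectors than blocks");
  }
  await save(blocks.map((block, i) => ({ block, vector: vectors[i] ?? [] })));
  return blocks.length;
}
```

Keeping the embedder and the store behind callbacks makes it easy to see which stage fails when nothing ends up in the index.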
4. Using the index
Open the dashboard console at http://localhost:6333/dashboard#/console
List the collections and inspect the index:
// List all collections
GET collections
// Get collection info
GET collections/ws-906c3fe5023f64b5
The returned collection info:
{
"result": {
"status": "green",
"optimizer_status": "ok",
"indexed_vectors_count": 0,
"points_count": 0,
"segments_count": 8,
"config": {
"params": {
"vectors": {
"size": 768,
"distance": "Cosine"
},
"shard_number": 1,
"replication_factor": 1,
"write_consistency_factor": 1,
"on_disk_payload": true
},
"hnsw_config": {
"m": 16,
"ef_construct": 100,
"full_scan_threshold": 10000,
"max_indexing_threads": 0,
"on_disk": false
},
"optimizer_config": {
"deleted_threshold": 0.2,
"vacuum_min_vector_number": 1000,
"default_segment_number": 0,
"max_segment_size": null,
"memmap_threshold": null,
"indexing_threshold": 20000,
"flush_interval_sec": 5,
"max_optimization_threads": null
},
"wal_config": {
"wal_capacity_mb": 32,
"wal_segments_ahead": 0
},
"quantization_config": null,
"strict_mode_config": {
"enabled": false
}
},
"payload_schema": {
"pathSegments.2": {
"data_type": "keyword",
"points": 0
},
"pathSegments.1": {
"data_type": "keyword",
"points": 0
},
"pathSegments.0": {
"data_type": "keyword",
"points": 0
},
"pathSegments.4": {
"data_type": "keyword",
"points": 0
},
"pathSegments.3": {
"data_type": "keyword",
"points": 0
}
}
},
"status": "ok",
"time": 0.000075857
}
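The collection exists and reports status green, but points_count is 0, so nothing has been stored yet. A minimal sketch of that check is shown below. The interface lists only the response fields used here, and the function name describeCollection and the expectedSize parameter are assumptions for illustration.

```typescript
// Subset of the collection info response that this check reads.
interface CollectionInfo {
  result: {
    status: string;
    points_count: number;
    config: { params: { vectors: { size: number; distance: string } } };
  };
}

// Return human-readable findings; an empty list means nothing looked wrong.
function describeCollection(info: CollectionInfo, expectedSize: number): string[] {
  const findings: string[] = [];
  const { status, points_count, config } = info.result;
  if (status !== "green") {
    findings.push(`collection status is ${status}`);
  }
  const size = config.params.vectors.size;
  if (size !== expectedSize) {
    findings.push(`vector size ${size} does not match expected size ${expectedSize}`);
  }
  if (points_count === 0) {
    findings.push("collection is empty: no points have been upserted");
  }
  return findings;
}
```

For the response above, with expectedSize set to 768, the only finding is the empty collection, which points at the indexing stage rather than at the Qdrant setup.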
Something is wrong: the code is not being indexed (points_count is 0), so I looked at the source.
Source implementation:
src/core/prompts/tools/codebase-search.ts
This file provides the Markdown description, parameter notes, and usage example for the codebase_search tool. It is what the AI sees in its tool definitions.
export function getCodebaseSearchDescription(): string {
return `## codebase_search
Description: Find files most relevant to the search query.\nThis is a semantic search tool, so the query should ask for something semantically matching what is needed.\nIf it makes sense to only search in a particular directory, please specify it in the path parameter.\nUnless there is a clear reason to use your own search query, please just reuse the user's exact query with their wording.\nTheir exact wording/phrasing can often be helpful for the semantic search query. Keeping the same exact question format can also be helpful.\nIMPORTANT: Queries MUST be in English. Translate non-English queries before searching.
Parameters:
- query: (required) The search query to find relevant code. You should reuse the user's exact query/most recent message with their wording unless there is a clear reason not to.
- path: (optional) The path to the directory to search in relative to the current working directory. This parameter should only be a directory path, file paths are not supported. Defaults to the current working directory.
Usage:
<codebase_search>
<query>Your natural language query here</query>
<path>Path to the directory to search in (optional)</path>
</codebase_search>
Example: Searching for functions related to user authentication
<codebase_search>
<query>User login and password hashing</query>
<path>/path/to/directory</path>
</codebase_search>
`
}
src/core/tools/codebaseSearchTool.ts
// --- Core Logic ---
try {
const context = cline.providerRef.deref()?.context
if (!context) {
throw new Error("Extension context is not available.")
}
const manager = CodeIndexManager.getInstance(context)
if (!manager) {
throw new Error("CodeIndexManager is not available.")
}
if (!manager.isFeatureEnabled) {
throw new Error("Code Indexing is disabled in the settings.")
}
if (!manager.isFeatureConfigured) {
throw new Error("Code Indexing is not configured (Missing OpenAI Key or Qdrant URL).")
}
const searchResults: VectorStoreSearchResult[] = await manager.searchIndex(query, directoryPrefix)
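manager.searchIndex(query, directoryPrefix): converts query into a vector and then looks up the semantically most similar code snippets in the vector database. directoryPrefix restricts the search scope.
directoryPrefix comes from the tool's path parameter: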
// --- Parameter Extraction and Validation ---
let query: string | undefined = block.params.query
let directoryPrefix: string | undefined = block.params.path
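The payload_schema shown earlier indexes the keys pathSegments.0 through pathSegments.4, which suggests that the directory prefix is turned into one filter condition per path segment. The sketch below builds a Qdrant search request body that way. The mapping from directoryPrefix to conditions is an assumption inferred from that schema, not taken from the Roo Code source, and buildSearchRequest is a hypothetical name.

```typescript
// One "field equals value" condition in a Qdrant filter.
interface FieldCondition {
  key: string;
  match: { value: string };
}

// Body for a Qdrant vector search request (only the fields used here).
interface SearchRequest {
  vector: number[];
  limit: number;
  with_payload: boolean;
  filter?: { must: FieldCondition[] };
}

// Build a search body that, when a directory prefix is given, only matches
// points whose leading path segments equal the prefix segments.
function buildSearchRequest(
  queryVector: number[],
  directoryPrefix?: string,
  limit = 10,
): SearchRequest {
  const request: SearchRequest = { vector: queryVector, limit, with_payload: true };
  const segments = (directoryPrefix ?? "")
    .split("/")
    .filter((segment) => segment !== "" && segment !== ".");
  if (segments.length > 0) {
    request.filter = {
      must: segments.map((value, index) => ({
        key: `pathSegments.${index}`,
        match: { value },
      })),
    };
  }
  return request;
}
```

With the prefix webview-ui/src, the filter requires pathSegments.0 to equal webview-ui and pathSegments.1 to equal src, so only chunks under that directory are scored.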
Qdrant indexing
/**
* Updates the status of a file in the state manager.
*/
/**
* Initiates the indexing process (initial scan and starts watcher).
*/
public async startIndexing(): Promise<void> {
if (!this.configManager.isFeatureConfigured) {
this.stateManager.setSystemState("Standby", "Missing configuration. Save your settings to start indexing.")
console.warn("[CodeIndexOrchestrator] Start rejected: Missing configuration.")
return
}
if (
this._isProcessing ||
(this.stateManager.state !== "Standby" &&
this.stateManager.state !== "Error" &&
this.stateManager.state !== "Indexed")
) {
console.warn(
`[CodeIndexOrchestrator] Start rejected: Already processing or in state ${this.stateManager.state}.`,
)
return
}
this._isProcessing = true
this.stateManager.setSystemState("Indexing", "Initializing services...")
try {
const collectionCreated = await this.vectorStore.initialize()
if (collectionCreated) {
await this.cacheManager.clearCacheFile()
}
this.stateManager.setSystemState("Indexing", "Services ready. Starting workspace scan...")
let cumulativeBlocksIndexed = 0
let cumulativeBlocksFoundSoFar = 0
const handleFileParsed = (fileBlockCount: number) => {
cumulativeBlocksFoundSoFar += fileBlockCount
this.stateManager.reportBlockIndexingProgress(cumulativeBlocksIndexed, cumulativeBlocksFoundSoFar)
}
const handleBlocksIndexed = (indexedCount: number) => {
cumulativeBlocksIndexed += indexedCount
this.stateManager.reportBlockIndexingProgress(cumulativeBlocksIndexed, cumulativeBlocksFoundSoFar)
}
const result = await this.scanner.scanDirectory(
this.workspacePath,
(batchError: Error) => {
console.error(
`[CodeIndexOrchestrator] Error during initial scan batch: ${batchError.message}`,
batchError,
)
},
handleBlocksIndexed,
handleFileParsed,
)
if (!result) {
throw new Error("Scan failed, is scanner initialized?")
}
const { stats } = result
await this._startWatcher()
this.stateManager.setSystemState("Indexed", "File watcher started.")
} catch (error: any) {
console.error("[CodeIndexOrchestrator] Error during indexing:", error)
try {
await this.vectorStore.clearCollection()
} catch (cleanupError) {
console.error("[CodeIndexOrchestrator] Failed to clean up after error:", cleanupError)
}
await this.cacheManager.clearCacheFile()
this.stateManager.setSystemState("Error", `Failed during initial scan: ${error.message || "Unknown error"}`)
this.stopWatcher()
} finally {
this._isProcessing = false
}
}
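- configManager: manages configuration and checks whether the feature is configured correctly.
- stateManager: tracks the indexing state ("Standby", "Indexing", "Indexed", "Error") and reports progress.
- vectorStore: stores the embedding vectors of code blocks for similarity search.
- cacheManager: manages the cache file so that it stays consistent with the index.
- scanner: scans the workspace directory, parses files, and extracts code blocks.
- _startWatcher and stopWatcher: manage real-time monitoring of file changes.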
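The guard at the top of startIndexing can be restated as a small pure function, which makes the allowed transitions easier to see. This is a restatement of the conditions in the excerpt above; the function name canStartIndexing is hypothetical.

```typescript
// The four states named in the excerpt above.
type IndexingState = "Standby" | "Indexing" | "Indexed" | "Error";

// Mirrors the two early returns in startIndexing: indexing starts only when
// the feature is configured, nothing is in progress, and the state is one
// of Standby, Error, or Indexed.
function canStartIndexing(
  isConfigured: boolean,
  isProcessing: boolean,
  state: IndexingState,
): boolean {
  if (!isConfigured) {
    return false;
  }
  if (isProcessing) {
    return false;
  }
  return state === "Standby" || state === "Error" || state === "Indexed";
}
```

A call made while the state is already Indexing is rejected, which matches the "Start rejected" warning in the excerpt.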
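The status shows "File watcher started", yet no code was indexed and the VS Code output panel has no related logs. Note that startIndexing sets the state to "Indexed" whenever scanDirectory returns a result, and batch errors are only passed to the console.error callback, so this status alone does not prove that any blocks were stored. To dig further, I set up a local development environment.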
Local development and debugging
Clone the repository:
git clone https://github.com/RooCodeInc/Roo-Code.git
Install dependencies:
npm run install:all
Start the webview (a Vite/React app with hot module replacement):
npm run dev
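Debug: press F5 in VSCode (or Run → Start Debugging) to open a new session with Roo Code loaded.
Changes to the webview appear immediately. Changes to the core extension require restarting the extension host.
Alternatively, build a .vsix file and install it directly in VSCode: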
npm run build
A .vsix file appears in the bin/ directory and can be installed with:
code --install-extension bin/roo-cline-<version>.vsix
Release v3.21.1
[3.21.1] - 2025-06-19
Fix tree-sitter issues that were preventing codebase indexing from working correctly
Improve error handling for codebase search embeddings
Resolve MCP server execution on Windows with node version managers
Default ‘Enable MCP Server Creation’ to false
Rate limit correctly when starting a subtask (thanks @olweraltuve!)
Commit 9b18b14
indexing
Indexing - Indexed 10200 / 20049 blocks found
{
"result": {
"status": "green",
"optimizer_status": "ok",
"indexed_vectors_count": 0,
"points_count": 8898,
"segments_count": 8,
"config": {
"params": {
"vectors": {
"size": 768,
"distance": "Cosine"
},
"shard_number": 1,
"replication_factor": 1,
"write_consistency_factor": 1,
"on_disk_payload": true
},
"hnsw_config": {
"m": 16,
"ef_construct": 100,
"full_scan_threshold": 10000,
"max_indexing_threads": 0,
"on_disk": false
},
"optimizer_config": {
"deleted_threshold": 0.2,
"vacuum_min_vector_number": 1000,
"default_segment_number": 0,
"max_segment_size": null,
"memmap_threshold": null,
"indexing_threshold": 20000,
"flush_interval_sec": 5,
"max_optimization_threads": null
},
"wal_config": {
"wal_capacity_mb": 32,
"wal_segments_ahead": 0
},
"quantization_config": null,
"strict_mode_config": {
"enabled": false
}
},
"payload_schema": {
"pathSegments.3": {
"data_type": "keyword",
"points": 8476
},
"pathSegments.2": {
"data_type": "keyword",
"points": 8829
},
"pathSegments.1": {
"data_type": "keyword",
"points": 8897
},
"pathSegments.4": {
"data_type": "keyword",
"points": 5569
},
"pathSegments.0": {
"data_type": "keyword",
"points": 8898
}
}
},
"status": "ok",
"time": 0.000070973
}
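This time points_count is 8898 while indexed_vectors_count is still 0. One plausible explanation, based on my reading of the Qdrant documentation and not verified against this instance, is that indexing_threshold (20000) is a size in kilobytes applied per segment, so small segments are searched by full scan and no HNSW index is built. The arithmetic below estimates the raw vector size, assuming 4-byte float components and an even spread of points over the 8 segments; both are assumptions.

```typescript
// Estimate raw vector storage in KiB for a number of points.
// bytesPerComponent = 4 assumes float32 components.
function vectorStorageKiB(points: number, dimensions: number, bytesPerComponent = 4): number {
  return (points * dimensions * bytesPerComponent) / 1024;
}

// Values taken from the collection info above.
const totalKiB = vectorStorageKiB(8898, 768);
// Assumes points are spread evenly over the 8 segments.
const perSegmentKiB = totalKiB / 8;
const indexingThresholdKiB = 20000;
const belowThreshold = perSegmentKiB < indexingThresholdKiB;
```

Under these assumptions each segment holds roughly 3.3 MiB of vectors, well below the threshold, so an indexed_vectors_count of 0 would not by itself indicate a problem.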
codeChunk:
{"filePath":"webview-ui/src/i18n/locales/pl/chat.json","codeChunk":"\t},\n\t\"contextCondense\": {\n\t\t\"title\": \"Kontekst skondensowany\",\n\t\t\"condensing\": \"Kondensowanie kontekstu...\",\n\t\t\"errorHeader\": \"Nie udało się skondensować kontekstu\",\n\t\t\"tokens\": \"tokeny\"\n\t},\n\t\"followUpSuggest\": {\n\t\t\"copyToInput\": \"Kopiuj do pola wprowadzania (lub Shift + kliknięcie)\"\n\t},\n\t\"announcement\": {\n\t\t\"title\": \"🎉 Roo Code {{version}} wydany\",\n\t\t\"description\": \"Roo Code {{version}} przynosi potężne nowe funkcje i ulepszenia na podstawie Twoich opinii.\",\n\t\t\"whatsNew\": \"Co nowego\",\n\t\t\"feature1\": \"<bold>Uruchomienie Roo Marketplace</bold> - Marketplace jest już dostępny! Odkrywaj i instaluj tryby oraz MCP łatwiej niż kiedykolwiek wcześniej.\",\n\t\t\"feature2\": \"<bold>Modele Gemini 2.5</bold> - Dodano wsparcie dla nowych modeli Gemini 2.5 Pro, Flash i Flash Lite.\",\n\t\t\"feature3\": \"<bold>Wsparcie dla plików Excel i więcej</bold> - Dodano wsparcie dla plików Excel (.xlsx) oraz liczne poprawki błędów i ulepszenia!\",\n\t\t\"hideButton\": \"Ukryj ogłoszenie\",\n\t\t\"detailsDiscussLinks\": \"Uzyskaj więcej szczegółów i dołącz do dyskusji na <discordLink>Discord</discordLink> i <redditLink>Reddit</redditLink> 🚀\"\n\t},\n\t\"browser\": {","startLine":214,"endLine":234,"pathSegments":{"0":"webview-ui","1":"src","2":"i18n","3":"locales","4":"pl","5":"chat.json"}}
pathSegments:
{
0:"webview-ui"
1:"src"
2:"i18n"
3:"locales"
4:"pl"
5:"chat.json"
}
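The pathSegments payload is the file path split on "/" and keyed by position. A minimal sketch of that transformation follows; the function name toPathSegments is hypothetical. It also offers a likely reading of the payload_schema counts above: every point has pathSegments.0, but fewer points have deeper segments, because files closer to the repository root have shorter paths.

```typescript
// Split a relative file path into an object keyed by segment position,
// matching the shape of the pathSegments payload shown above.
function toPathSegments(filePath: string): Record<string, string> {
  const segments: Record<string, string> = {};
  filePath
    .split("/")
    .filter((part) => part !== "")
    .forEach((part, index) => {
      segments[String(index)] = part;
    });
  return segments;
}
```

Storing segments under separate keys lets a keyword index on each position support directory-scoped searches with exact matches.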