To start, here are a few pieces on model interpretability that are worth revisiting again and again:

Reflections on Qualitative Research

Feature Visualization

Tracing the thoughts of a large language model - Anthropic