<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Model Safety on 杨の草原</title><link>https://thinkless-github-io.pages.dev/tags/%E6%A8%A1%E5%9E%8B%E5%AE%89%E5%85%A8/</link><description>Recent content in Model Safety on 杨の草原</description><generator>Hugo</generator><language>zh-CN</language><lastBuildDate>Mon, 02 Feb 2026 16:27:09 +0800</lastBuildDate><atom:link href="https://thinkless-github-io.pages.dev/tags/%E6%A8%A1%E5%9E%8B%E5%AE%89%E5%85%A8/index.xml" rel="self" type="application/rss+xml"/><item><title>Reinforcement Learning from Human Feedback (RLHF) 4</title><link>https://thinkless-github-io.pages.dev/posts/%E5%9F%BA%E4%BA%8E%E4%BA%BA%E7%B1%BB%E5%8F%8D%E9%A6%88%E7%9A%84%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0rlhf4/</link><pubDate>Mon, 02 Feb 2026 16:27:09 +0800</pubDate><guid>https://thinkless-github-io.pages.dev/posts/%E5%9F%BA%E4%BA%8E%E4%BA%BA%E7%B1%BB%E5%8F%8D%E9%A6%88%E7%9A%84%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0rlhf4/</guid><description>Don't be fooled by the “SOTA” benchmark scores that large models flaunt everywhere! Does a high score really mean a model is useful? Evaluating an RLHF model takes far more than checking pass rates. This post lays out an alignment evaluation framework centered on “HHH” and breaks down the trade-off between reward score and KL divergence during training. From experimental design for human evaluation, to denoising techniques for automated benchmarks, to adversarial validation through red teaming, it offers an end-to-end evaluation guide spanning fine-tuning monitoring to safe deployment.</description></item><item><title>Safety of Large Models</title><link>https://thinkless-github-io.pages.dev/posts/%E5%A4%A7%E6%A8%A1%E5%9E%8B%E7%9A%84%E5%AE%89%E5%85%A8%E6%80%A7/</link><pubDate>Wed, 07 May 2025 10:36:21 +0800</pubDate><guid>https://thinkless-github-io.pages.dev/posts/%E5%A4%A7%E6%A8%A1%E5%9E%8B%E7%9A%84%E5%AE%89%E5%85%A8%E6%80%A7/</guid><description>Notes on large-model safety, covering the principles of adversarial attacks and defense strategies, including white-box, gray-box, and black-box attacks, token manipulation, and gradient-based attack mechanisms.</description></item></channel></rss>