burnssa/judge-distillation-medical-interpretability
Source: Hugging Face. Updated: 2026-04-29. Indexed: 2026-05-03.
Download link:
https://hf-mirror.com/datasets/burnssa/judge-distillation-medical-interpretability
Description:
The Judge-Distillation Medical Misalignment Interpretability Dataset is a complete artifact bundle for the Phase 2 judge-distillation experiments. It includes the training datasets, the source per-prompt activations (which underlie the drift_pct labels), Gemma Scope SAE feature attributions, hidden-state captures, and the transfer-test corpora and scores for all five versions (v1–v5). The dataset is intended mainly for research on medical alignment evaluation, probe distillation, emergent misalignment, sparse autoencoders, and Gemma Scope interpretability. It is in English, is licensed under MIT, and is in the 1K<n<10K size category.
Provider:
burnssa



