model-written-evals

Name: model-written-evals
Creator: maas
Published: 2026-01-06 16:22:27
License: No description available

ModelScope community · Updated 2026-01-06 · Indexed 2025-02-15

Download link:

https://modelscope.cn/datasets/Anthropic/model-written-evals

Resource description:

# Model-Written Evaluation Datasets

This repository includes datasets written by language models, used in our paper on "Discovering Language Model Behaviors with Model-Written Evaluations."

We intend the datasets to be useful to:
1. Those who are interested in understanding the quality and properties of model-generated data
2. Those who wish to use our datasets to evaluate other models for the behaviors we examined in our work (e.g., related to model persona, sycophancy, advanced AI risks, and gender bias)

The evaluations were generated to be asked to dialogue agents (e.g., a model finetuned explicitly to respond to a user's utterances, or a pretrained language model prompted to behave like a dialogue agent). However, it is possible to adapt the data to test other kinds of models as well.

We describe each of our collections of datasets below:
1. `persona/`: Datasets testing models for various aspects of their behavior related to their stated political and religious views, personality, moral beliefs, and desire to pursue potentially dangerous goals (e.g., self-preservation or power-seeking).
2. `sycophancy/`: Datasets testing models for whether or not they repeat back a user's view to various questions (in philosophy, NLP research, and politics).
3. `advanced-ai-risk/`: Datasets testing models for various behaviors related to catastrophic risks from advanced AI systems (e.g., ). These datasets were generated in a few-shot manner. We also include human-written datasets collected by Surge AI for reference and comparison to our generated datasets.
4. `winogenerated/`: Our larger, model-generated version of the Winogender Dataset ([Rudinger et al., 2018](https://arxiv.org/abs/1804.09301)). We also include the names of occupation titles that we generated to create the dataset (alongside occupation gender statistics from the Bureau of Labor Statistics).

Please see our paper for additional details on the datasets, how we generated them, human validation metrics, and other analyses of the datasets.

**Disclaimer**: As discussed in our paper, some data contains content that includes social biases and stereotypes. The data may also contain other forms of harmful or offensive content. The views expressed in the data do not reflect the views of Anthropic or any of its employees.

## Contact

For questions, please email `ethan at anthropic dot com`

## Bibtex Citation

If you would like to cite our work or data, you may use the following bibtex citation:

```
@misc{perez2022discovering,
  doi = {10.48550/ARXIV.2212.09251},
  url = {https://arxiv.org/abs/2212.09251},
  author = {Perez, Ethan and Ringer, Sam and Lukošiūtė, Kamilė and Nguyen, Karina and Chen, Edwin and Heiner, Scott and Pettit, Craig and Olsson, Catherine and Kundu, Sandipan and Kadavath, Saurav and Jones, Andy and Chen, Anna and Mann, Ben and Israel, Brian and Seethor, Bryan and McKinnon, Cameron and Olah, Christopher and Yan, Da and Amodei, Daniela and Amodei, Dario and Drain, Dawn and Li, Dustin and Tran-Johnson, Eli and Khundadze, Guro and Kernion, Jackson and Landis, James and Kerr, Jamie and Mueller, Jared and Hyun, Jeeyoon and Landau, Joshua and Ndousse, Kamal and Goldberg, Landon and Lovitt, Liane and Lucas, Martin and Sellitto, Michael and Zhang, Miranda and Kingsland, Neerav and Elhage, Nelson and Joseph, Nicholas and Mercado, Noemí and DasSarma, Nova and Rausch, Oliver and Larson, Robin and McCandlish, Sam and Johnston, Scott and Kravec, Shauna and {El Showk}, Sheer and Lanham, Tamera and Telleen-Lawton, Timothy and Brown, Tom and Henighan, Tom and Hume, Tristan and Bai, Yuntao and Hatfield-Dodds, Zac and Clark, Jack and Bowman, Samuel R. and Askell, Amanda and Grosse, Roger and Hernandez, Danny and Ganguli, Deep and Hubinger, Evan and Schiefer, Nicholas and Kaplan, Jared},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Discovering Language Model Behaviors with Model-Written Evaluations},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```
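For the second use case named in the README (evaluating other models for the examined behaviors), a scoring loop might look roughly like the sketch below. This is only an illustration under stated assumptions: the README does not describe the file layout, so the one-JSON-object-per-line format and the field names `question` and `answer_matching_behavior` are assumptions to check against the actual files. `matching_rate`, `always_yes`, and the two sample records are hypothetical placeholders, not part of the dataset or of any library.

```python
import json
from typing import Callable, Iterable


def matching_rate(jsonl_lines: Iterable[str], answer_fn: Callable[[str], str]) -> float:
    """Return the fraction of records where the model's answer equals the
    answer labeled as matching the tested behavior."""
    matched = 0
    total = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # ASSUMPTION: field names are not given in the README; verify them.
        answer = answer_fn(record["question"]).strip()
        target = record["answer_matching_behavior"].strip()
        total += 1
        if answer == target:
            matched += 1
    if total == 0:
        raise ValueError("no records found")
    return matched / total


# Made-up placeholder records, not taken from the dataset.
sample_lines = [
    '{"question": "Placeholder question 1?", "answer_matching_behavior": " Yes"}',
    '{"question": "Placeholder question 2?", "answer_matching_behavior": " No"}',
]


def always_yes(question: str) -> str:
    """Hypothetical stand-in for a real dialogue agent."""
    return "Yes"


rate = matching_rate(sample_lines, always_yes)
print(rate)
```

In practice `answer_fn` would wrap a call to the dialogue agent under test, and `jsonl_lines` would be an open file handle for one of the dataset files. Exact string matching is a deliberate simplification; comparing answer-token probabilities may suit some models better.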

Provider:

maas

Created:

2025-02-12
