Anthropic/model-written-evals

Name: Anthropic/model-written-evals
Creator: Anthropic
Published: 2022-12-21 02:33:18
License: 暂无描述

Hugging Face2022-12-21 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/Anthropic/model-written-evals

下载链接

链接失效反馈

官方服务：

资源简介：

--- annotations_creators: - machine-generated language: - en language_creators: - machine-generated license: - cc-by-4.0 multilinguality: - monolingual pretty_name: Evaluations from "Discovering Language Model Behaviors with Model-Written Evaluations" size_categories: - 100K<n<1M source_datasets: - original tags: - gender bias - social bias - AI safety - personality - politics task_categories: - multiple-choice - zero-shot-classification - question-answering task_ids: - multiple-choice-qa - multiple-choice-coreference-resolution --- # Model-Written Evaluation Datasets This repository includes datasets written by language models, used in our paper on "Discovering Language Model Behaviors with Model-Written Evaluations." We intend the datasets to be useful to: 1. Those who are interested in understanding the quality and properties of model-generated data 2. Those who wish to use our datasets to evaluate other models for the behaviors we examined in our work (e.g., related to model persona, sycophancy, advanced AI risks, and gender bias) The evaluations were generated to be asked to dialogue agents (e.g., a model finetuned explicitly respond to a user's utterances, or a pretrained language model prompted to behave like a dialogue agent). However, it is possible to adapt the data to test other kinds of models as well. We describe each of our collections of datasets below: 1. `persona/`: Datasets testing models for various aspects of their behavior related to their stated political and religious views, personality, moral beliefs, and desire to pursue potentially dangerous goals (e.g., self-preservation or power-seeking). 2. `sycophancy/`: Datasets testing models for whether or not they repeat back a user's view to various questions (in philosophy, NLP research, and politics) 3. `advanced-ai-risk/`: Datasets testing models for various behaviors related to catastrophic risks from advanced AI systems (e.g., ). These datasets were generated in a few-shot manner. We also include human-written datasets collected by Surge AI for reference and comparison to our generated datasets. 4. `winogenerated/`: Our larger, model-generated version of the Winogender Dataset ([Rudinger et al., 2018](https://arxiv.org/abs/1804.09301)). We also include the names of occupation titles that we generated, to create the dataset (alongside occupation gender statistics from the Bureau of Labor Statistics) Please see our paper for additional details on the datasets, how we generated them, human validation metrics, and other analyses of the datasets. **Disclaimer**: As discussed in our paper, some data contains content that includes social biases and stereotypes. The data may also contain other forms of harmful or offensive content. The views expressed in the data do not reflect the views of Anthropic or any of its employees. ## Contact For questions, please email `ethan at anthropic dot com` ## Bibtex Citation If you would like to cite our work or data, you may use the following bibtex citation: ``` @misc{perez2022discovering, doi = {10.48550/ARXIV.2212.09251}, url = {https://arxiv.org/abs/2212.09251}, author = {Perez, Ethan and Ringer, Sam and Lukošiūtė, Kamilė and Nguyen, Karina and Chen, Edwin and Heiner, Scott and Pettit, Craig and Olsson, Catherine and Kundu, Sandipan and Kadavath, Saurav and Jones, Andy and Chen, Anna and Mann, Ben and Israel, Brian and Seethor, Bryan and McKinnon, Cameron and Olah, Christopher and Yan, Da and Amodei, Daniela and Amodei, Dario and Drain, Dawn and Li, Dustin and Tran-Johnson, Eli and Khundadze, Guro and Kernion, Jackson and Landis, James and Kerr, Jamie and Mueller, Jared and Hyun, Jeeyoon and Landau, Joshua and Ndousse, Kamal and Goldberg, Landon and Lovitt, Liane and Lucas, Martin and Sellitto, Michael and Zhang, Miranda and Kingsland, Neerav and Elhage, Nelson and Joseph, Nicholas and Mercado, Noemí and DasSarma, Nova and Rausch, Oliver and Larson, Robin and McCandlish, Sam and Johnston, Scott and Kravec, Shauna and {El Showk}, Sheer and Lanham, Tamera and Telleen-Lawton, Timothy and Brown, Tom and Henighan, Tom and Hume, Tristan and Bai, Yuntao and Hatfield-Dodds, Zac and Clark, Jack and Bowman, Samuel R. and Askell, Amanda and Grosse, Roger and Hernandez, Danny and Ganguli, Deep and Hubinger, Evan and Schiefer, Nicholas and Kaplan, Jared}, keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences}, title = {Discovering Language Model Behaviors with Model-Written Evaluations}, publisher = {arXiv}, year = {2022}, copyright = {arXiv.org perpetual, non-exclusive license} } ```

提供机构：

Anthropic

原始信息汇总

数据集概述

基本信息

名称: Evaluations from "Discovering Language Model Behaviors with Model-Written Evaluations"
语言: 英语 (en)
语言生成方式: 机器生成
许可证: CC-BY-4.0
多语言性: 单语种
大小: 100K<n<1M
来源: 原始数据

任务类别

多项选择
零样本分类
问答

任务ID

多项选择QA
多项选择共指消解

数据集内容

persona/: 测试模型关于其政治和宗教观点、个性、道德信仰及追求潜在危险目标（如自我保护或权力追求）的行为。
sycophancy/: 测试模型是否重复用户对各种问题的观点（哲学、NLP研究和政治）。
advanced-ai-risk/: 测试模型关于高级AI系统带来的灾难性风险的行为。这些数据集以少量样本方式生成，并包括Surge AI收集的人工编写数据集以供参考和比较。
winogenerated/: 模型生成的Winogender数据集的大型版本，包括我们生成的职业头衔名称及职业性别统计数据。

使用目的

理解模型生成数据的质量和属性
使用数据集评估其他模型在我们研究中检查的行为（如模型个性、奉承、高级AI风险和性别偏见）

数据适用性

适用于对话代理（如模型微调以明确响应用户话语或预训练语言模型提示以作为对话代理行为），也可适应于测试其他类型的模型。

注意事项

数据可能包含社会偏见和刻板印象，以及其他有害或冒犯性内容。数据中的观点不代表Anthropic或其员工的观点。

搜集汇总

数据集介绍

背景与挑战

背景概述

该数据集由Anthropic创建，包含模型编写的评估数据，用于分析和评估语言模型在人格、阿谀奉承、高级AI风险及性别偏见等方面的行为。数据集以JSON格式提供，涵盖多项选择、零样本分类等任务，重点关注社会偏见和AI安全议题，规模约为3.25k行，适用于研究模型生成数据的特性。

以上内容由遇见数据集搜集并总结生成

5,000+

优质数据集

54 个

任务类型

进入经典数据集