prosocial-dialog
ModelScope community. Updated 2025-08-01; indexed 2025-05-31.
Download link:
https://modelscope.cn/datasets/allenai/prosocial-dialog
Resource description:
# Dataset Card for ProsocialDialog Dataset
## Dataset Description
- **Repository:** [Dataset and Model](https://github.com/skywalker023/prosocial-dialog)
- **Paper:** [ProsocialDialog: A Prosocial Backbone for Conversational Agents](https://aclanthology.org/2022.emnlp-main.267/)
- **Point of Contact:** [Hyunwoo Kim](mailto:hyunwook@allenai.org)
## Dataset Summary
ProsocialDialog is the first large-scale multi-turn English dialogue dataset to teach conversational agents to respond to problematic content following social norms. Covering diverse unethical, problematic, biased, and toxic situations, ProsocialDialog contains responses that encourage prosocial behavior, grounded in commonsense social rules (i.e., rules-of-thumb, RoTs). Created via a human-AI collaborative framework, ProsocialDialog consists of 58K dialogues, with 331K utterances, 160K unique RoTs, and 497K dialogue safety labels accompanied by free-form rationales.
## Supported Tasks
* Dialogue response generation
* Dialogue safety prediction
* Rules-of-thumb generation
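
As a rough illustration of how the three tasks above could be framed, the sketch below turns one record into (input, target) text pairs. The field names come from the Data Attributes table in this card; the record contents, the `to_task_pairs` helper, and the task keys are hypothetical, and a real setup would probably condition on the full dialogue history rather than on a single `context`.

```python
from typing import Dict, Tuple

# Hypothetical record: field names follow this card, but the text values
# are invented placeholders, not actual rows from the dataset.
example = {
    "context": "I skipped my friend's wedding without telling them.",
    "response": "Why did you skip it? It would be kind to let them know.",
    "rots": ["It's good to tell people when you can't attend their event."],
    "safety_label": "__needs_caution__",
}


def to_task_pairs(row: dict) -> Dict[str, Tuple[str, str]]:
    """Map one row to (input, target) pairs, one per supported task.

    The task names used as keys are an assumption made for this sketch.
    """
    pairs = {
        "response_generation": (row["context"], row["response"]),
        "safety_prediction": (row["context"], row["safety_label"]),
    }
    rots = row.get("rots")
    if rots:  # `rots` may be null, e.g. for contexts labeled __casual__
        pairs["rot_generation"] = (row["context"], " ".join(rots))
    return pairs


pairs = to_task_pairs(example)
for task, (source, target) in pairs.items():
    print(f"{task}: {source!r} -> {target!r}")
```

Rows without rules-of-thumb simply yield no `rot_generation` pair, which keeps the null case out of the training targets.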
## Languages
English
## Dataset Structure
### Data Attributes
attribute | type | description
--- | --- | ---
`context` | str | the potentially unsafe utterance
`response` | str | the guiding utterance grounded on rules-of-thumb (`rots`)
`rots` | list of str\|null | the relevant rules-of-thumb for a `context` *not* labeled as \_\_casual\_\_
`safety_label` | str | the final verdict of the context according to `safety_annotations`: {\_\_casual\_\_, \_\_possibly\_needs\_caution\_\_, \_\_probably\_needs\_caution\_\_, \_\_needs\_caution\_\_, \_\_needs\_intervention\_\_}
`safety_annotations` | list of str | raw annotations from three workers: {casual, needs caution, needs intervention}
`safety_annotation_reasons` | list of str | the reasons behind the safety annotations in free-form text from each worker
`source` | str | the source of the seed text that was used to craft the first utterance of the dialogue: {socialchemistry, sbic, ethics_amt, ethics_reddit}
`etc` | str\|null | other information
`dialogue_id` | int | the dialogue index
`response_id` | int | the response index
`episode_done` | bool | an indicator of whether it is the end of the dialogue
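
Because each row holds a single `context`/`response` pair, multi-turn dialogues have to be reassembled from the flat rows. The sketch below is one way to do that, assuming that `response_id` orders the turns within a `dialogue_id`; that ordering is an assumption of this example, and the sample rows and the `group_dialogues` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical flat rows, deliberately out of order; the strings are
# placeholders and do not come from the dataset.
rows = [
    {"dialogue_id": 0, "response_id": 1, "context": "A1", "response": "B1", "episode_done": True},
    {"dialogue_id": 1, "response_id": 0, "context": "C0", "response": "D0", "episode_done": True},
    {"dialogue_id": 0, "response_id": 0, "context": "A0", "response": "B0", "episode_done": False},
]


def group_dialogues(rows: List[dict]) -> Dict[int, List[str]]:
    """Rebuild each dialogue as an alternating list of utterances."""
    by_id = defaultdict(list)
    for row in rows:
        by_id[row["dialogue_id"]].append(row)

    dialogues = {}
    for dialogue_id, turns in by_id.items():
        # Assumption: response_id gives the turn order inside a dialogue.
        ordered = sorted(turns, key=lambda r: r["response_id"])
        utterances = []
        for turn in ordered:
            utterances.extend([turn["context"], turn["response"]])
        dialogues[dialogue_id] = utterances
    return dialogues


dialogues = group_dialogues(rows)
print(dialogues)
```

If the ordering assumption holds, the last row of each group should be the one with `episode_done` set to true, which gives a cheap consistency check on real data.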
## Dataset Creation
To create ProsocialDialog, we set up a human-AI collaborative data creation framework in which GPT-3 generates the potentially unsafe utterances and crowdworkers provide prosocial responses to them. This approach allows us to circumvent two substantial challenges: (1) there are no available large-scale corpora of multi-turn prosocial conversations between humans, and (2) asking humans to write unethical, toxic, or problematic utterances could result in psychological harm (Roberts, 2017; Steiger et al., 2021).
### Further Details, Social Impacts, and Limitations
Please refer to our [paper](https://arxiv.org/abs/2205.12688).
## Additional Information
### Citation
Please cite our work if you find the resources in this repository useful:
```
@inproceedings{kim2022prosocialdialog,
title={ProsocialDialog: A Prosocial Backbone for Conversational Agents},
author={Hyunwoo Kim and Youngjae Yu and Liwei Jiang and Ximing Lu and Daniel Khashabi and Gunhee Kim and Yejin Choi and Maarten Sap},
booktitle={EMNLP},
year=2022
}
```
Provider:
maas
Created:
2025-05-27



