soda
Source: ModelScope community. Updated 2025-11-27; indexed 2025-05-31.
Download link:
https://modelscope.cn/datasets/allenai/soda
# Dataset Card for 🥤SODA
## Dataset Description
- **Repository:** [Code](https://github.com/skywalker023/sodaverse)
- **Paper:** [SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization](https://arxiv.org/abs/2212.10465)
- **Point of Contact:** [Hyunwoo Kim](mailto:hyunwook@allenai.org)
## Dataset Summary
🥤SODA is the first publicly available, million-scale, high-quality dialogue dataset covering a wide range of social interactions. Dialogues are distilled from a pre-trained language model (PLM; InstructGPT, Ouyang et al., 2022) by contextualizing social commonsense knowledge from a knowledge graph (Atomic10x; West et al., 2022). In human evaluation, dialogues in SODA are rated more consistent, specific, and (surprisingly) natural than those in prior human-authored datasets, e.g., DailyDialog (Li et al., 2017) and BlendedSkillTalk (Smith et al., 2020). Because social commonsense knowledge encompasses emotional reactions (i.e., the `xReact` relation), SODA also includes 385K conversations labeled with 1.7K unique emotions, along with information about the experiencer and the cause, i.e., `PersonX` and the `head` event in the symbolic commonsense knowledge triple.
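
As a rough illustration of the emotion annotations described above, the sketch below pulls the emotion, experiencer, and cause out of records whose `relation` is `xReact`. It assumes each record is a plain dict with the fields documented in this card and that the emotion label is stored in `tail`. The two sample records and the helper name `emotion_annotation` are made up for illustration and are not taken from the dataset.

```python
from typing import Optional


def emotion_annotation(record: dict) -> Optional[dict]:
    """Return emotion, experiencer, and cause for an xReact record, else None."""
    if record.get("relation") != "xReact":
        return None
    return {
        "emotion": record["tail"],         # assumption: the xReact tail holds the emotion
        "experiencer": record["PersonX"],  # name assigned to PersonX
        "cause": record["head"],           # head event that triggers the emotion
    }


# Hypothetical records, shaped like the fields documented in this card.
sample_records = [
    {"head": "PersonX wins the race", "relation": "xReact",
     "tail": "proud", "PersonX": "Mina"},
    {"head": "PersonX buys a ticket", "relation": "xNeed",
     "tail": "to save money", "PersonX": "Omar"},
]

# Keep only the records that carry an emotion annotation.
annotations = [a for a in map(emotion_annotation, sample_records) if a is not None]
print(annotations)
```

Only the first sample record is kept, because the second uses a different relation.
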
## Languages
English
## Dataset Structure
field | type | description
--- | --- | ---
`head` | str | the head event in the symbolic commonsense knowledge triple
`relation` | str | the relationship between `head` and `tail` events
`tail` | str | the tail event in the symbolic commonsense knowledge triple
`literal` | str | the symbolic commonsense knowledge in sentence-form
`narrative` | str | narrative based on the `literal`
`dialogue` | list of str | dialogue grounded in the `narrative`
`speakers` | list of str | the speakers for each turn in the `dialogue`
`PersonX` | str | the assigned name for PersonX in the commonsense knowledge triple
`PersonY` | str\|null | the assigned name for PersonY in the commonsense knowledge triple
`PersonZ` | str\|null | the assigned name for PersonZ in the commonsense knowledge triple
`original_index` | int | the original index from Atomic10x
`split` | str | the split information: {train, valid, test}
`head_answer` | str | the answer for whether the `head` is included in the `narrative`: {Yes, Unknown}
`pmi_head_answer` | str | the answer for whether the `head` is included in the `narrative` with point-wise mutual information applied: {Yes, No, Unknown}
`relation_tail_answer` | str | the answer for whether the `relation`-`tail` is included in the `dialogue`: {Yes, No, Unknown}
`pmi_relation_tail_answer` | str | the answer for whether the `relation`-`tail` is included in the `dialogue` with point-wise mutual information applied: {Yes, No, Unknown}
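
To make the field layout concrete, here is a minimal sketch that pairs `speakers` with `dialogue` turn by turn and keeps only records from a given split whose PMI-based checks both came back `Yes`. It assumes records are plain dicts with the fields in the table above. The sample record, the helper names `format_turns` and `is_validated`, and the choice to filter on the two `pmi_*` fields are illustrative assumptions rather than a recommended recipe.

```python
from typing import List


def format_turns(record: dict) -> List[str]:
    """Pair each speaker with the utterance at the same index."""
    speakers, turns = record["speakers"], record["dialogue"]
    if len(speakers) != len(turns):
        raise ValueError("speakers and dialogue must have the same length")
    return [f"{speaker}: {turn}" for speaker, turn in zip(speakers, turns)]


def is_validated(record: dict, split: str = "train") -> bool:
    """True if the record is in `split` and both PMI checks answered 'Yes'."""
    return (
        record["split"] == split
        and record["pmi_head_answer"] == "Yes"
        and record["pmi_relation_tail_answer"] == "Yes"
    )


# Hypothetical record for illustration only (not copied from the dataset).
sample = {
    "dialogue": ["I finally finished the race!", "I knew you could do it."],
    "speakers": ["Mina", "Coach"],
    "split": "train",
    "pmi_head_answer": "Yes",
    "pmi_relation_tail_answer": "Yes",
}

if is_validated(sample):
    for line in format_turns(sample):
        print(line)
```

Whether to filter on the plain or the PMI-based answers depends on how strict the downstream use needs to be; the sketch picks the PMI fields only as an example.
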
## Dataset Creation
To create 🥤SODA, we distill dialogues from InstructGPT by contextualizing social commonsense knowledge, i.e., adding context information in multiple steps: (1) retrieve social commonsense from the symbolic commonsense knowledge graph, (2) convert it into sentence form, (3) generate a narrative from the sentence, (4) infer the speakers from the narrative, and finally (5) derive a contentful conversation grounded in the narrative and speakers. Anchoring the language model in commonsense knowledge when deriving conversations offers two key advantages: (1) minimizing nonsensical conversations and (2) maximizing diversity. For more details, please refer to our [paper](https://arxiv.org/abs/2212.10465).
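
The five steps above can be sketched as a chain of small functions. This is only an outline under stated assumptions: `generate` stands in for a call to the language model, and the prompt wording, the `TEMPLATES` mapping, and all helper names are hypothetical. Step 4 is simplified to reusing the assigned names; the actual prompts and speaker-inference procedure are described in the paper, not here.

```python
from typing import Callable, Dict, List

# Hypothetical sentence template keyed by relation (not the paper's wording).
TEMPLATES = {"xReact": "{head}. As a result, PersonX feels {tail}."}


def retrieve(graph: List[Dict[str, str]], index: int) -> Dict[str, str]:
    """Step 1: pick a commonsense triple from the knowledge graph."""
    return graph[index]


def to_sentence(triple: Dict[str, str], names: Dict[str, str]) -> str:
    """Step 2: render the triple as a sentence and substitute assigned names."""
    sentence = TEMPLATES[triple["relation"]].format(
        head=triple["head"], tail=triple["tail"]
    )
    for placeholder, name in names.items():
        sentence = sentence.replace(placeholder, name)
    return sentence


def contextualize(
    triple: Dict[str, str], names: Dict[str, str], generate: Callable[[str], str]
) -> dict:
    literal = to_sentence(triple, names)
    # Step 3: expand the sentence into a narrative.
    narrative = generate(f"Expand into a short narrative: {literal}")
    # Step 4 (simplified assumption): reuse the assigned names as speakers.
    speakers = list(names.values())
    # Step 5: derive a conversation grounded in the narrative and speakers.
    # Returned as raw text here; the released `dialogue` field is a list of turns.
    dialogue = generate(
        f"{narrative}\nWrite a conversation between {' and '.join(speakers)}."
    )
    return {"literal": literal, "narrative": narrative,
            "speakers": speakers, "dialogue": dialogue}


def fake_model(prompt: str) -> str:
    """Stand-in for the language model so the sketch runs offline."""
    return f"[model output for: {prompt.splitlines()[0]}]"


# Hypothetical one-triple graph and name assignment.
graph = [{"head": "PersonX helps PersonY move", "relation": "xReact", "tail": "tired"}]
names = {"PersonX": "Mina", "PersonY": "Omar"}
result = contextualize(retrieve(graph, 0), names, fake_model)
print(result["literal"])
```

Passing the model in as a callable keeps the outline runnable without network access; a real run would replace `fake_model` with an actual model call.
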
### Further Details, Social Impacts, and Limitations
Please refer to our [paper](https://arxiv.org/abs/2212.10465).
## Trained Model
Using 🥤SODA, we train 🧑🏻‍🚀COSMO: a generalizable conversation agent that outperforms the previous best-performing agents on both in-domain and out-of-domain datasets. COSMO-3B is available [here](https://huggingface.co/allenai/cosmo-xl)!
## Additional Information
For a brief summary of our paper, please see this [tweet](https://twitter.com/hyunw__kim/status/1605400305126248448).
### Citation
Please cite our work if you find the resources in this repository useful:
```
@article{kim2022soda,
title={SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization},
author={Hyunwoo Kim and Jack Hessel and Liwei Jiang and Peter West and Ximing Lu and Youngjae Yu and Pei Zhou and Ronan Le Bras and Malihe Alikhani and Gunhee Kim and Maarten Sap and Yejin Choi},
journal={ArXiv},
year={2022},
volume={abs/2212.10465}
}
```
Provider: maas
Created: 2025-05-29



