commonsense-dialogues
ModelScope community · Updated 2025-12-05 · Indexed 2025-05-24
Download link:
https://modelscope.cn/datasets/gretelai/commonsense-dialogues
Resource description:
## Commonsense-Dialogues Dataset
Commonsense-Dialogues is a crowdsourced dataset of ~11K dialogues grounded in social contexts that involve the use of commonsense. The dataset was released by the Amazon Alexa AI team in collaboration with the University of Southern California (USC), and is also available in the [Commonsense-Dialogues repo](https://github.com/alexa/Commonsense-Dialogues/tree/main).
The social contexts used were sourced from the **train** split of the [SocialIQA](https://leaderboard.allenai.org/socialiqa/submissions/get-started) dataset, a multiple-choice question-answering based social commonsense reasoning benchmark.
To collect the Commonsense-Dialogues dataset, each Turker was presented with a social context and asked to write a dialogue of 4-6 turns between two people based on the event(s) described in the context. The Turker was asked to alternate between the roles of an individual referenced in the context and a third-party friend. See the following dialogues as examples:
```
"1": {  # dialogue_id
    "context": "Sydney met Carson's mother for the first time last week. He liked her.",  # multiple individuals in the context: Sydney and Carson
    "speaker": "Sydney",  # role 1 = Sydney, role 2 = a third-person friend of Sydney
    "turns": [
        "I met Carson's mother last week for the first time.",
        "How was she?",
        "She turned out to be really nice. I like her.",
        "That's good to hear.",
        "It is, especially since Carson and I are getting serious.",
        "Well, at least you'll like your in-law if you guys get married."
    ]
},
"2": {
    "context": "Kendall had a party at Jordan's house but was found out to not have asked and just broke in.",
    "speaker": "Kendall",
    "turns": [
        "Did you hear about my party this weekend at Jordan\u2019s house?",
        "I heard it was amazing, but that you broke in.",
        "That was a misunderstanding, I had permission to be there.",
        "Who gave you permission?",
        "I talked to Jordan about it months ago before he left town to go to school, but he forgot to tell his roommates about it.",
        "Ok cool, I hope everything gets resolved."
    ]
}
```
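As a minimal sketch of how records in this shape might be consumed, the snippet below parses a small inline sample and pairs each turn with a speaker label. It assumes that the files are a JSON object keyed by dialogue ID, as the example suggests, and that turns strictly alternate starting with the named `speaker`; neither is a documented guarantee. The helper `label_turns` and the label `"Friend"` are hypothetical names introduced for illustration.

```python
import json

# Inline sample mirroring the structure shown above. The "#" comments in the
# README example are illustrative and not valid JSON, so they are omitted here.
SAMPLE = """
{
  "1": {
    "context": "Sydney met Carson's mother for the first time last week. He liked her.",
    "speaker": "Sydney",
    "turns": [
      "I met Carson's mother last week for the first time.",
      "How was she?",
      "She turned out to be really nice. I like her."
    ]
  }
}
"""


def label_turns(dialogue, friend_label="Friend"):
    """Pair each turn with a speaker label.

    Assumption: turns alternate, and the named speaker goes first.
    """
    roles = (dialogue["speaker"], friend_label)
    return [(roles[i % 2], turn) for i, turn in enumerate(dialogue["turns"])]


dialogues = json.loads(SAMPLE)
labeled = label_turns(dialogues["1"])
for who, text in labeled:
    print(f"{who}: {text}")
```

To apply the same helper to a real split, replace `json.loads(SAMPLE)` with `json.load` on an opened `train.json`, provided the file matches the structure assumed here.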
The data consist of three splits: `train.json` has ~9K dialogues, and `valid.json` and `test.json` have ~1K dialogues each. Because all the contexts were sourced from the **train** split of SocialIQA, any form of **multi-task** training and evaluation that combines Commonsense-Dialogues with SocialIQA must be done with caution to ensure fair and accurate conclusions.
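One way to act on this caution is to check which contexts a set of dialogues shares with another collection of context strings (for example, contexts from a SocialIQA split you plan to train or evaluate on). The sketch below is an illustration only: the helpers `context_set` and `shared_contexts` are hypothetical, the toy data is made up, and the whitespace and case normalization is an assumption about how the two sources might differ.

```python
def normalize(text):
    """Collapse whitespace and lowercase so trivially different strings match."""
    return " ".join(text.split()).lower()


def context_set(dialogues):
    """Collect normalized contexts from a {dialogue_id: record} mapping."""
    return {normalize(d["context"]) for d in dialogues.values()}


def shared_contexts(dialogues, other_contexts):
    """Return normalized contexts that also occur in another collection."""
    return context_set(dialogues) & {normalize(c) for c in other_contexts}


# Toy data for illustration; not taken from either dataset.
toy_dialogues = {
    "1": {"context": "Kendall had a party.", "speaker": "Kendall", "turns": []},
    "2": {"context": "Sydney met a friend.", "speaker": "Sydney", "turns": []},
}
toy_other = ["Kendall  had a party.", "Alex went home."]

overlap = shared_contexts(toy_dialogues, toy_other)
print(sorted(overlap))
```

Any examples whose contexts land in the overlap could then be excluded from one side of the experiment, or at least reported alongside the results.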
Some statistics about the data are provided below:
| Stat | Train | Valid | Test |
| ---- | ---- | ---- | ---- |
| # of dialogues | 9058 | 1157 | 1158 |
| Average # of turns per dialogue | 5.72 | 5.72 | 5.71 |
| Average # of words per turn | 12.4 | 12.4 | 12.2 |
| # of distinct SocialIQA contexts used | 3672 | 483 | 473 |
| Average # of dialogues per SocialIQA context | 2.46 | 2.395 | 2.45 |
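The statistics in the table could be recomputed along the following lines. This is a sketch under stated assumptions: the hypothetical helper `split_stats` takes a `{dialogue_id: record}` mapping in the format shown earlier, and it counts words by splitting on whitespace, which may not match the tokenization used for the published numbers. The toy input is made up for illustration.

```python
from collections import Counter


def split_stats(dialogues):
    """Compute per-split statistics for a {dialogue_id: record} mapping."""
    n_dialogues = len(dialogues)
    turns = [t for d in dialogues.values() for t in d["turns"]]
    per_context = Counter(d["context"] for d in dialogues.values())
    return {
        "dialogues": n_dialogues,
        "avg_turns": len(turns) / n_dialogues,
        # Whitespace tokenization is an assumption, not the documented method.
        "avg_words_per_turn": sum(len(t.split()) for t in turns) / len(turns),
        "distinct_contexts": len(per_context),
        "avg_dialogues_per_context": n_dialogues / len(per_context),
    }


# Toy data for illustration; not taken from the dataset.
toy = {
    "1": {"context": "A", "speaker": "X", "turns": ["hello there", "hi"]},
    "2": {"context": "A", "speaker": "X",
          "turns": ["how are you", "fine thanks", "good", "bye now"]},
    "3": {"context": "B", "speaker": "Y", "turns": ["one two three", "four"]},
}

stats = split_stats(toy)
for name, value in stats.items():
    print(f"{name}: {value}")
```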
## Security
See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
## License
This repository is licensed under the CC-BY-NC 4.0 License.
## Citation
If you use this dataset, please cite the following paper:
```
@inproceedings{zhou-etal-2021-commonsense,
title = "Commonsense-Focused Dialogues for Response Generation: An Empirical Study",
author = "Zhou, Pei and
Gopalakrishnan, Karthik and
Hedayatnia, Behnam and
Kim, Seokhwan and
Pujara, Jay and
Ren, Xiang and
Liu, Yang and
Hakkani-Tur, Dilek",
booktitle = "Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue",
year = "2021",
address = "Singapore and Online",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2109.06427"
}
```
Note that the paper uses newly collected dialogues as well as dialogues filtered from existing datasets. This repo contains only our newly collected dialogues.
Provider:
maas
Created:
2025-05-20



