curiosity_dialogs
Source: ModelScope community (updated 2025-11-27, indexed 2025-05-24)
Download link: https://modelscope.cn/datasets/facebook/curiosity_dialogs
Description:
# Dataset Card for Curiosity Dataset
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [Curiosity Dataset Homepage](https://www.pedro.ai/curiosity)
- **Repository:** [Curiosity Dataset Repository](https://github.com/facebookresearch/curiosity)
- **Paper:** [ACL Anthology](https://www.aclweb.org/anthology/2020.emnlp-main.655/)
- **Point of Contact:** [Pedro Rodriguez](https://mailhide.io/e/wbfjM)
### Dataset Summary
The Curiosity dataset consists of 14K English dialogs (181K utterances) in which users and assistants converse about geographic topics such as geopolitical entities and locations. The dataset is annotated with pre-existing user knowledge, message-level dialog acts, grounding to Wikipedia, and user reactions to messages.
### Supported Tasks and Leaderboards
* `text-generation-other-conversational-curiosity`: The dataset can be used to train models for Conversational Curiosity, a task built to test the hypothesis that engagement increases when users are presented with facts related to what they already know. Success on this task is typically measured by high [Accuracy](https://huggingface.co/metrics/accuracy) and [F1 Score](https://huggingface.co/metrics/f1).
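As a rough illustration of the two metrics named above, the sketch below computes accuracy and F1 for binary labels in plain Python. Treating the task as a 0/1 prediction (for example, whether a message was liked) is an assumption made for this example, and the label lists are hypothetical rather than drawn from the dataset.

```python
def accuracy_and_f1(gold, pred):
    """Return (accuracy, F1) for two equal-length lists of 0/1 labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("need at least one label")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    denom = 2 * tp + fp + fn
    # F1 is defined as 0.0 here when there are no positives at all.
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical labels, not taken from the dataset.
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(gold, pred)
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```

On heavily imbalanced labels the two numbers can diverge sharply, which is one reason to report both.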
### Languages
The text in the dataset is in English and was collected by crowdsourcing. The associated BCP-47 code is `en`.
## Dataset Structure
### Data Instances
A typical data point consists of a dialog between a user and an assistant, followed by the attributes of that dialog.
An example from the Curiosity Dataset train set looks as follows:
```
{'annotated': 1,
'aspects': ['Media', 'Politics and government'],
'assistant_dialog_rating': 5,
'assistant_id': 341,
'assistant_other_agent_rating': 5,
'created_time': 1571783665,
'dialog_id': 21922,
'first_aspect': 'Media',
'focus_entity': 'Namibia',
'inferred_steps': 1,
'is_annotated': 0,
'known_entities': ['South Africa', 'United Kingdom', 'Portugal'],
'messages': {'dialog_acts': [['request_topic'],
['inform_response'],
['request_aspect'],
['inform_response'],
['request_followup'],
['inform_response'],
['request_aspect', 'feedback_positive'],
['inform_response'],
['request_followup'],
['inform_response'],
[],
[]],
'facts': [{'fid': [], 'source': [], 'used': []},
{'fid': [77870, 77676, 77816, 77814, 77775, 77659, 77877, 77785, 77867],
'source': [0, 1, 2, 2, 0, 2, 0, 1, 1],
'used': [0, 0, 0, 0, 0, 0, 0, 0, 0]},
{'fid': [], 'source': [], 'used': []},
{'fid': [77725, 77870, 77676, 77863, 77814, 77775, 77659, 77877, 77867],
'source': [2, 0, 1, 1, 2, 0, 2, 0, 1],
'used': [0, 0, 0, 0, 0, 0, 0, 0, 0]},
{'fid': [], 'source': [], 'used': []},
{'fid': [77694, 77661, 77863, 77780, 77671, 77704, 77869, 77693, 77877],
'source': [1, 2, 1, 0, 2, 2, 0, 1, 0],
'used': [0, 0, 0, 0, 0, 0, 0, 0, 1]},
{'fid': [], 'source': [], 'used': []},
{'fid': [77816, 77814, 77864, 77659, 77877, 77803, 77738, 77784, 77789],
'source': [2, 2, 0, 2, 0, 1, 1, 0, 1],
'used': [0, 0, 0, 0, 0, 0, 0, 0, 0]},
{'fid': [], 'source': [], 'used': []},
{'fid': [77694, 77776, 77780, 77696, 77707, 77693, 77778, 77702, 77743],
'source': [1, 0, 0, 2, 1, 1, 0, 2, 2],
'used': [0, 0, 0, 0, 0, 0, 0, 0, 0]},
{'fid': [], 'source': [], 'used': []},
{'fid': [77662, 77779, 77742, 77734, 77663, 77777, 77702, 77731, 77778],
'source': [1, 0, 2, 1, 2, 0, 2, 1, 0],
'used': [0, 0, 0, 0, 0, 0, 0, 0, 1]}],
'liked': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'message': ['Hi. I want information about Namibia.',
'Nmbia is a country in southern Africa.',
'Do you have information about the media there?',
'A mentional amount of foriegn',
'What about it?',
"Media and journalists in Namibia are represented by the Namibia chapter of the Media Institute of 'southern Africa and the Editors Forum of Namibia.",
'Interesting! What can you tell me about the politics and government?',
'Namibia formed the Namibian Defence Force, comprising former enemies in a 23-year bush war.',
'Do you have more information about it?',
"With a small army and a fragile economy , the Namibian government's principal foreign policy concern is developing strengthened ties within the Southern African region.",
"That's all I wanted to know. Thank you!",
'My pleasure!'],
'message_id': ['617343895',
'2842515356',
'4240816985',
'520711081',
'1292358002',
'3677078227',
'1563061125',
'1089028270',
'1607063839',
'113037558',
'1197873991',
'1399017322'],
'sender': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]},
'related_entities': ['Western Roman Empire',
'United Kingdom',
'Portuguese language',
'Southern African Development Community',
'South Africa',
'Kalahari Desert',
'Namib Desert',
'League of Nations',
'Afrikaans',
'Sub-Saharan Africa',
'Portugal',
'South-West Africa',
'Warmbad, Namibia',
'German language',
'NBC'],
'reported': 0,
'second_aspect': 'Politics and government',
'shuffle_facts': 1,
'tag': 'round_2',
'user_dialog_rating': 5,
'user_id': 207,
'user_other_agent_rating': 5}
```
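In the example above, `messages` is stored column-wise: `message`, `sender`, `dialog_acts`, `facts`, `liked`, and `message_id` are parallel lists with one entry per turn. The sketch below regroups them into one record per turn and picks out the fact IDs flagged in `used`. It runs on the last two turns of the example; the mapping of sender `0` to the user and `1` to the assistant is inferred from that example and is an assumption, and `iter_turns` is a hypothetical helper name.

```python
# Assumption inferred from the example: 0 opens the dialog (user), 1 answers.
SENDER_ROLE = {0: "user", 1: "assistant"}

# The last two turns of the example dialog, copied from the instance above.
messages = {
    "message": ["That's all I wanted to know. Thank you!", "My pleasure!"],
    "message_id": ["1197873991", "1399017322"],
    "sender": [0, 1],
    "dialog_acts": [[], []],
    "liked": [0, 0],
    "facts": [
        {"fid": [], "source": [], "used": []},
        {
            "fid": [77662, 77779, 77742, 77734, 77663, 77777, 77702, 77731, 77778],
            "source": [1, 0, 2, 1, 2, 0, 2, 1, 0],
            "used": [0, 0, 0, 0, 0, 0, 0, 0, 1],
        },
    ],
}


def iter_turns(messages):
    """Yield one dict per turn from the parallel lists in `messages`."""
    keys = ("message", "message_id", "sender", "dialog_acts", "liked", "facts")
    lengths = {len(messages[k]) for k in keys}
    if len(lengths) != 1:
        raise ValueError("parallel lists in `messages` differ in length")
    for i in range(lengths.pop()):
        facts = messages["facts"][i]
        yield {
            "role": SENDER_ROLE.get(messages["sender"][i], "unknown"),
            "text": messages["message"][i],
            "message_id": messages["message_id"][i],
            "dialog_acts": messages["dialog_acts"][i],
            "liked": bool(messages["liked"][i]),
            # Keep only the fact IDs whose `used` flag is set.
            "used_fact_ids": [
                fid for fid, flag in zip(facts["fid"], facts["used"]) if flag
            ],
        }


turns = list(iter_turns(messages))
for turn in turns:
    print(turn["role"], turn["used_fact_ids"], turn["text"])
```

Row-wise records are usually easier to filter (for instance, assistant turns only) than the column-wise layout, at the cost of some extra memory.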
### Data Fields
* `messages`: The messages exchanged between the user and the assistant, together with their associated attributes
  * `dialog_acts`: List of dialog acts for each message
  * `facts`: List of facts returned by the assistant
    * `fid`: Fact ID
    * `source`: Source for the fact
    * `used`: Whether facts were used before in the same dialog
  * `liked`: List of values indicating whether each message was liked
  * `message`: List of messages between the user and the assistant
  * `message_id`: Message ID
  * `sender`: Message author ID (numeric)
* `known_entities`: Rooted facts about entities the user knows
* `focus_entity`: Entity in focus in the dialog
* `dialog_id`: Dialog ID
* `inferred_steps`: Number of inferred steps
* `created_time`: Time of creation of the dialog
* `aspects`: List of two aspects which the dialog is about
* `first_aspect`: First aspect
* `second_aspect`: Second aspect
* `shuffle_facts`: Whether facts were shuffled
* `related_entities`: List of fifteen entities related to the focus entity
* `tag`: Conversation tag
* `user_id`: User ID
* `assistant_id`: Assistant ID
* `is_annotated`: 0 or 1 (More Information Needed)
* `user_dialog_rating`: 1 - 5 (More Information Needed)
* `user_other_agent_rating`: 1 - 5 (More Information Needed)
* `assistant_dialog_rating`: 1 - 5 (More Information Needed)
* `assistant_other_agent_rating`: 1 - 5 (More Information Needed)
* `reported`: Whether the dialog was reported inappropriate
* `annotated`: 0 or 1 (More Information Needed)
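The value ranges listed above can be turned into a simple sanity check for the top-level fields of one dialog. The sketch below is illustrative only: `check_dialog` and `created_at` are hypothetical helpers, reading `created_time` as a Unix timestamp in seconds is an assumption, and so is treating `reported` and `shuffle_facts` as 0/1 flags (the card only shows the values `0` and `1` in its example).

```python
from datetime import datetime, timezone

RATING_FIELDS = (
    "user_dialog_rating",
    "user_other_agent_rating",
    "assistant_dialog_rating",
    "assistant_other_agent_rating",
)
# Assumption: all four are 0/1 flags, as in the example instance.
FLAG_FIELDS = ("is_annotated", "annotated", "reported", "shuffle_facts")


def check_dialog(record):
    """Return a list of problems found in the top-level fields of one dialog."""
    problems = []
    for name in RATING_FIELDS:
        if not 1 <= record[name] <= 5:
            problems.append(f"{name} outside 1-5: {record[name]}")
    for name in FLAG_FIELDS:
        if record[name] not in (0, 1):
            problems.append(f"{name} is not 0 or 1: {record[name]}")
    if record["aspects"] != [record["first_aspect"], record["second_aspect"]]:
        problems.append("aspects does not match first_aspect/second_aspect")
    return problems


def created_at(record):
    """Interpret created_time as Unix seconds (an assumption) and return UTC."""
    return datetime.fromtimestamp(record["created_time"], tz=timezone.utc)


# Top-level fields copied from the example instance (messages omitted).
record = {
    "annotated": 1,
    "is_annotated": 0,
    "reported": 0,
    "shuffle_facts": 1,
    "aspects": ["Media", "Politics and government"],
    "first_aspect": "Media",
    "second_aspect": "Politics and government",
    "created_time": 1571783665,
    "user_dialog_rating": 5,
    "user_other_agent_rating": 5,
    "assistant_dialog_rating": 5,
    "assistant_other_agent_rating": 5,
}

print(check_dialog(record))
print(created_at(record).isoformat())
```

If the real data uses a sentinel for missing ratings, the range check would flag it; in that case the check should be relaxed rather than the data discarded.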
### Data Splits
The data is split into `train`, `validation`, `test`, and `test_zero` sets, following the original dataset split.
| | train | validation | test | test_zero |
|-----------------------|------:|-----------:|-----:|----------:|
| Input dialog examples | 10287 | 1287 | 1287 | 1187 |
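The split sizes in the table add up to 14,048 dialogs, which matches the "14K" figure in the summary. The sketch below computes each split's share and wraps a loader around the Hugging Face `datasets` library. The hub identifier `facebook/curiosity_dialogs` is an assumption taken from the download URL path, and `load_curiosity_split` is a hypothetical helper; depending on the installed `datasets` version, loading may need additional arguments.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {
    "train": 10287,
    "validation": 1287,
    "test": 1287,
    "test_zero": 1187,
}

total = sum(SPLIT_SIZES.values())
fractions = {name: size / total for name, size in SPLIT_SIZES.items()}

print(f"total dialogs: {total}")
for name, share in fractions.items():
    print(f"{name:<10} {SPLIT_SIZES[name]:>6} {share:6.1%}")


def load_curiosity_split(split, dataset_id="facebook/curiosity_dialogs"):
    """Load one split with the `datasets` library.

    The default dataset_id is an assumption, not confirmed by the card.
    """
    if split not in SPLIT_SIZES:
        raise ValueError(f"unknown split {split!r}; expected one of {list(SPLIT_SIZES)}")
    # Imported lazily so the size summary above works without the library.
    from datasets import load_dataset

    return load_dataset(dataset_id, split=split)
```

A call such as `load_curiosity_split("test_zero")` would then return that split, and comparing `len(...)` of the result against `SPLIT_SIZES` is a quick way to confirm the download matches the card.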
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/legalcode)
### Citation Information
```
@inproceedings{rodriguez2020curiosity,
title = {Information Seeking in the Spirit of Learning: a Dataset for Conversational Curiosity},
author = {Pedro Rodriguez and Paul Crook and Seungwhan Moon and Zhiguang Wang},
year = 2020,
booktitle = {Empirical Methods in Natural Language Processing}
}
```
### Contributions
Thanks to [@vineeths96](https://github.com/vineeths96) for adding this dataset.
Provider: maas
Created: 2025-05-20



