facebook/empathetic_dialogues
Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.
Download link:
https://hf-mirror.com/datasets/facebook/empathetic_dialogues
Resource description:
---
annotations_creators:
- crowdsourced
language:
- en
language_creators:
- crowdsourced
license:
- cc-by-nc-4.0
multilinguality:
- monolingual
pretty_name: EmpatheticDialogues
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- conversational
- question-answering
task_ids:
- dialogue-generation
- open-domain-qa
paperswithcode_id: empatheticdialogues
dataset_info:
  features:
  - name: conv_id
    dtype: string
  - name: utterance_idx
    dtype: int32
  - name: context
    dtype: string
  - name: prompt
    dtype: string
  - name: speaker_idx
    dtype: int32
  - name: utterance
    dtype: string
  - name: selfeval
    dtype: string
  - name: tags
    dtype: string
  splits:
  - name: test
    num_bytes: 3011332
    num_examples: 10943
  - name: train
    num_bytes: 19040509
    num_examples: 76673
  - name: validation
    num_bytes: 3077481
    num_examples: 12030
  download_size: 28022709
  dataset_size: 25129322
---
# Dataset Card for "empathetic_dialogues"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://github.com/facebookresearch/EmpatheticDialogues](https://github.com/facebookresearch/EmpatheticDialogues)
- **Repository:** https://github.com/facebookresearch/EmpatheticDialogues
- **Paper:** [Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset](https://arxiv.org/abs/1811.00207)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 28.02 MB
- **Size of the generated dataset:** 25.13 MB
- **Total amount of disk used:** 53.15 MB
### Dataset Summary
PyTorch original implementation of Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Dataset Structure
### Data Instances
#### default
- **Size of downloaded dataset files:** 28.02 MB
- **Size of the generated dataset:** 25.13 MB
- **Total amount of disk used:** 53.15 MB
An example of 'train' looks as follows.
```
{
"context": "sentimental",
"conv_id": "hit:0_conv:1",
"prompt": "I remember going to the fireworks with my best friend. There was a lot of people_comma_ but it only felt like us in the world.",
"selfeval": "5|5|5_2|2|5",
"speaker_idx": 1,
"tags": "",
"utterance": "I remember going to see the fireworks with my best friend. It was the first time we ever spent time alone together. Although there was a lot of people_comma_ we felt like the only people in the world.",
"utterance_idx": 1
}
```
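The sample above contains the literal token `_comma_` where a comma would normally appear. The card does not document this, so the following is only a minimal sketch, assuming the token is a plain stand-in for `,`; the helper name `restore_commas` is hypothetical and not part of any library.

```python
def restore_commas(text: str) -> str:
    """Replace the `_comma_` placeholder with a literal comma (assumed meaning)."""
    return text.replace("_comma_", ",")


# The `prompt` value from the sample record above.
prompt = (
    "I remember going to the fireworks with my best friend. "
    "There was a lot of people_comma_ but it only felt like us in the world."
)

cleaned_prompt = restore_commas(prompt)
print(cleaned_prompt)
```

The same function could be applied to the `utterance` field, which uses the same token in the sample.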
### Data Fields
The data fields are the same among all splits.
#### default
- `conv_id`: a `string` feature.
- `utterance_idx`: an `int32` feature.
- `context`: a `string` feature.
- `prompt`: a `string` feature.
- `speaker_idx`: an `int32` feature.
- `utterance`: a `string` feature.
- `selfeval`: a `string` feature.
- `tags`: a `string` feature.
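Each row appears to hold a single utterance rather than a whole dialogue. The sketch below shows one way to rebuild conversations from rows, assuming that `conv_id` identifies a conversation and `utterance_idx` gives the turn order (both inferred from the field names and the sample, not stated in the card). The rows and the function name `group_conversations` are hypothetical.

```python
from collections import defaultdict


def group_conversations(rows):
    """Group utterance rows by conv_id and order each group by utterance_idx."""
    conversations = defaultdict(list)
    for row in rows:
        conversations[row["conv_id"]].append(row)
    return {
        conv_id: sorted(turns, key=lambda r: r["utterance_idx"])
        for conv_id, turns in conversations.items()
    }


# Hypothetical rows using only fields listed above; the text is made up.
rows = [
    {"conv_id": "hit:0_conv:1", "utterance_idx": 2, "speaker_idx": 0, "utterance": "second turn"},
    {"conv_id": "hit:0_conv:1", "utterance_idx": 1, "speaker_idx": 1, "utterance": "first turn"},
    {"conv_id": "hit:0_conv:2", "utterance_idx": 1, "speaker_idx": 2, "utterance": "another conversation"},
]

grouped = group_conversations(rows)
for conv_id, turns in grouped.items():
    print(conv_id, [t["utterance"] for t in turns])
```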
### Data Splits
| name |train|validation|test |
|-------|----:|---------:|----:|
|default|76673| 12030|10943|
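As a quick consistency check, the split counts and byte sizes from the card metadata can be summed and compared with the reported totals. This sketch uses only numbers that appear in the card; the variable names are illustrative, and MB is taken to mean 10^6 bytes, which is an assumption that happens to match the reported figures.

```python
# Values copied from the card's metadata.
splits = {
    "train": {"num_examples": 76673, "num_bytes": 19040509},
    "validation": {"num_examples": 12030, "num_bytes": 3077481},
    "test": {"num_examples": 10943, "num_bytes": 3011332},
}
download_size = 28022709
dataset_size = 25129322

total_examples = sum(s["num_examples"] for s in splits.values())
total_bytes = sum(s["num_bytes"] for s in splits.values())

# Share of rows held by each split.
for name, info in splits.items():
    share = info["num_examples"] / total_examples
    print(f"{name}: {info['num_examples']} rows ({share:.1%})")

# The per-split byte counts should add up to dataset_size.
print("bytes match dataset_size:", total_bytes == dataset_size)

# Total disk use is the download plus the generated dataset.
total_disk_mb = (download_size + dataset_size) / 1e6
print(f"total disk used: {total_disk_mb:.2f} MB")
```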
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
Creative Commons [Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/).
### Citation Information
```
@inproceedings{rashkin-etal-2019-towards,
title = "Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset",
author = "Rashkin, Hannah and
Smith, Eric Michael and
Li, Margaret and
Boureau, Y-Lan",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/P19-1534",
doi = "10.18653/v1/P19-1534",
pages = "5370--5381",
}
```
### Contributions
Thanks to [@thomwolf](https://github.com/thomwolf), [@patrickvonplaten](https://github.com/patrickvonplaten), [@lewtun](https://github.com/lewtun) for adding this dataset.
Provider:
facebook
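Summary of original information
Dataset overview
Basic information
- Name: EmpatheticDialogues
- Language: English (en)
- Creation method: crowdsourced
- License: CC-BY-NC-4.0
- Multilinguality: monolingual
- Size: 10K<n<100K
- Source data: original
- Task categories: conversational, question-answering
- Task IDs: dialogue-generation, open-domain-qa
Data structure
- Features:
  - conv_id: string
  - utterance_idx: integer (int32)
  - context: string
  - prompt: string
  - speaker_idx: integer (int32)
  - utterance: string
  - selfeval: string
  - tags: string
- Splits:
  - train: 76673 examples, 19040509 bytes
  - validation: 12030 examples, 3077481 bytes
  - test: 10943 examples, 3011332 bytes
Download and size
- Download size: 28022709 bytes
- Dataset size: 25129322 bytes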
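Collected summary
Dataset introduction

Construction
The EmpatheticDialogues dataset was built through crowdsourcing. Participants were asked to hold conversations grounded in a given emotional context, which produced dialogues with emotional depth. Because the conversations come from many different individuals, the corpus is varied and reflects how people actually write, making it a useful resource for research on emotional intelligence.
Features
The dataset's defining feature is its focus on emotional expression and empathy: the conversations go beyond everyday exchanges and explore emotional experiences. Conversations carry emotion labels such as sad, joyful, and angry, so researchers can train models and analyze their behavior across different emotional situations. Each record also includes the dialogue context, speaker information, and a self-evaluation field, which together support both emotion analysis and dialogue generation.
Usage
The EmpatheticDialogues dataset suits several natural language processing tasks, including dialogue generation, emotion analysis, and open-domain question answering. Researchers can load the dataset and use the dialogue context and emotion labels to train and evaluate models. The fields are listed in the dataset card, which simplifies preprocessing and feature extraction. The license (CC BY-NC 4.0) permits non-commercial use, which covers academic research and teaching.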
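To make the training use concrete, the sketch below turns one conversation into (history, response) pairs tagged with an emotion label. It is one common framing for dialogue generation, not necessarily the one used by the dataset's authors. It assumes that `context` holds the emotion label and that `utterance_idx` orders the turns; the rows and the function name `build_pairs` are hypothetical.

```python
def build_pairs(turns):
    """Build (history, response) examples from the turns of one conversation."""
    ordered = sorted(turns, key=lambda r: r["utterance_idx"])
    pairs = []
    # Every turn after the first becomes a target; earlier turns are its history.
    for i in range(1, len(ordered)):
        pairs.append(
            {
                "emotion": ordered[i]["context"],
                "history": [t["utterance"] for t in ordered[:i]],
                "response": ordered[i]["utterance"],
            }
        )
    return pairs


# Hypothetical conversation; the utterance text is made up for illustration.
turns = [
    {"utterance_idx": 1, "context": "sentimental", "utterance": "I watched fireworks with my best friend."},
    {"utterance_idx": 2, "context": "sentimental", "utterance": "That sounds like a lovely memory."},
    {"utterance_idx": 3, "context": "sentimental", "utterance": "It really was."},
]

pairs = build_pairs(turns)
for pair in pairs:
    print(pair["emotion"], len(pair["history"]), "->", pair["response"])
```

A model that conditions on the emotion label could prepend `emotion` to the history; whether that helps is an empirical question this sketch does not address.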
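Background and challenges
Background
In natural language processing, the emotional intelligence of dialogue systems has become a growing research focus. The EmpatheticDialogues dataset was released by Facebook Research, with the accompanying paper published at ACL 2019, to advance open-domain conversation models in understanding and expressing emotion. Its central research question is how to make dialogue systems more empathetic and thereby improve the user experience. The crowdsourced conversations cover a wide range of emotional situations and give researchers a benchmark for training and evaluating empathetic dialogue models. The dataset adds to the resources available for dialogue research and offers a new perspective for affective computing.
Current challenges
Building and using the EmpatheticDialogues dataset involves several challenges. First, collecting and annotating emotional dialogue demands care to keep the data authentic and the emotional expression accurate. Second, balancing the distribution of emotion types to avoid bias is difficult. Third, the scale and diversity of the data place high demands on model training; generalization is tested in particular by complex emotional exchanges. Finally, personal and sensitive information must be handled carefully to protect privacy, a concern shared by emotional dialogue datasets in general.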
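Common scenarios
Typical use cases
In emotionally intelligent dialogue systems, EmpatheticDialogues is typically used to develop and evaluate conversation models that understand and respond to a user's emotions. Researchers can train models on it to generate empathetic replies. This makes human-computer interaction feel more natural and approachable, and it is relevant to practical settings such as mental health support and customer service.
Academic problems addressed
The dataset addresses a long-standing research problem in dialogue systems: how to make a machine understand and respond to human emotion. By supplying a large set of emotional dialogue samples, it gives researchers a standardized benchmark for evaluating and improving a model's emotional understanding. This has advanced affective computing and laid groundwork for more human-centered dialogue systems.
Derived work
Work building on EmpatheticDialogues includes improved emotion recognition models, better algorithms for empathetic response generation, and new evaluation methods for empathetic dialogue systems. For example, some studies combine this dataset with other emotion datasets to further improve a model's emotional understanding and generation.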
The content above was collected and summarized by 遇见数据集.



