Gutenberg Dialogue Dataset

Name: Gutenberg Dialogue Dataset
Creator: Budapest University of Technology and Economics
Published: 2021-01-23 01:54:25
License: Not specified

Source: arXiv (updated 2021-01-23, indexed 2024-06-21)

Download link:

https://github.com/ricsinaruto/gutenberg-dialog


Description:

The Gutenberg Dialogue Dataset is a high-quality dialogue dataset created jointly by the Budapest University of Technology and Economics and the University of Oxford. It contains 14.8 million English utterances, along with smaller datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian. The dialogues are extracted and processed from public-domain books provided by Project Gutenberg, with the aim of easing the trade-off between quality and size found in existing open-domain dialogue datasets. The dataset was built using a number of heuristics, and the authors carried out a detailed error analysis. It is particularly suited to improving response quality in zero-shot and fine-tuning settings, and it offers researchers a better size-quality trade-off for training multilingual and multi-turn dialogue models.
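
To make the idea of heuristic dialogue extraction concrete, the following is a minimal sketch and not the dataset's actual pipeline. It assumes utterances are wrapped in straight double quotes and that a dialogue ends once more than a fixed number of paragraphs pass without quoted text. The function name `extract_dialogues` and the parameter `max_gap` are hypothetical and chosen only for illustration.

```python
import re

# Assumption: utterances are delimited by straight double quotes. Real books
# use many delimiter styles, which this sketch does not attempt to handle.
QUOTE_RE = re.compile(r'"([^"]+)"')


def extract_dialogues(paragraphs, max_gap=1):
    """Group quoted utterances from consecutive paragraphs into dialogues.

    A dialogue is closed once more than `max_gap` consecutive paragraphs
    contain no quoted text. `max_gap` is a hypothetical tuning parameter.
    """
    dialogues, current, gap = [], [], 0
    for paragraph in paragraphs:
        quotes = [q.strip() for q in QUOTE_RE.findall(paragraph)]
        if quotes:
            # Treat all quoted spans in one paragraph as a single utterance.
            current.append(" ".join(quotes))
            gap = 0
        else:
            gap += 1
            if gap > max_gap and current:
                dialogues.append(current)
                current = []
    if current:
        dialogues.append(current)
    # Keep only exchanges with at least two utterances.
    return [d for d in dialogues if len(d) >= 2]


sample = [
    '"Hello," said Anna. "How are you?"',
    '"Fine, thank you," replied Ben.',
    "They walked on in silence.",
    "The road was long.",
    '"Only one line here."',
]
result = extract_dialogues(sample)
```

A simple gap threshold like this trades recall for precision: a small `max_gap` splits dialogues too eagerly, while a large one merges unrelated conversations, which is the kind of effect an error analysis would need to measure.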

Provider:

Budapest University of Technology and Economics

Created:

2020-04-27
