MDIA

arXiv | Updated 2022-08-28 | Indexed 2024-06-21

Download link:

https://github.com/DoctorDream/mDIA

Resource overview:

The MDIA dataset was created jointly by the State Key Laboratory for Novel Software Technology at Nanjing University and the Department of Language Science and Technology at Saarland University. It is the first large-scale multilingual benchmark for dialogue generation, covering 46 languages across 19 language families. The dialogues are real-life conversations, collected mainly from Reddit's traffic over the whole of 2020. MDIA targets the problem that dialogue generation research concentrates on high-resource languages such as English; by offering a multilingual benchmark, it aims to promote linguistic diversity and advance dialogue generation for low-resource languages. The dataset was built by downloading user comments from Reddit and then extracting dialogues in each language from them. Intended uses include studying how existing techniques can improve generation quality for low-resource languages when training data is limited, and evaluating and comparing dialogue generation models across languages.
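
To make the "download comments, then extract dialogues" step concrete, here is a minimal sketch of how context-response pairs might be rebuilt from Reddit comments. It is not the MDIA authors' pipeline. It assumes each comment is a dict with the `id`, `parent_id`, and `body` fields used in public Reddit comment dumps, where a `parent_id` starting with `t1_` points to another comment. The function name `extract_pairs` and the optional `detect_lang` callback are hypothetical; any language identifier could be plugged in.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Placeholder bodies Reddit leaves behind for deleted or moderated comments.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def extract_pairs(
    comments: List[Dict[str, str]],
    detect_lang: Optional[Callable[[str], str]] = None,
) -> List[Tuple[Optional[str], str, str]]:
    """Return (language, context, response) triples from a flat comment list.

    Assumption: `detect_lang` is a caller-supplied function mapping text to a
    language code. When it is omitted, no language filtering is applied and
    the language slot is None.
    """
    by_id = {c["id"]: c for c in comments}
    pairs = []
    for reply in comments:
        parent_ref = reply.get("parent_id", "")
        # Only replies to other comments form a dialogue turn here;
        # replies to the submission itself ("t3_...") are skipped.
        if not parent_ref.startswith("t1_"):
            continue
        parent = by_id.get(parent_ref[len("t1_"):])
        if parent is None:
            continue  # parent comment is not in this batch
        if parent["body"] in PLACEHOLDERS or reply["body"] in PLACEHOLDERS:
            continue
        lang = None
        if detect_lang is not None:
            lang = detect_lang(reply["body"])
            # Keep only pairs whose two turns share one language.
            if detect_lang(parent["body"]) != lang:
                continue
        pairs.append((lang, parent["body"], reply["body"]))
    return pairs
```

Grouping the resulting triples by their language slot would give per-language dialogue sets, which is the shape a multilingual benchmark of this kind needs.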

Provider:

State Key Laboratory for Novel Software Technology, Nanjing University

Created:

2022-08-28
