
esCorpiusDialog: A Large-Scale Multilingual Dialogue Dataset in Spanish, Catalan, Basque, and Galician

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/15017668
Description:
High-quality conversational data is scarce for Spanish, Catalan, Basque, and Galician. To address this shortage, we introduce esCorpiusDialog, a large-scale, diverse dialogue corpus designed to support the development of conversational models in these four languages. The corpus is 15.1 GiB in size and contains 26,900,187 dialogues with a total of 116,524,110 conversational turns and 1,382,496,364 tokens. Each dialogue averages 4.3 turns, which supports the modeling of multi-turn interactions. esCorpiusDialog aggregates data from four kinds of sources: movie subtitles (OpenSubtitles), newsgroups (Usenet), online forums (Mediavida, Reddit), and literature (Project Gutenberg). The dataset comprises approximately 26.6 million dialogues in Spanish, 116 thousand in Basque, over 92 thousand in Catalan, and 63 thousand in Galician, making it the most extensive multilingual conversational dataset currently available for these languages. The data has been processed to delimit conversational turns and segment dialogues, so that it is suitable for training open-domain conversational systems. With its breadth of topics and varied dialogue styles, esCorpiusDialog is a resource for researchers and practitioners who want to improve the dialogue capabilities and generalization of fine-tuned large language models (LLMs) across conversational applications.
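The headline figures in the description can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the three totals quoted above; the variable names are illustrative, and the tokens-per-turn and tokens-per-dialogue ratios are derived here for convenience rather than reported by the dataset authors.

```python
# Totals quoted in the dataset description.
DIALOGUES = 26_900_187
TURNS = 116_524_110
TOKENS = 1_382_496_364

# Average number of turns per dialogue (the description reports 4.3).
turns_per_dialogue = TURNS / DIALOGUES

# Derived ratios, not stated in the description.
tokens_per_turn = TOKENS / TURNS
tokens_per_dialogue = TOKENS / DIALOGUES

print(f"turns per dialogue:  {turns_per_dialogue:.1f}")
print(f"tokens per turn:     {tokens_per_turn:.1f}")
print(f"tokens per dialogue: {tokens_per_dialogue:.1f}")
```

The first ratio rounds to the 4.3 turns per dialogue given in the description, so the three totals are consistent with one another.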
Created:
2025-03-14