esCorpiusDialog: A Large-Scale Multilingual Dialogue Dataset in Spanish, Catalan, Basque, and Galician
Source: NIAID Data Ecosystem. Indexed: 2026-05-02
Download link:
https://zenodo.org/record/15017668
Description:
In response to the growing need for comprehensive conversational datasets, we introduce esCorpiusDialog, a large-scale, diverse dialogue corpus designed to facilitate the development of conversational models in Spanish, Catalan, Basque, and Galician. Addressing the critical shortage of high-quality conversational data for these languages, esCorpiusDialog encompasses an extensive 15.1 GiB corpus containing 26,900,187 dialogues with a total of 116,524,110 conversational turns and 1,382,496,364 tokens. On average, each dialogue includes 4.3 turns, enabling effective modeling of multi-turn interactions.
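The corpus-level figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the totals stated in the description; the function name `summarize` and the derived per-turn and per-dialogue token ratios are illustrative additions, not statistics reported by the dataset authors.

```python
# Totals copied from the dataset description.
DIALOGUES = 26_900_187
TURNS = 116_524_110
TOKENS = 1_382_496_364


def summarize(dialogues: int, turns: int, tokens: int) -> dict[str, float]:
    """Derive simple per-unit averages from corpus-level totals.

    `summarize` is a hypothetical helper for this check, not part of any
    tooling shipped with the dataset.
    """
    return {
        "turns_per_dialogue": turns / dialogues,
        "tokens_per_turn": tokens / turns,
        "tokens_per_dialogue": tokens / dialogues,
    }


stats = summarize(DIALOGUES, TURNS, TOKENS)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

The first ratio rounds to 4.3, which agrees with the average number of turns per dialogue given in the description.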
esCorpiusDialog aggregates data from multiple rich sources: movie subtitles (OpenSubtitles), newsgroups (Usenet), online forums (Mediavida, Reddit), and literature (Project Gutenberg). Specifically, the dataset comprises approximately 26.6 million dialogues in Spanish, 116 thousand in Basque, over 92 thousand in Catalan, and 63 thousand in Galician, making it the most extensive multilingual conversational dataset currently available for these languages.
The dataset has undergone meticulous processing to clearly define conversational turns and accurately segment dialogues, ensuring its suitability for training robust, open-domain conversational systems. With its breadth of topics and varied dialogue styles, esCorpiusDialog represents an invaluable resource for researchers and practitioners aiming to enhance the dialogue capabilities and generalization of fine-tuned large language models (LLMs) across diverse conversational applications.
Created:
2025-03-14



