MLSUM
Source: arXiv | Updated: 2020-04-30 | Indexed: 2024-08-06
Download link:
http://arxiv.org/abs/2004.14900v1
Resource overview:
MLSUM is the first large-scale multilingual summarization dataset. It contains over 1.5 million article/summary pairs collected from online newspapers in five languages: French, German, Spanish, Russian, and Turkish. It complements the English CNN/Daily Mail dataset, and together the two form a large-scale multilingual resource intended to advance research in text summarization. Because the articles and summaries are gathered directly from online news sources, MLSUM offers rich linguistic material for research, with a particular focus on the data scarcity that limits cross-lingual model training.
Provided by:
French National Centre for Scientific Research (CNRS)
Created:
2020-04-30



