M3LS
Source: arXiv. Updated 2023-02-14. Indexed 2024-06-21.
Download link:
https://github.com/anubhav-jangra/M3LS
Description:
The M3LS dataset is described as the largest multilingual multimodal summarization dataset to date, containing over 1.1 million document-image pairs, each with a professionally annotated multimodal summary. It is derived from BBC News articles published over the past decade and covers 20 languages, chosen for diversity across five language roots. It is also the largest summarization dataset for 13 languages and includes cross-lingual summarization data for 2 languages. Intended applications include automatic summarization, article title generation, keyword extraction, and image captioning, among others, with the broader aim of advancing multimodal and multilingual research.
Provided by:
Indian Institute of Technology Patna
Created:
2023-02-14



