
reciTAL/mlsum

Source: Hugging Face. Updated 2024-01-18. Indexed 2024-06-15.
Download link:
https://hf-mirror.com/datasets/reciTAL/mlsum
Resource description:
---
annotations_creators:
- found
language_creators:
- found
language:
- de
- es
- fr
- ru
- tr
license:
- other
multilinguality:
- multilingual
size_categories:
- 100K<n<1M
- 10K<n<100K
source_datasets:
- extended|cnn_dailymail
- original
task_categories:
- summarization
- translation
- text-classification
task_ids:
- news-articles-summarization
- multi-class-classification
- multi-label-classification
- topic-classification
paperswithcode_id: mlsum
pretty_name: MLSUM
dataset_info:
- config_name: de
  features:
  - name: text
    dtype: string
  - name: summary
    dtype: string
  - name: topic
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: date
    dtype: string
  splits:
  - name: train
    num_bytes: 846959840
    num_examples: 220887
  - name: validation
    num_bytes: 47119541
    num_examples: 11394
  - name: test
    num_bytes: 46847612
    num_examples: 10701
  download_size: 1005814154
  dataset_size: 940926993
- config_name: es
  features:
  - name: text
    dtype: string
  - name: summary
    dtype: string
  - name: topic
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: date
    dtype: string
  splits:
  - name: train
    num_bytes: 1214558302
    num_examples: 266367
  - name: validation
    num_bytes: 50643400
    num_examples: 10358
  - name: test
    num_bytes: 71263665
    num_examples: 13920
  download_size: 1456211154
  dataset_size: 1336465367
- config_name: fr
  features:
  - name: text
    dtype: string
  - name: summary
    dtype: string
  - name: topic
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: date
    dtype: string
  splits:
  - name: train
    num_bytes: 1471965014
    num_examples: 392902
  - name: validation
    num_bytes: 70413212
    num_examples: 16059
  - name: test
    num_bytes: 69660288
    num_examples: 15828
  download_size: 1849565564
  dataset_size: 1612038514
- config_name: ru
  features:
  - name: text
    dtype: string
  - name: summary
    dtype: string
  - name: topic
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: date
    dtype: string
  splits:
  - name: train
    num_bytes: 257389497
    num_examples: 25556
  - name: validation
    num_bytes: 9128497
    num_examples: 750
  - name: test
    num_bytes: 9656398
    num_examples: 757
  download_size: 766226107
  dataset_size: 276174392
- config_name: tu
  features:
  - name: text
    dtype: string
  - name: summary
    dtype: string
  - name: topic
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: date
    dtype: string
  splits:
  - name: train
    num_bytes: 641622783
    num_examples: 249277
  - name: validation
    num_bytes: 25530661
    num_examples: 11565
  - name: test
    num_bytes: 27830212
    num_examples: 12775
  download_size: 942308960
  dataset_size: 694983656
config_names:
- de
- es
- fr
- ru
- tu
---

# Dataset Card for MLSUM

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** []()
- **Repository:** https://github.com/recitalAI/MLSUM
- **Paper:** https://www.aclweb.org/anthology/2020.emnlp-main.647/
- **Point of Contact:** [email](thomas@recital.ai)
- **Size of downloaded dataset files:** 1.83 GB
- **Size of the generated dataset:** 4.86 GB
- **Total amount of disk used:** 6.69 GB

### Dataset Summary

We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, French, German, Spanish, Russian, Turkish. Together with English newspapers from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### de

- **Size of downloaded dataset files:** 346.58 MB
- **Size of the generated dataset:** 940.93 MB
- **Total amount of disk used:** 1.29 GB

An example of 'validation' looks as follows.

```
{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}
```

#### es

- **Size of downloaded dataset files:** 513.31 MB
- **Size of the generated dataset:** 1.34 GB
- **Total amount of disk used:** 1.85 GB

An example of 'validation' looks as follows.

```
{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}
```

#### fr

- **Size of downloaded dataset files:** 619.99 MB
- **Size of the generated dataset:** 1.61 GB
- **Total amount of disk used:** 2.23 GB

An example of 'validation' looks as follows.

```
{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}
```

#### ru

- **Size of downloaded dataset files:** 106.22 MB
- **Size of the generated dataset:** 276.17 MB
- **Total amount of disk used:** 382.39 MB

An example of 'train' looks as follows.

```
{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}
```

#### tu

- **Size of downloaded dataset files:** 247.50 MB
- **Size of the generated dataset:** 694.99 MB
- **Total amount of disk used:** 942.48 MB

An example of 'train' looks as follows.

```
{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}
```

### Data Fields

The data fields are the same among all splits.

#### de

- `text`: a `string` feature.
- `summary`: a `string` feature.
- `topic`: a `string` feature.
- `url`: a `string` feature.
- `title`: a `string` feature.
- `date`: a `string` feature.

#### es

- `text`: a `string` feature.
- `summary`: a `string` feature.
- `topic`: a `string` feature.
- `url`: a `string` feature.
- `title`: a `string` feature.
- `date`: a `string` feature.

#### fr

- `text`: a `string` feature.
- `summary`: a `string` feature.
- `topic`: a `string` feature.
- `url`: a `string` feature.
- `title`: a `string` feature.
- `date`: a `string` feature.

#### ru

- `text`: a `string` feature.
- `summary`: a `string` feature.
- `topic`: a `string` feature.
- `url`: a `string` feature.
- `title`: a `string` feature.
- `date`: a `string` feature.

#### tu

- `text`: a `string` feature.
- `summary`: a `string` feature.
- `topic`: a `string` feature.
- `url`: a `string` feature.
- `title`: a `string` feature.
- `date`: a `string` feature.

### Data Splits

|name| train|validation| test|
|----|-----:|---------:|----:|
|de  |220887|     11394|10701|
|es  |266367|     10358|13920|
|fr  |392902|     16059|15828|
|ru  | 25556|       750|  757|
|tu  |249277|     11565|12775|

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

Usage of dataset is restricted to non-commercial research purposes only. Copyright belongs to the original copyright holders. See https://github.com/recitalAI/MLSUM#mlsum

### Citation Information

```
@article{scialom2020mlsum,
  title={MLSUM: The Multilingual Summarization Corpus},
  author={Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
  journal={arXiv preprint arXiv:2004.14900},
  year={2020}
}
```

### Contributions

Thanks to [@RachelKer](https://github.com/RachelKer), [@albertvillanova](https://github.com/albertvillanova), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
Provider:
reciTAL
Summary of the original information

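Dataset overview

Basic information

  • Dataset name: MLSUM
  • Languages: German (de), Spanish (es), French (fr), Russian (ru), Turkish (config name tu; language code tr)
  • License: other (non-commercial research use only)
  • Multilinguality: multilingual
  • Size categories: 100K<n<1M, 10K<n<100K
  • Source datasets: extended from cnn_dailymail; original
  • Task categories: summarization, translation, text classification
  • Task IDs: news article summarization, multi-class classification, multi-label classification, topic classification
  • Papers with Code ID: mlsum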

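Dataset structure

Data instances

Each language configuration contains the following fields:

  • text: the article body
  • summary: the article summary
  • topic: the article topic
  • url: the article URL
  • title: the article title
  • date: the article date

Data splits

Each language configuration contains the following splits:

  • train: training set
  • validation: validation set
  • test: test set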
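A minimal sketch of how one configuration and split might be requested, assuming the Hugging Face `datasets` package is installed, network access is available, and the `reciTAL/mlsum` repository is loadable by the installed version. The helper names `check_request` and `load_mlsum` are hypothetical and not part of the dataset or the library; only `datasets.load_dataset` is a real library call.

```python
# Config and split names exactly as listed in this card.
MLSUM_CONFIGS = ("de", "es", "fr", "ru", "tu")
MLSUM_SPLITS = ("train", "validation", "test")


def check_request(config: str, split: str) -> tuple[str, str]:
    """Validate a (config, split) pair against the names listed in the card."""
    if config not in MLSUM_CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {MLSUM_CONFIGS}")
    if split not in MLSUM_SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {MLSUM_SPLITS}")
    return config, split


def load_mlsum(config: str, split: str):
    """Load one split of one config (hypothetical helper; downloads data)."""
    config, split = check_request(config, split)
    # Imported lazily so that check_request works without the package installed.
    from datasets import load_dataset
    return load_dataset("reciTAL/mlsum", config, split=split)
```

Note that the Turkish configuration is named `tu` in this card, so a call such as `load_mlsum("tu", "validation")` would be the expected form, while `"tr"` is rejected by the check.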

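Split sizes

|Language|Train examples|Validation examples|Test examples|
|--------|-------------:|------------------:|------------:|
|de      |        220887|              11394|        10701|
|es      |        266367|              10358|        13920|
|fr      |        392902|              16059|        15828|
|ru      |         25556|                750|          757|
|tu      |        249277|              11565|        12775|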
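The split counts above can be aggregated per configuration and per split with a few lines of Python. This is only an illustrative sketch: the numbers are copied from the table, and the names `SPLIT_SIZES`, `totals_by_config`, and `totals_by_split` are made up for this example.

```python
# Example counts copied from the split table in this card.
SPLIT_SIZES = {
    "de": {"train": 220887, "validation": 11394, "test": 10701},
    "es": {"train": 266367, "validation": 10358, "test": 13920},
    "fr": {"train": 392902, "validation": 16059, "test": 15828},
    "ru": {"train": 25556, "validation": 750, "test": 757},
    "tu": {"train": 249277, "validation": 11565, "test": 12775},
}


def totals_by_config(sizes: dict) -> dict:
    """Sum train/validation/test counts for each language configuration."""
    return {config: sum(splits.values()) for config, splits in sizes.items()}


def totals_by_split(sizes: dict) -> dict:
    """Sum each split's count across all language configurations."""
    totals: dict = {}
    for splits in sizes.values():
        for name, count in splits.items():
            totals[name] = totals.get(name, 0) + count
    return totals


if __name__ == "__main__":
    print(totals_by_config(SPLIT_SIZES))
    print(totals_by_split(SPLIT_SIZES))
```

According to the table, the five configurations together hold 1,259,096 examples, of which 1,154,989 are in the training splits.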

Data fields

All language configurations share the same data fields:

  • text: string
  • summary: string
  • topic: string
  • url: string
  • title: string
  • date: string
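Since every record is described as six string fields, a small check can confirm that a loaded example has the expected shape. The sketch below is an assumption-laden illustration: `validate_record` is a hypothetical helper, and the sample record is the placeholder example shown in the dataset card, not real data.

```python
# Field names as listed in this card; every field is a string.
FIELDS = ("text", "summary", "topic", "url", "title", "date")


def validate_record(record: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the record conforms."""
    problems = []
    for name in FIELDS:
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], str):
            problems.append(f"{name} is {type(record[name]).__name__}, expected str")
    # Report any keys that the card does not list.
    for name in sorted(set(record) - set(FIELDS)):
        problems.append(f"unexpected field: {name}")
    return problems


# Placeholder example record as shown in the dataset card.
sample = {
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com",
}

if __name__ == "__main__":
    print(validate_record(sample))
```

Returning a list of problems rather than raising on the first one makes it easier to log every issue when scanning many records.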

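Dataset creation

Dataset summary

MLSUM is a large-scale multilingual summarization dataset containing more than 1.5 million article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with the English CNN/Daily Mail dataset, it forms a large-scale multilingual dataset that opens new research directions for text summarization.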

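Dataset size

  • Download size: 1.83 GB
  • Generated dataset size: 4.86 GB
  • Total disk usage: 6.69 GB

Dataset configurations

  • de: download size 346.58 MB, generated dataset size 940.93 MB, total disk usage 1.29 GB
  • es: download size 513.31 MB, generated dataset size 1.34 GB, total disk usage 1.85 GB
  • fr: download size 619.99 MB, generated dataset size 1.61 GB, total disk usage 2.23 GB
  • ru: download size 106.22 MB, generated dataset size 276.17 MB, total disk usage 382.39 MB
  • tu: download size 247.50 MB, generated dataset size 694.99 MB, total disk usage 942.48 MB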
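As a rough consistency check, the per-configuration sizes can be summed and compared with the overall figures. The sketch below assumes decimal units (1 GB = 1000 MB), which the card does not state, and it uses the already rounded GB values for es and fr, so the result is approximate. The names `SIZES_MB` and `total_gb` are hypothetical.

```python
# (download, generated, total disk) in MB per config, copied from the card.
# Values given in GB are converted assuming 1 GB = 1000 MB (an assumption).
SIZES_MB = {
    "de": (346.58, 940.93, 1290.0),
    "es": (513.31, 1340.0, 1850.0),
    "fr": (619.99, 1610.0, 2230.0),
    "ru": (106.22, 276.17, 382.39),
    "tu": (247.50, 694.99, 942.48),
}


def total_gb(sizes: dict, column: int) -> float:
    """Sum one column over all configs and return it in GB, rounded to 2 places."""
    return round(sum(row[column] for row in sizes.values()) / 1000, 2)


if __name__ == "__main__":
    print("download:", total_gb(SIZES_MB, 0), "GB")
    print("generated:", total_gb(SIZES_MB, 1), "GB")
    print("disk:", total_gb(SIZES_MB, 2), "GB")
```

Under these assumptions the sums come to 1.83 GB, 4.86 GB, and 6.69 GB, matching the overall figures listed above.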

Citation information

@article{scialom2020mlsum,
  title={MLSUM: The Multilingual Summarization Corpus},
  author={Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
  journal={arXiv preprint arXiv:2004.14900},
  year={2020}
}
