reciTAL/mlsum

Name: reciTAL/mlsum
Creator: reciTAL
Published: 2024-01-18 11:09:09
License: No description available

Hugging Face · Updated 2024-01-18 · Indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/reciTAL/mlsum

Resource description:

---
annotations_creators:
- found
language_creators:
- found
language:
- de
- es
- fr
- ru
- tr
license:
- other
multilinguality:
- multilingual
size_categories:
- 100K<n<1M
- 10K<n<100K
source_datasets:
- extended|cnn_dailymail
- original
task_categories:
- summarization
- translation
- text-classification
task_ids:
- news-articles-summarization
- multi-class-classification
- multi-label-classification
- topic-classification
paperswithcode_id: mlsum
pretty_name: MLSUM
dataset_info:
- config_name: de
  features:
  - name: text
    dtype: string
  - name: summary
    dtype: string
  - name: topic
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: date
    dtype: string
  splits:
  - name: train
    num_bytes: 846959840
    num_examples: 220887
  - name: validation
    num_bytes: 47119541
    num_examples: 11394
  - name: test
    num_bytes: 46847612
    num_examples: 10701
  download_size: 1005814154
  dataset_size: 940926993
- config_name: es
  features:
  - name: text
    dtype: string
  - name: summary
    dtype: string
  - name: topic
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: date
    dtype: string
  splits:
  - name: train
    num_bytes: 1214558302
    num_examples: 266367
  - name: validation
    num_bytes: 50643400
    num_examples: 10358
  - name: test
    num_bytes: 71263665
    num_examples: 13920
  download_size: 1456211154
  dataset_size: 1336465367
- config_name: fr
  features:
  - name: text
    dtype: string
  - name: summary
    dtype: string
  - name: topic
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: date
    dtype: string
  splits:
  - name: train
    num_bytes: 1471965014
    num_examples: 392902
  - name: validation
    num_bytes: 70413212
    num_examples: 16059
  - name: test
    num_bytes: 69660288
    num_examples: 15828
  download_size: 1849565564
  dataset_size: 1612038514
- config_name: ru
  features:
  - name: text
    dtype: string
  - name: summary
    dtype: string
  - name: topic
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: date
    dtype: string
  splits:
  - name: train
    num_bytes: 257389497
    num_examples: 25556
  - name: validation
    num_bytes: 9128497
    num_examples: 750
  - name: test
    num_bytes: 9656398
    num_examples: 757
  download_size: 766226107
  dataset_size: 276174392
- config_name: tu
  features:
  - name: text
    dtype: string
  - name: summary
    dtype: string
  - name: topic
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: date
    dtype: string
  splits:
  - name: train
    num_bytes: 641622783
    num_examples: 249277
  - name: validation
    num_bytes: 25530661
    num_examples: 11565
  - name: test
    num_bytes: 27830212
    num_examples: 12775
  download_size: 942308960
  dataset_size: 694983656
config_names:
- de
- es
- fr
- ru
- tu
---

# Dataset Card for MLSUM

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** []()
- **Repository:** https://github.com/recitalAI/MLSUM
- **Paper:** https://www.aclweb.org/anthology/2020.emnlp-main.647/
- **Point of Contact:** [email](mailto:thomas@recital.ai)
- **Size of downloaded dataset files:** 1.83 GB
- **Size of the generated dataset:** 4.86 GB
- **Total amount of disk used:** 6.69 GB

### Dataset Summary

We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, French, German, Spanish, Russian, Turkish. Together with English newspapers from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### de

- **Size of downloaded dataset files:** 346.58 MB
- **Size of the generated dataset:** 940.93 MB
- **Total amount of disk used:** 1.29 GB

An example of 'validation' looks as follows.

```
{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}
```

#### es

- **Size of downloaded dataset files:** 513.31 MB
- **Size of the generated dataset:** 1.34 GB
- **Total amount of disk used:** 1.85 GB

An example of 'validation' looks as follows.

```
{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}
```

#### fr

- **Size of downloaded dataset files:** 619.99 MB
- **Size of the generated dataset:** 1.61 GB
- **Total amount of disk used:** 2.23 GB

An example of 'validation' looks as follows.
```
{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}
```

#### ru

- **Size of downloaded dataset files:** 106.22 MB
- **Size of the generated dataset:** 276.17 MB
- **Total amount of disk used:** 382.39 MB

An example of 'train' looks as follows.

```
{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}
```

#### tu

- **Size of downloaded dataset files:** 247.50 MB
- **Size of the generated dataset:** 694.99 MB
- **Total amount of disk used:** 942.48 MB

An example of 'train' looks as follows.

```
{
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com"
}
```

### Data Fields

The data fields are the same among all splits.

#### de

- `text`: a `string` feature.
- `summary`: a `string` feature.
- `topic`: a `string` feature.
- `url`: a `string` feature.
- `title`: a `string` feature.
- `date`: a `string` feature.

#### es

- `text`: a `string` feature.
- `summary`: a `string` feature.
- `topic`: a `string` feature.
- `url`: a `string` feature.
- `title`: a `string` feature.
- `date`: a `string` feature.

#### fr

- `text`: a `string` feature.
- `summary`: a `string` feature.
- `topic`: a `string` feature.
- `url`: a `string` feature.
- `title`: a `string` feature.
- `date`: a `string` feature.

#### ru

- `text`: a `string` feature.
- `summary`: a `string` feature.
- `topic`: a `string` feature.
- `url`: a `string` feature.
- `title`: a `string` feature.
- `date`: a `string` feature.

#### tu

- `text`: a `string` feature.
- `summary`: a `string` feature.
- `topic`: a `string` feature.
- `url`: a `string` feature.
- `title`: a `string` feature.
- `date`: a `string` feature.
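The field list above amounts to a flat schema of six string fields. The following is a minimal sketch of a record check under that assumption; the names `MLSUM_FIELDS` and `validate_record` are illustrative and not part of the dataset or of any library.

```python
# Field names taken from the Data Fields section; every field is a string.
MLSUM_FIELDS = ("text", "summary", "topic", "url", "title", "date")


def validate_record(record: dict) -> list:
    """Return a list of problems found in one MLSUM-style record (empty if none)."""
    problems = []
    for field in MLSUM_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"field {field!r} is not a string")
    # Report keys that the card does not list.
    extra = sorted(set(record) - set(MLSUM_FIELDS))
    problems.extend(f"unexpected field: {name}" for name in extra)
    return problems


# The placeholder instance shown in the card.
sample = {
    "date": "01/01/2001",
    "summary": "A text",
    "text": "This is a text",
    "title": "A sample",
    "topic": "football",
    "url": "https://www.google.com",
}

print(validate_record(sample))  # []
```

Because `date` is stored as a plain string and the card does not state its format, the sketch deliberately checks only the type and does not try to parse it.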
### Data Splits

| name | train  | validation | test  |
|------|-------:|-----------:|------:|
| de   | 220887 |      11394 | 10701 |
| es   | 266367 |      10358 | 13920 |
| fr   | 392902 |      16059 | 15828 |
| ru   |  25556 |        750 |   757 |
| tu   | 249277 |      11565 | 12775 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

Usage of dataset is restricted to non-commercial research purposes only. Copyright belongs to the original copyright holders. See https://github.com/recitalAI/MLSUM#mlsum

### Citation Information

```
@article{scialom2020mlsum,
  title={MLSUM: The Multilingual Summarization Corpus},
  author={Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
  journal={arXiv preprint arXiv:2004.14900},
  year={2020}
}
```

### Contributions

Thanks to [@RachelKer](https://github.com/RachelKer), [@albertvillanova](https://github.com/albertvillanova), [@thomwolf](https://github.com/thomwolf) for adding this dataset.

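Provider:

reciTAL

Summary of original information

Dataset overview

Basic information

Dataset name: MLSUM
Languages: German (de), Spanish (es), French (fr), Russian (ru), Turkish (tr; the configuration is named tu)
License: other (non-commercial research use only)
Multilinguality: multilingual
Size categories: 100K<n<1M, 10K<n<100K
Source datasets: extended from cnn_dailymail; original
Task categories: summarization, translation, text classification
Task IDs: news article summarization, multi-class classification, multi-label classification, topic classification
Papers with Code ID: mlsum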

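Dataset structure

Data instances

Each language configuration contains the following fields:

text: article body
summary: article summary
topic: article topic
url: article URL
title: article title
date: article date

Data splits

Each language configuration contains the following splits:

train: training set
validation: validation set
test: test set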
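
As a rough sketch of how a configuration and split might be requested by name, the helper below wraps `load_dataset` from the Hugging Face `datasets` library. It assumes that `datasets` is installed, that the Hub is reachable, and that your installed version can load this repository; none of that is guaranteed by the card. The helper name `load_mlsum` is hypothetical.

```python
# Configuration and split names as listed in this card.
CONFIGS = ("de", "es", "fr", "ru", "tu")
SPLITS = ("train", "validation", "test")


def load_mlsum(config: str, split: str):
    """Load one MLSUM split via the Hugging Face `datasets` library.

    The import is deferred so that the name checks below work even when
    `datasets` is not installed.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    from datasets import load_dataset  # third-party dependency (assumed installed)

    return load_dataset("reciTAL/mlsum", config, split=split)
```

For example, `load_mlsum("ru", "validation")` would request the smallest split listed here. Note that the Turkish configuration is requested as `"tu"`, so `load_mlsum("tr", "train")` is rejected by the name check before any download starts.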

Detailed split sizes

Language	Train examples	Validation examples	Test examples
de	220887	11394	10701
es	266367	10358	13920
fr	392902	16059	15828
ru	25556	750	757
tu	249277	11565	12775
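
The counts in this table can be totalled per language and per split with a few lines of arithmetic. The sketch below only re-adds the numbers listed above; the variable names are illustrative.

```python
# Example counts copied from the split table: (train, validation, test).
SPLIT_COUNTS = {
    "de": (220887, 11394, 10701),
    "es": (266367, 10358, 13920),
    "fr": (392902, 16059, 15828),
    "ru": (25556, 750, 757),
    "tu": (249277, 11565, 12775),
}

# Sum across splits for each language, and across languages for each split.
per_language = {lang: sum(counts) for lang, counts in SPLIT_COUNTS.items()}
per_split = [sum(column) for column in zip(*SPLIT_COUNTS.values())]
total = sum(per_language.values())

print(per_language)
print(per_split)  # [1154989, 50126, 53981]
print(total)      # 1259096
```

The listed splits add up to 1,259,096 pairs, of which Russian contributes 27,063 (about 2%). This is lower than the 1.5M+ pairs quoted in the dataset summary; the card does not say what accounts for the difference.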

Data fields

All language configurations share the same fields:

text: string
summary: string
topic: string
url: string
title: string
date: string

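Dataset creation

Dataset summary

MLSUM is a large-scale multilingual summarization dataset containing more than 1.5 million article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with the English CNN/Daily Mail dataset, it forms a large-scale multilingual dataset that opens new research directions for text summarization.

Dataset size

Download size: 1.83 GB
Generated dataset size: 4.86 GB
Total disk usage: 6.69 GB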

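Dataset configurations

de: download size 346.58 MB, generated dataset size 940.93 MB, total disk usage 1.29 GB
es: download size 513.31 MB, generated dataset size 1.34 GB, total disk usage 1.85 GB
fr: download size 619.99 MB, generated dataset size 1.61 GB, total disk usage 2.23 GB
ru: download size 106.22 MB, generated dataset size 276.17 MB, total disk usage 382.39 MB
tu: download size 247.50 MB, generated dataset size 694.99 MB, total disk usage 942.48 MB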
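
The per-configuration figures can be checked against the overall totals. The sketch below assumes decimal units (1 GB = 1000 MB = 10^9 bytes), which is consistent with the `de` configuration (940926993 bytes listed as 940.93 MB). Download sizes come from the list above; generated sizes use the exact `dataset_size` byte counts from the card's metadata, because the rounded GB values for `es` and `fr` would skew the sum.

```python
# Download sizes in MB, as listed per configuration.
DOWNLOAD_MB = {"de": 346.58, "es": 513.31, "fr": 619.99, "ru": 106.22, "tu": 247.50}

# Generated sizes in bytes (`dataset_size` in the card's metadata).
GENERATED_BYTES = {
    "de": 940926993,
    "es": 1336465367,
    "fr": 1612038514,
    "ru": 276174392,
    "tu": 694983656,
}

# Assumption: decimal units, 1 GB = 1000 MB = 1e9 bytes.
download_gb = sum(DOWNLOAD_MB.values()) / 1000
generated_gb = sum(GENERATED_BYTES.values()) / 1e9
total_gb = download_gb + generated_gb

print(f"download:  {download_gb:.2f} GB")   # 1.83 GB
print(f"generated: {generated_gb:.2f} GB")  # 4.86 GB
print(f"total:     {total_gb:.2f} GB")      # 6.69 GB
```

Under that assumption, the sums reproduce the overall figures of 1.83 GB, 4.86 GB, and 6.69 GB.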

Citation information

@article{scialom2020mlsum, title={MLSUM: The Multilingual Summarization Corpus}, author={Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo}, journal={arXiv preprint arXiv:2004.14900}, year={2020} }
