ml6team/cnn_dailymail_nl

Name: ml6team/cnn_dailymail_nl
Creator: ml6team
Published: 2022-10-22 14:03:06
License: No description available

Hugging Face, updated 2022-10-22, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/ml6team/cnn_dailymail_nl

Resource description:

---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- nl
license:
- mit
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- https://github.com/huggingface/datasets/tree/master/datasets/cnn_dailymail
task_categories:
- conditional-text-generation
task_ids:
- summarization
---

# Dataset Card for Dutch CNN Dailymail Dataset

## Dataset Description

- **Repository:** [CNN / DailyMail Dataset NL repository](https://huggingface.co/datasets/ml6team/cnn_dailymail_nl)

### Dataset Summary

The Dutch CNN / DailyMail Dataset is a machine-translated version of the English CNN / Dailymail dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail.

Most information about the dataset can be found on the [HuggingFace page](https://huggingface.co/datasets/cnn_dailymail) of the original English version.

These are the basic steps used to create this dataset (plus some chunking):

```
from datasets import load_dataset

load_dataset("cnn_dailymail", '3.0.0')
```

And this is the HuggingFace translation pipeline:

```
from transformers import pipeline

pipeline(
    task='translation_en_to_nl',
    model='Helsinki-NLP/opus-mt-en-nl',
    tokenizer='Helsinki-NLP/opus-mt-en-nl')
```

### Data Fields

- `id`: a string containing the hexadecimal-formatted SHA1 hash of the URL where the story was retrieved from
- `article`: a string containing the body of the news article
- `highlights`: a string containing the highlight of the article as written by the article author

### Data Splits

The Dutch CNN/DailyMail dataset follows the same splits as the original English version and has 3 splits: _train_, _validation_, and _test_.

| Dataset Split | Number of Instances in Split |
| ------------- | ---------------------------- |
| Train         | 287,113                      |
| Validation    | 13,368                       |
| Test          | 11,490                       |
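The creation steps in the card mention "some chunking" without saying how it was done. The sketch below shows one plausible approach, not the authors' actual method: whole sentences are packed greedily into chunks under a character budget, each chunk is translated, and the results are joined. The names `chunk_text`, `translate_long_text`, and the `max_chars` default are hypothetical, and the translator is passed in as a plain callable so the example runs without downloading a model.

```python
import re
from typing import Callable, List


def chunk_text(text: str, max_chars: int = 400) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Assumption: sentence boundaries are approximated by '.', '!' or '?'
    followed by whitespace. A single sentence longer than max_chars is
    kept as its own (oversized) chunk rather than being cut mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_long_text(
    text: str, translate: Callable[[str], str], max_chars: int = 400
) -> str:
    """Translate each chunk separately and join the results with spaces."""
    return " ".join(translate(chunk) for chunk in chunk_text(text, max_chars))


# Stand-in translator for demonstration only (upper-cases instead of translating).
print(translate_long_text("One. Two! Three? Four.", str.upper, max_chars=10))
```

With the translation pipeline from the card bound to a variable such as `translator`, the callable could be something like `lambda s: translator(s)[0]["translation_text"]`. The chunk size would need tuning to the model's input limit.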

Provider:

ml6team

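Summary of original information

Dutch CNN/DailyMail Dataset

Dataset description

Dataset overview

The Dutch CNN/DailyMail dataset is a machine-translated version of the English CNN/Dailymail dataset and contains just over 300k unique news articles written by journalists at CNN and the Daily Mail.

Data fields

id: a string containing the hexadecimal-formatted SHA1 hash of the URL the story was retrieved from
article: a string containing the body of the news article
highlights: a string containing the highlight of the article as written by the article author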
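As a small illustration of the `id` field, the following sketch computes a hexadecimal SHA1 digest from a URL string. The function name `url_to_id` and the example URL are hypothetical, and UTF-8 encoding of the URL is an assumption; the source only states that the id is the hex-formatted SHA1 hash of the story URL.

```python
import hashlib


def url_to_id(url: str) -> str:
    """Return the hexadecimal SHA1 digest of a URL (assumes UTF-8 encoding)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


# Hypothetical URL, used only to show the shape of the result.
example_id = url_to_id("https://example.com/story.html")
print(len(example_id))  # 40 hexadecimal characters
```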

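Data splits

The Dutch CNN/DailyMail dataset follows the same splits as the original English version and has three splits: train, validation, and test.

| Dataset Split | Number of Instances in Split |
| ------------- | ---------------------------- |
| Train         | 287,113                      |
| Validation    | 13,368                       |
| Test          | 11,490                       |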
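As a quick check on the split table, the sizes can be summed to confirm the "just over 300k" figure and to see how the instances are distributed. This is a minimal sketch using only the numbers listed above; the variable names are illustrative.

```python
# Split sizes as listed in the table above.
split_sizes = {"train": 287_113, "validation": 13_368, "test": 11_490}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

print(total)  # 311971
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The train split accounts for roughly 92% of all instances, with the remainder divided between validation and test.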
