
ml6team/cnn_dailymail_nl

Hugging Face | Updated 2022-10-22 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/ml6team/cnn_dailymail_nl
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- nl
license:
- mit
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- https://github.com/huggingface/datasets/tree/master/datasets/cnn_dailymail
task_categories:
- conditional-text-generation
task_ids:
- summarization
---

# Dataset Card for Dutch CNN Dailymail Dataset

## Dataset Description

- **Repository:** [CNN / DailyMail Dataset NL repository](https://huggingface.co/datasets/ml6team/cnn_dailymail_nl)

### Dataset Summary

The Dutch CNN / DailyMail Dataset is a machine-translated version of the English CNN / DailyMail dataset. It contains just over 300k unique news articles written by journalists at CNN and the Daily Mail.

Most information about the dataset can be found on the [HuggingFace page](https://huggingface.co/datasets/cnn_dailymail) of the original English version.

These are the basic steps used to create this dataset (plus some chunking):

```
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", '3.0.0')
```

And this is the HuggingFace translation pipeline:

```
from transformers import pipeline

translator = pipeline(
    task='translation_en_to_nl',
    model='Helsinki-NLP/opus-mt-en-nl',
    tokenizer='Helsinki-NLP/opus-mt-en-nl')
```

### Data Fields

- `id`: a string containing the hexadecimal-formatted SHA1 hash of the URL the story was retrieved from
- `article`: a string containing the body of the news article
- `highlights`: a string containing the highlights of the article as written by the article author

### Data Splits

The Dutch CNN/DailyMail dataset follows the same splits as the original English version and has 3 splits: _train_, _validation_, and _test_.

| Dataset Split | Number of Instances in Split |
| ------------- | ---------------------------- |
| Train         | 287,113                      |
| Validation    | 13,368                       |
| Test          | 11,490                       |
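The card mentions "some chunking" in the creation steps but does not say how it was done. The following is only a minimal sketch of one way long articles could be split before translation, assuming a simple greedy word-packing strategy; the helper names `chunk_text` and `translate_in_chunks` and the `max_chars` limit are hypothetical and not taken from the original dataset code.

```python
from typing import Callable, List


def chunk_text(text: str, max_chars: int = 400) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    Hypothetical helper: the real chunking used for the dataset is not documented.
    A single word longer than max_chars is kept as its own oversized chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def translate_in_chunks(
    text: str, translate: Callable[[str], str], max_chars: int = 400
) -> str:
    """Translate each chunk separately and join the results with spaces."""
    return " ".join(translate(chunk) for chunk in chunk_text(text, max_chars))


# Stand-in translator so the sketch runs without downloading a model.
print(translate_in_chunks("a b c d", str.upper, max_chars=3))
```

With the translation pipeline from the card, `translate` could plausibly be a small wrapper that passes one chunk to the pipeline and returns the `translation_text` field of its first result.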
Provider:
ml6team
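Summary of original information

Dutch CNN/DailyMail Dataset

Dataset Description

Dataset Summary

The Dutch CNN/DailyMail Dataset is a machine-translated version of the English CNN/DailyMail dataset. It contains more than 300,000 unique news articles written by journalists at CNN and the Daily Mail.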

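Data Fields

  • id: a string containing the hexadecimal-formatted SHA1 hash of the URL the story was retrieved from
  • article: a string containing the body of the news article
  • highlights: a string containing the highlights of the article as written by the article author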
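
As an illustration of the `id` field, a hexadecimal SHA1 digest of a URL can be computed with Python's standard library. This is a sketch only: the UTF-8 encoding, the absence of any URL normalization, and the example URL are assumptions, so the result is not guaranteed to match ids in the dataset.

```python
import hashlib


def url_to_id(url: str) -> str:
    """Return the 40-character hexadecimal SHA1 digest of a URL string.

    Assumption: the URL is hashed as UTF-8 bytes with no normalization.
    """
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


# Hypothetical URL, used only to show the shape of the result.
example_id = url_to_id("https://example.com/story")
print(len(example_id))  # a SHA1 hex digest is always 40 characters long
```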

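Data Splits

The Dutch CNN/DailyMail dataset follows the same splits as the original English version and has three splits: train, validation, and test.

| Dataset Split | Number of Instances in Split |
| ------------- | ---------------------------- |
| Train         | 287,113                      |
| Validation    | 13,368                       |
| Test          | 11,490                       |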
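
The split sizes listed above can be sanity-checked with a few lines of arithmetic. The sketch below hard-codes the counts from the table (the dictionary name `SPLIT_SIZES` is just an illustrative choice) and derives the total and each split's share.

```python
# Instance counts copied from the splits table.
SPLIT_SIZES = {"train": 287_113, "validation": 13_368, "test": 11_490}

# 287,113 + 13,368 + 11,490 = 311,971 instances across all splits.
total = sum(SPLIT_SIZES.values())

# Fraction of all instances that falls into each split.
shares = {name: count / total for name, count in SPLIT_SIZES.items()}

print(f"total instances: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

The total of 311,971 is consistent with the summary's statement that the dataset contains just over 300k articles.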