ml6team/cnn_dailymail_nl
Source: Hugging Face. Updated: 2022-10-22. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ml6team/cnn_dailymail_nl
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- nl
license:
- mit
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- https://github.com/huggingface/datasets/tree/master/datasets/cnn_dailymail
task_categories:
- conditional-text-generation
task_ids:
- summarization
---
# Dataset Card for Dutch CNN Dailymail Dataset
## Dataset Description
- **Repository:** [CNN / DailyMail Dataset NL repository](https://huggingface.co/datasets/ml6team/cnn_dailymail_nl)
### Dataset Summary
The Dutch CNN / DailyMail Dataset is a machine-translated version of the English CNN / DailyMail dataset. It contains just over 300k unique news articles written by journalists at CNN and the Daily Mail.
Most information about the dataset can be found on the [HuggingFace page](https://huggingface.co/datasets/cnn_dailymail) of the original English version.
The dataset was created in two basic steps, plus some chunking of the text. First, the English dataset was loaded:
```
from datasets import load_dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")
```
And this is the HuggingFace translation pipeline:
```
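from transformers import pipeline

# English-to-Dutch translation with the Helsinki-NLP OPUS-MT model
translator = pipeline(
    task='translation_en_to_nl',
    model='Helsinki-NLP/opus-mt-en-nl',
    tokenizer='Helsinki-NLP/opus-mt-en-nl')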
```
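The card mentions "some chunking" without describing it. The sketch below shows one plausible approach, not the authors' actual method: split an article on sentence boundaries, pack the sentences into chunks under a size limit, translate each chunk, and join the results. The function names, the `max_chars` limit, and the sentence-splitting regex are all assumptions made for illustration.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 400) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk
    rather than being cut mid-sentence.
    """
    # Hypothetical splitting rule: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_long_text(
    text: str, translate: Callable[[str], str], max_chars: int = 400
) -> str:
    """Translate text chunk by chunk and join the translated chunks with spaces."""
    return " ".join(translate(chunk) for chunk in split_into_chunks(text, max_chars))
```

With the pipeline above, `translate` could be a small wrapper such as `lambda s: translator(s)[0]["translation_text"]`, since a translation pipeline returns a list of dictionaries with a `translation_text` key. Any callable from string to string works, which keeps the chunking logic testable without downloading a model.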
### Data Fields
- `id`: a string containing the hexadecimal-formatted SHA1 hash of the URL from which the story was retrieved
- `article`: a string containing the body of the news article
- `highlights`: a string containing the highlight of the article as written by the article author
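To make the `id` field concrete, here is a minimal sketch of computing a hexadecimal SHA1 digest of a URL with Python's standard library. The function name `story_id`, the example URL, and the UTF-8 encoding of the URL are assumptions; the card does not state how the URL string was encoded before hashing.

```python
import hashlib


def story_id(url: str) -> str:
    """Return the hexadecimal SHA1 digest of a URL (UTF-8 encoding assumed)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


# Hypothetical URL, used only to show the shape of the result.
example_id = story_id("https://example.com/2015/01/01/some-story/index.html")
print(len(example_id))  # a SHA1 hex digest is always 40 characters long
```

Because SHA1 is deterministic, the same URL always maps to the same `id`, which is why the field can serve as a stable key across the English and Dutch versions.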
### Data Splits
The Dutch CNN/DailyMail dataset follows the same splits as the original English version and has 3 splits: _train_, _validation_, and _test_.
| Dataset Split | Number of Instances in Split |
| ------------- | ------------------------------------------- |
| Train | 287,113 |
| Validation | 13,368 |
| Test | 11,490 |
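As a quick check on the table above, the following sketch sums the three split sizes and prints each split's share of the total. The numbers are copied from the table; the variable names are illustrative.

```python
# Split sizes as listed in the table above.
split_sizes = {"train": 287_113, "validation": 13_368, "test": 11_490}

total = sum(split_sizes.values())  # 311,971 instances in total

# Print each split with its size and its share of the whole dataset.
for name, size in split_sizes.items():
    print(f"{name:<10} {size:>7,} {size / total:6.1%}")
print(f"{'total':<10} {total:>7,}")
```

The total of 311,971 agrees with the "just over 300k" figure given in the dataset summary.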
Provider:
ml6team