Hamza-Ziyard/CNN-Daily-Mail-Sinhala
Source: Hugging Face. Updated 2023-04-30; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Hamza-Ziyard/CNN-Daily-Mail-Sinhala
Description:
---
task_categories:
- summarization
language:
- si
- en
tags:
- sinhala-summarization
- absractive
- extractive
size_categories:
- 1K<n<10K
---
### Dataset Summary
This dataset card describes a new dataset for Sinhala news summarization tasks. It was generated from the [CNN / Daily Mail dataset](https://huggingface.co/datasets/cnn_dailymail) using Google Translate.
### Data Instances
Each instance contains a string for the `id`, strings for the English `article` and `highlights`, and strings for their Sinhala translations, `article_sinhala` and `summary_sinhala`. See the [CNN / Daily Mail dataset viewer](https://huggingface.co/datasets/viewer/?dataset=cnn_dailymail&config=3.0.0) to explore more examples of the English source data.
```
{'id': '0054d6d30dbcad772e20b22771153a2a9cbeaf62',
'article': "(CNN) -- An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill, according to the state-run Brazilian news agency, Agencia Brasil. The American tourist died aboard the MS Veendam, owned by cruise operator Holland America. Federal Police told Agencia Brasil that forensic doctors were investigating her death. The ship's doctors told police that the woman was elderly and suffered from diabetes and hypertension, according the agency. The other passengers came down with diarrhea prior to her death during an earlier part of the trip, the ship's doctors said. The Veendam left New York 36 days ago for a South America tour.",
'highlights': "The elderly woman suffered from diabetes and hypertension, ship's doctors say .\nPreviously, 86 passengers had fallen ill on the ship, Agencia Brasil says .",
'article_sinhala':'(CNN) -- බ්රසීලයේ රාජ්ය ප්රවෘත්ති ඒජන්සිය වන ඒජන්සියා බ්රසීල්ට අනුව, මීට පෙර මගීන් 86 දෙනෙකු රෝගාතුර වූ එම නෞකාවම, අඟහරුවාදා රියෝ ද ජැනයිරෝ හි නැංගුරම් ලා තිබූ නෞකාවක සිටි ඇමරිකානු කාන්තාවක් මිය ගියේය. හොලන්ඩ් ඇමරිකා කෲස් මෙහෙයුම්කරුට අයත් MS Veendam නෞකාවේදී ඇමරිකානු සංචාරකයා මිය ගියේය. ෆෙඩරල් පොලිසිය Agencia Brasil වෙත පැවසුවේ අධිකරණ වෛද්යවරුන් ඇයගේ මරණය පිළිබඳව විමර්ශනය කරන බවයි. නෞකාවේ වෛද්යවරුන් පොලිසියට පවසා ඇත්තේ එම කාන්තාව වයෝවෘද්ධ කාන්තාවක් බවත් ඇය දියවැඩියාව හා අධි රුධිර පීඩනයෙන් පෙළෙන බවත්ය. ගමනේ පෙර කොටසකදී ඇයගේ මරණයට පෙර අනෙකුත් මගීන් පාචනය වැළඳී ඇති බව නෞකාවේ වෛද්යවරු පැවසූහ. දකුණු අමෙරිකානු සංචාරයක් සඳහා වීන්ඩම් දින 36කට පෙර නිව්යෝර්ක් නුවරින් පිටත් විය.',
'summary_sinhala':'වයෝවෘද්ධ කාන්තාව දියවැඩියාව සහ අධි රුධිර පීඩනයෙන් පෙළුණු බව නෞකාවේ වෛද්යවරු පවසති.\nමීට පෙර නෞකාවේ සිටි මගීන් 86 දෙනෙකු රෝගාතුර වී ඇති බව Agencia Brasil පවසයි.'}
```
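The record shape above can be checked and turned into an input/target pair with a few lines of Python. This is a minimal sketch, assuming the five field names shown in the example instance. The names `REQUIRED_FIELDS`, `validate_instance`, `to_training_pair`, and `load_sinhala_dataset` are hypothetical helpers introduced for illustration. Loading through the `datasets` library with the repository ID `Hamza-Ziyard/CNN-Daily-Mail-Sinhala` is also an assumption and has not been verified against the hosted files.

```python
from typing import Dict, Tuple

# Field names taken from the example instance above.
REQUIRED_FIELDS = ("id", "article", "highlights", "article_sinhala", "summary_sinhala")


def validate_instance(record: Dict[str, object]) -> bool:
    """Return True if every expected field is present and holds a string."""
    return all(isinstance(record.get(name), str) for name in REQUIRED_FIELDS)


def to_training_pair(record: Dict[str, str]) -> Tuple[str, str]:
    """Pair the Sinhala article (model input) with the Sinhala summary (target)."""
    return record["article_sinhala"], record["summary_sinhala"]


def load_sinhala_dataset():
    """Assumed loading path; needs the `datasets` package and network access."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset("Hamza-Ziyard/CNN-Daily-Mail-Sinhala")


if __name__ == "__main__":
    # Shortened placeholder values, not real dataset content.
    sample = {
        "id": "0054d6d30dbcad772e20b22771153a2a9cbeaf62",
        "article": "(CNN) -- An American woman died aboard a cruise ship ...",
        "highlights": "The elderly woman suffered from diabetes and hypertension ...",
        "article_sinhala": "<Sinhala article text>",
        "summary_sinhala": "<Sinhala summary text>",
    }
    print(validate_instance(sample))
    print(to_training_pair(sample))
```

Pairing `article_sinhala` with `summary_sinhala` reflects the Sinhala summarization use case; the English `article` and `highlights` fields remain available if a cross-lingual setup is preferred.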
### Data Splits
The dataset has 3 splits: _train_, _validation_, and _test_. Below are the statistics for the dataset.
| Dataset Split | Number of Instances in Split |
| ------------- | ------------------------------------------- |
| Train | 6000 |
| Validation | 2000 |
| Test | 2000 |
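As a quick check on the table, the share of each split can be computed from the listed counts. This is a small sketch; `SPLIT_SIZES` and `split_fractions` are hypothetical names, and the counts are copied from the table rather than read from the hosted files.

```python
from typing import Dict

# Split sizes copied from the table above.
SPLIT_SIZES: Dict[str, int] = {"train": 6000, "validation": 2000, "test": 2000}


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of instances."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    for name, share in split_fractions(SPLIT_SIZES).items():
        print(f"{name}: {SPLIT_SIZES[name]} instances ({share:.0%})")
```

With these counts the splits follow a 60/20/20 ratio over 10,000 instances in total.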
### Social Impact of Dataset
The purpose of this dataset is to help Sri Lankan NLP developers build models that can summarize long paragraphs of text in one or two sentences.
### Licensing Information
The CNN / Daily Mail dataset version 1.0.0 is released under the [Apache-2.0 License](http://www.apache.org/licenses/LICENSE-2.0).
### Citation Information
```
@inproceedings{see-etal-2017-get,
title = "Get To The Point: Summarization with Pointer-Generator Networks",
author = "See, Abigail and
Liu, Peter J. and
Manning, Christopher D.",
booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2017",
address = "Vancouver, Canada",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/P17-1099",
doi = "10.18653/v1/P17-1099",
pages = "1073--1083",
abstract = "Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition. We apply our model to the CNN / Daily Mail summarization task, outperforming the current abstractive state-of-the-art by at least 2 ROUGE points.",
}
```
```
@inproceedings{DBLP:conf/nips/HermannKGEKSB15,
author={Karl Moritz Hermann and Tomás Kociský and Edward Grefenstette and Lasse Espeholt and Will Kay and Mustafa Suleyman and Phil Blunsom},
title={Teaching Machines to Read and Comprehend},
year={2015},
cdate={1420070400000},
pages={1693-1701},
url={http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend},
booktitle={NIPS},
crossref={conf/nips/2015}
}
```
Provided by:
Hamza-Ziyard
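Summary of Original Information
Dataset Overview
- Task category: summarization
- Languages: Sinhala (si), English (en)
- Tags: Sinhala summarization, abstractive, extractive
- Dataset size: 1,000 < n < 10,000
Dataset Details
- Source: the dataset was generated from the [CNN / Daily Mail dataset](https://huggingface.co/datasets/cnn_dailymail) using Google Translate, for Sinhala news summarization tasks.
- Instance structure: each instance contains the article text, the summary text, and a unique identifier.
- Data splits: the dataset is divided into training, validation, and test sets, as follows:
  - Training set: 6,000 instances
  - Validation set: 2,000 instances
  - Test set: 2,000 instances
Intended Use
- Purpose: to help Sri Lankan NLP developers build models that can summarize long paragraphs of text in one or two sentences.
Licensing Information
- License: the CNN / Daily Mail dataset version 1.0.0 is released under the Apache-2.0 License.
Citation Information
- References:
  - See, Abigail, Liu, Peter J., and Manning, Christopher D. "Get To The Point: Summarization with Pointer-Generator Networks." Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017.
  - Hermann, Karl Moritz, et al. "Teaching Machines to Read and Comprehend." NIPS, 2015.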



