Hamza-Ziyard/CNN-Daily-Mail-Sinhala

Source: Hugging Face. Updated 2023-04-30. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Hamza-Ziyard/CNN-Daily-Mail-Sinhala
Resource description:
---
task_categories:
- summarization
language:
- si
- en
tags:
- sinhala-summarization
- absractive
- extractive
size_categories:
- 1K<n<10K
---

### Dataset Summary

This dataset is intended for Sinhala news summarization tasks. It was generated from [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) using Google Translate.

### Data Instances

For each instance, there is a string for the article, a string for the highlights, and a string for the id. The sample below also carries the Sinhala translations of the article and of the highlights in the `article_sinhala` and `summary_sinhala` fields. See the [CNN / Daily Mail dataset viewer](https://huggingface.co/datasets/viewer/?dataset=cnn_dailymail&config=3.0.0) to explore more examples.

```
{'id': '0054d6d30dbcad772e20b22771153a2a9cbeaf62',
 'article': "(CNN) -- An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill, according to the state-run Brazilian news agency, Agencia Brasil. The American tourist died aboard the MS Veendam, owned by cruise operator Holland America. Federal Police told Agencia Brasil that forensic doctors were investigating her death. The ship's doctors told police that the woman was elderly and suffered from diabetes and hypertension, according the agency. The other passengers came down with diarrhea prior to her death during an earlier part of the trip, the ship's doctors said. The Veendam left New York 36 days ago for a South America tour.",
 'highlights': "The elderly woman suffered from diabetes and hypertension, ship's doctors say .\nPreviously, 86 passengers had fallen ill on the ship, Agencia Brasil says .",
 'article_sinhala': '(CNN) -- බ්‍රසීලයේ රාජ්‍ය ප්‍රවෘත්ති ඒජන්සිය වන ඒජන්සියා බ්‍රසීල්ට අනුව, මීට පෙර මගීන් 86 දෙනෙකු රෝගාතුර වූ එම නෞකාවම, අඟහරුවාදා රියෝ ද ජැනයිරෝ හි නැංගුරම් ලා තිබූ නෞකාවක සිටි ඇමරිකානු කාන්තාවක් මිය ගියේය. හොලන්ඩ් ඇමරිකා කෲස් මෙහෙයුම්කරුට අයත් MS Veendam නෞකාවේදී ඇමරිකානු සංචාරකයා මිය ගියේය. ෆෙඩරල් පොලිසිය Agencia Brasil වෙත පැවසුවේ අධිකරණ වෛද්‍යවරුන් ඇයගේ මරණය පිළිබඳව විමර්ශනය කරන බවයි. නෞකාවේ වෛද්‍යවරුන් පොලිසියට පවසා ඇත්තේ එම කාන්තාව වයෝවෘද්ධ කාන්තාවක් බවත් ඇය දියවැඩියාව හා අධි රුධිර පීඩනයෙන් පෙළෙන බවත්ය. ගමනේ පෙර කොටසකදී ඇයගේ මරණයට පෙර අනෙකුත් මගීන් පාචනය වැළඳී ඇති බව නෞකාවේ වෛද්‍යවරු පැවසූහ. දකුණු අමෙරිකානු සංචාරයක් සඳහා වීන්ඩම් දින 36කට පෙර නිව්යෝර්ක් නුවරින් පිටත් විය.',
 'summary_sinhala': 'වයෝවෘද්ධ කාන්තාව දියවැඩියාව සහ අධි රුධිර පීඩනයෙන් පෙළුණු බව නෞකාවේ වෛද්‍යවරු පවසති.\nමීට පෙර නෞකාවේ සිටි මගීන් 86 දෙනෙකු රෝගාතුර වී ඇති බව Agencia Brasil පවසයි.'}
```

### Data Splits

The dataset has 3 splits: _train_, _validation_, and _test_. Below are the statistics for the dataset.

| Dataset Split | Number of Instances in Split |
| ------------- | ---------------------------- |
| Train         | 6000                         |
| Validation    | 2000                         |
| Test          | 2000                         |

### Social Impact of Dataset

The purpose of this dataset is to help Sri Lankan NLP developers build models that can summarize long paragraphs of text in one or two sentences.

### Licensing Information

The CNN / Daily Mail dataset version 1.0.0 is released under the [Apache-2.0 License](http://www.apache.org/licenses/LICENSE-2.0).

### Citation Information

```
@inproceedings{see-etal-2017-get,
    title = "Get To The Point: Summarization with Pointer-Generator Networks",
    author = "See, Abigail and Liu, Peter J. and Manning, Christopher D.",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P17-1099",
    doi = "10.18653/v1/P17-1099",
    pages = "1073--1083",
    abstract = "Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition. We apply our model to the CNN / Daily Mail summarization task, outperforming the current abstractive state-of-the-art by at least 2 ROUGE points.",
}
```

```
@inproceedings{DBLP:conf/nips/HermannKGEKSB15,
    author={Karl Moritz Hermann and Tomás Kociský and Edward Grefenstette and Lasse Espeholt and Will Kay and Mustafa Suleyman and Phil Blunsom},
    title={Teaching Machines to Read and Comprehend},
    year={2015},
    cdate={1420070400000},
    pages={1693-1701},
    url={http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend},
    booktitle={NIPS},
    crossref={conf/nips/2015}
}
```
Provider:
Hamza-Ziyard
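Summary of original information

Dataset overview

  • Task category: summarization
  • Languages: Sinhala (si), English (en)
  • Tags: Sinhala summarization, abstractive summarization, extractive summarization
  • Dataset size: 1,000<n<10,000

Dataset details

  • Source: the dataset was generated from [https://huggingface.co/datasets/cnn_dailymail] with Google Translate, for Sinhala news summarization tasks.
  • Instance structure: each instance contains the article text, the summary text, and a unique identifier.
  • Data splits: the dataset is divided into training, validation, and test sets, as follows:
    • Training set: 6,000 instances
    • Validation set: 2,000 instances
    • Test set: 2,000 instances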
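
The instance structure and split sizes listed above can be sketched in a few lines of Python. This is a minimal illustration, not part of the dataset's tooling: the field names are taken from the sample instance in the dataset card, the split sizes from its statistics table, and the helper names (`missing_fields`, `summary_sentences`, `split_fractions`) are hypothetical. It assumes, based on the sample instance, that summaries store one sentence per line.

```python
from typing import Dict, List

# Field names as they appear in the sample instance of the dataset card.
EXPECTED_FIELDS = ("id", "article", "highlights", "article_sinhala", "summary_sinhala")

# Split sizes as stated in the dataset card.
SPLIT_SIZES = {"train": 6000, "validation": 2000, "test": 2000}


def missing_fields(record: Dict[str, str]) -> List[str]:
    """Return the expected field names that are absent or not strings."""
    return [name for name in EXPECTED_FIELDS if not isinstance(record.get(name), str)]


def summary_sentences(summary: str) -> List[str]:
    """Split a newline-separated summary into its individual sentences."""
    return [line.strip() for line in summary.split("\n") if line.strip()]


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of instances."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    # A shortened, made-up record with the same keys as the card's sample.
    record = {
        "id": "example-id",
        "article": "Some article text.",
        "highlights": "First highlight .\nSecond highlight .",
        "article_sinhala": "(translated article)",
        "summary_sinhala": "(translated line 1)\n(translated line 2)",
    }
    print(missing_fields(record))
    print(summary_sentences(record["highlights"]))
    print(split_fractions(SPLIT_SIZES))
```

If the Hugging Face `datasets` library is installed, real records could presumably be obtained with `load_dataset("Hamza-Ziyard/CNN-Daily-Mail-Sinhala")` and passed to the same helpers; that step needs network access and is left out of the sketch.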

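Dataset purpose

  • Purpose: to help Sri Lankan NLP developers build models that can summarize long paragraphs of text in one or two sentences.

Licensing information

  • License: the CNN / Daily Mail dataset version 1.0.0 is released under the Apache-2.0 License.

Citation information

  • References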
    • See, Abigail, Liu, Peter J., and Manning, Christopher D. "Get To The Point: Summarization with Pointer-Generator Networks." Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017.
    • Hermann, Karl Moritz, et al. "Teaching Machines to Read and Comprehend." NIPS, 2015.