
sustcsenlp/bn_news_summarization

Source: Hugging Face. Updated 2023-03-05. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/sustcsenlp/bn_news_summarization
Resource overview:
---
license: cc0-1.0
task_categories:
- summarization
language:
- bn
size_categories:
- 1K<n<10K
---

# Bengali Abstractive News Summarization (BANS)

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:** [BANS PAPER](https://doi.org/10.1007/978-981-33-4673-4_4)
- **Leaderboard:**
- **Point of Contact:** [Prithwiraj Bhattacharjee](mailto:prithwiraj_cse@lus.ac.bd)

### Dataset Summary

News and text summarization has become very popular in the NLP field. Both extractive and abstractive approaches have been implemented for different languages. A significant amount of data is a primary requirement for any summarization system, yet only a few datasets are available for Bengali. Our dataset is built for Bengali Abstractive News Summarization (BANS). Because abstractive summarization is largely neural network-based, it needs large amounts of data to perform well. We therefore built a standard Bengali abstractive summarization dataset by crawling the online Bengali news portal bangla.bdnews24.com. We crawled more than 19k articles and summaries and standardized the data.

### Downloading the data

```
from datasets import load_dataset

train = load_dataset("sustcsenlp/bn_news_summarization", split="train")
```

### Dataset Description

| Description | Data Info. |
| ----------- | ----------- |
| Total number of articles | 19096 |
| Total number of summaries | 19096 |
| Maximum number of words in an article | 76 |
| Maximum number of words in a summary | 12 |
| Minimum number of words in an article | 5 |
| Minimum number of words in a summary | 3 |

### Languages

This dataset contains Bangla text data.

## Acknowledgement

We would like to thank the Shahjalal University of Science and Technology (SUST) research center and the SUST NLP research group for their support.

### Citation Information

```
@InProceedings{10.1007/978-981-33-4673-4_4,
  author="Bhattacharjee, Prithwiraj and Mallick, Avi and Saiful Islam, Md. and Marium-E-Jannat",
  editor="Kaiser, M. Shamim and Bandyopadhyay, Anirban and Mahmud, Mufti and Ray, Kanad",
  title="Bengali Abstractive News Summarization (BANS): A Neural Attention Approach",
  booktitle="Proceedings of International Conference on Trends in Computational and Cognitive Engineering",
  year="2021",
  publisher="Springer Singapore",
  address="Singapore",
  pages="41--51",
  abstract="Abstractive summarization is the process of generating novel sentences based on the information extracted from the original text document while retaining the context. Due to abstractive summarization's underlying complexities, most of the past research work has been done on the extractive summarization approach. Nevertheless, with the triumph of the sequence-to-sequence (seq2seq) model, abstractive summarization becomes more viable. Although a significant number of notable research has been done in the English language based on abstractive summarization, only a couple of works have been done on Bengali abstractive news summarization (BANS). In this article, we presented a seq2seq based Long Short-Term Memory (LSTM) network model with attention at encoder-decoder. Our proposed system deploys a local attention-based model that produces a long sequence of words with lucid and human-like generated sentences with noteworthy information of the original document. We also prepared a dataset of more than 19 k articles and corresponding human-written summaries collected from bangla.bdnews24.com (https://bangla.bdnews24.com/) which is till now the most extensive dataset for Bengali news document summarization and publicly published in Kaggle (https://www.kaggle.com/prithwirajsust/bengali-news-summarization-dataset) We evaluated our model qualitatively and quantitatively and compared it with other published results. It showed significant improvement in terms of human evaluation scores with state-of-the-art approaches for BANS.",
  isbn="978-981-33-4673-4"
}
```

### Contributors

| Name | University |
| ----------- | ----------- |
| Prithwiraj Bhattacharjee | Shahjalal University of Science and Technology |
| Avi Mallick | Shahjalal University of Science and Technology |
| Md. Saiful Islam | Shahjalal University of Science and Technology |
| Marium-E-Jannat | Shahjalal University of Science and Technology |
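Provider:
sustcsenlp
Summary of original information

Dataset overview

Dataset name

  • Bengali Abstractive News Summarization (BANS)

Dataset description

  • Purpose: provide data for Bengali Abstractive News Summarization.
  • Data source: crawled from the online Bengali news portal bangla.bdnews24.com.
  • Data size: 19,096 articles with their corresponding summaries.
  • Language: the dataset contains Bangla text data.

Dataset details

  • Number of articles: 19,096
  • Number of summaries: 19,096
  • Article length range: minimum 5 words, maximum 76 words
  • Summary length range: minimum 3 words, maximum 12 words

License

  • License type: cc0-1.0

Dataset usage

  • Loading the dataset: import with from datasets import load_dataset, then call load_dataset("sustcsenlp/bn_news_summarization",split="train") to load the training data.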
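
Once the training split is loaded, the word-count ranges listed under the dataset details could be re-checked with a small helper like the one below. This is only a sketch under stated assumptions: the helper name `word_count_stats` and the sample strings are hypothetical, the page does not state the dataset's column names (inspect `train.column_names` before passing a column), and it does not say how the authors counted words, so whitespace tokenization may give slightly different figures.

```python
from typing import Dict, Iterable


def word_count_stats(texts: Iterable[str]) -> Dict[str, int]:
    """Return the number of texts and the min/max whitespace-token counts."""
    # str.split() with no arguments splits on any run of whitespace.
    counts = [len(text.split()) for text in texts]
    if not counts:
        raise ValueError("texts must contain at least one string")
    return {
        "count": len(counts),
        "min_words": min(counts),
        "max_words": max(counts),
    }


# Hypothetical stand-in rows, NOT taken from the dataset. Replace with a
# real column, e.g. train[column_name], after checking train.column_names.
sample_summaries = ["one two three", "one two three four five"]
stats = word_count_stats(sample_summaries)
print(stats)
```

Running the helper separately on the article column and the summary column would give two sets of figures to compare against the ranges listed above.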

Contributors

  • Prithwiraj Bhattacharjee - Shahjalal University of Science and Technology
  • Avi Mallick - Shahjalal University of Science and Technology
  • Md. Saiful Islam - Shahjalal University of Science and Technology
  • Marium-E-Jannat - Shahjalal University of Science and Technology