csebuetnlp/dailydialogue_bn

Hugging Face | Updated 2023-07-22 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/csebuetnlp/dailydialogue_bn
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- found
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- extended
task_categories:
- conversational
- text-generation
- text2text-generation
language:
- bn
license:
- cc-by-nc-sa-4.0
---

# Dataset Card for `dailydialogue_bn`

## Table of Contents

- [Dataset Card for `dailydialogue_bn`](#dataset-card-for-dailydialogue_bn)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
    - [Usage](#usage)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Repository:** [https://github.com/csebuetnlp/BanglaNLG](https://github.com/csebuetnlp/BanglaNLG)
- **Paper:** [**"BanglaNLG and BanglaT5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in Bangla"**](https://aclanthology.org/2023.findings-eacl.54/)
- **Point of Contact:** [Tahmid Hasan](mailto:tahmidhasan@cse.buet.ac.bd)

### Dataset Summary

This is a multi-turn dialogue dataset for Bengali, curated by translating the original English DailyDialogue dataset with the state-of-the-art English-to-Bengali translation model introduced **[here](https://aclanthology.org/2020.emnlp-main.207/)**.

### Supported Tasks and Leaderboards

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

### Languages

* `Bengali`

### Usage

```python
from datasets import load_dataset

dataset = load_dataset("csebuetnlp/dailydialogue_bn")
```

## Dataset Structure

### Data Instances

One example from the dataset is given below in JSON format. Each element of the `dialogue` feature represents a single turn of the conversation.

```
{
    "id": "130",
    "dialogue": [
        "তোমার জন্মদিনের জন্য তুমি কি করবে?",
        "আমি আমার বন্ধুদের সাথে পিকনিক করতে চাই, মা।",
        "বাড়িতে পার্টি হলে কেমন হয়? এভাবে আমরা একসাথে হয়ে উদযাপন করতে পারি।",
        "ঠিক আছে, মা। আমি আমার বন্ধুদের বাড়িতে আমন্ত্রণ জানাবো।"
    ]
}
```

### Data Fields

The data fields are as follows:

- `id`: a `string` feature.
- `dialogue`: a list of `string` features, one per turn.

### Data Splits

| split        | count |
|--------------|-------|
| `train`      | 11118 |
| `validation` | 1000  |
| `test`       | 1000  |

## Dataset Creation

For the training set, we translated the complete [DailyDialogue](https://aclanthology.org/N18-1101/) dataset using the English-to-Bangla translation model introduced [here](https://aclanthology.org/2020.emnlp-main.207/). Because automatic translation can introduce errors, we computed the similarity between each translated sentence and its original using their [Language-Agnostic BERT Sentence Embeddings (LaBSE)](https://arxiv.org/abs/2007.01852). A datapoint was accepted only if all of its constituent sentences had a similarity score over 0.7.
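The acceptance rule described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes cosine similarity as the similarity measure (the card does not name one), and the function names `cosine_similarity` and `accept_datapoint` as well as the toy two-dimensional vectors are hypothetical. In practice the vectors would be LaBSE embeddings of each English turn and its Bengali translation.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def accept_datapoint(
    source_embeddings: Sequence[Sequence[float]],
    translated_embeddings: Sequence[Sequence[float]],
    threshold: float = 0.7,
) -> bool:
    """Accept a dialogue only if every turn scores strictly above the threshold."""
    if len(source_embeddings) != len(translated_embeddings):
        raise ValueError("each source turn needs exactly one translated turn")
    return all(
        cosine_similarity(src, tgt) > threshold
        for src, tgt in zip(source_embeddings, translated_embeddings)
    )


# Toy stand-ins for sentence embeddings (hypothetical values, not LaBSE output).
source = [[1.0, 0.0], [0.0, 1.0]]
faithful = [[0.9, 0.1], [0.1, 0.9]]   # both turns stay close to the source
drifted = [[0.9, 0.1], [1.0, 0.2]]    # second turn diverges from the source

kept = accept_datapoint(source, faithful)
dropped = accept_datapoint(source, drifted)
```

Because the rule requires every turn to pass, a single poorly translated sentence is enough to discard the whole dialogue, which keeps accepted conversations internally consistent at the cost of some data.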

### Curation Rationale

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

### Source Data

[DailyDialogue](https://arxiv.org/abs/1606.05250)

#### Initial Data Collection and Normalization

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

#### Who are the source language producers?

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

### Annotations

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

#### Annotation process

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

#### Who are the annotators?

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

### Personal and Sensitive Information

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

## Considerations for Using the Data

### Social Impact of Dataset

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

### Discussion of Biases

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

### Other Known Limitations

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

## Additional Information

### Dataset Curators

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

### Licensing Information

Contents of this repository are restricted to non-commercial research purposes only under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright of the dataset contents belongs to the original copyright holders.

### Citation Information

If you use the dataset, please cite the following paper:

```
@inproceedings{bhattacharjee-etal-2023-banglanlg,
    title = "{B}angla{NLG} and {B}angla{T}5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in {B}angla",
    author = "Bhattacharjee, Abhik and Hasan, Tahmid and Ahmad, Wasi Uddin and Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.54",
    pages = "726--735",
    abstract = "This work presents {`}BanglaNLG,{'} a comprehensive benchmark for evaluating natural language generation (NLG) models in Bangla, a widely spoken yet low-resource language. We aggregate six challenging conditional text generation tasks under the BanglaNLG benchmark, introducing a new dataset on dialogue generation in the process. Furthermore, using a clean corpus of 27.5 GB of Bangla data, we pretrain {`}BanglaT5{'}, a sequence-to-sequence Transformer language model for Bangla. BanglaT5 achieves state-of-the-art performance in all of these tasks, outperforming several multilingual models by up to 9{\%} absolute gain and 32{\%} relative gain. We are making the new dialogue dataset and the BanglaT5 model publicly available at https://github.com/csebuetnlp/BanglaNLG in the hope of advancing future research on Bangla NLG.",
}
```

### Contributions

Thanks to [@abhik1505040](https://github.com/abhik1505040) and [@Tahmid](https://github.com/Tahmid04) for adding this dataset.
Provider:
csebuetnlp
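Summary of Original Information

Dataset Card: dailydialogue_bn

Dataset Description

Dataset Summary

This is a multi-turn dialogue dataset for Bengali, produced from the original English DailyDialogue dataset using a state-of-the-art English-to-Bengali translation model.

Supported Tasks and Leaderboards

More information needed

Languages

  • Bengali

Usage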

```python
from datasets import load_dataset

dataset = load_dataset("csebuetnlp/dailydialogue_bn")
```

Dataset Structure

Data Instances

One example from the dataset is shown below in JSON format. Each element of the dialogue feature represents a single turn of the conversation.

```json
{
    "id": "130",
    "dialogue": [
        "তোমার জন্মদিনের জন্য তুমি কি করবে?",
        "আমি আমার বন্ধুদের সাথে পিকনিক করতে চাই, মা।",
        "বাড়িতে পার্টি হলে কেমন হয়? এভাবে আমরা একসাথে হয়ে উদযাপন করতে পারি।",
        "ঠিক আছে, মা। আমি আমার বন্ধুদের বাড়িতে আমন্ত্রণ জানাবো।"
    ]
}
```

Data Fields

The data fields are as follows:

  • id: a string feature.
  • dialogue: a list of string features, one per turn.
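Since each record stores a conversation as an ordered list of turns, a common preprocessing step for response generation is to turn one record into (context, response) pairs. The sketch below is one possible way to do that, not something the dataset card prescribes: the function name `dialogue_to_pairs`, the `max_context_turns` window, and the `" | "` separator are all hypothetical choices, and the placeholder turns stand in for real Bengali text.

```python
from typing import List, Tuple


def dialogue_to_pairs(
    dialogue: List[str],
    max_context_turns: int = 3,
    separator: str = " | ",
) -> List[Tuple[str, str]]:
    """Build (context, response) pairs from the turns of one dialogue.

    Every turn after the first becomes a response; its context is the
    preceding turns, truncated to the most recent `max_context_turns`.
    """
    pairs = []
    for i in range(1, len(dialogue)):
        context_turns = dialogue[max(0, i - max_context_turns):i]
        pairs.append((separator.join(context_turns), dialogue[i]))
    return pairs


# Placeholder record with the same shape as a dataset row (hypothetical content).
record = {"id": "0", "dialogue": ["turn 1", "turn 2", "turn 3"]}

pairs = dialogue_to_pairs(record["dialogue"])
short_pairs = dialogue_to_pairs(record["dialogue"], max_context_turns=1)
```

A dialogue with n turns yields n - 1 pairs, so single-turn records produce no training examples under this scheme.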

Data Splits

| Split      | Count |
|------------|-------|
| train      | 11118 |
| validation | 1000  |
| test       | 1000  |
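As a quick sanity check on the split sizes listed above, the total number of dialogues and each split's share can be computed directly. The counts are taken from the table; the variable names are illustrative only.

```python
# Split sizes as reported in the dataset card.
split_counts = {"train": 11118, "validation": 1000, "test": 1000}

# Total number of dialogues across all splits.
total = sum(split_counts.values())

# Fraction of the dataset that falls into each split.
shares = {name: count / total for name, count in split_counts.items()}

for name, share in shares.items():
    print(f"{name}: {split_counts[name]} dialogues ({share:.2%})")
```

The validation and test splits are the same size, so evaluation on either covers an equal share of the data.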

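Dataset Creation

The training set was obtained by translating the complete DailyDialogue dataset with the English-to-Bengali translation model. Because automatic translation can introduce errors, Language-Agnostic BERT Sentence Embeddings (LaBSE) were used to compute the similarity between each translation and its original sentence. A datapoint was accepted only if all of its constituent sentences had a similarity score over 0.7.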

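Curation Rationale

More information needed

Source Data

DailyDialogue

Initial Data Collection and Normalization

More information needed

Who are the source language producers?

More information needed

Annotations

More information needed

Annotation process

More information needed

Who are the annotators?

More information needed

Personal and Sensitive Information

More information needed

Considerations for Using the Data

Social Impact of Dataset

More information needed

Discussion of Biases

More information needed

Other Known Limitations

More information needed

Additional Information

Dataset Curators

More information needed

Licensing Information

Contents of this repository are restricted to non-commercial research purposes only under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.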

Citation Information

If you use the dataset, please cite the following paper:

```
@inproceedings{bhattacharjee-etal-2023-banglanlg,
    title = "{B}angla{NLG} and {B}angla{T}5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in {B}angla",
    author = "Bhattacharjee, Abhik and Hasan, Tahmid and Ahmad, Wasi Uddin and Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.54",
    pages = "726--735",
    abstract = "This work presents {`}BanglaNLG,{'} a comprehensive benchmark for evaluating natural language generation (NLG) models in Bangla, a widely spoken yet low-resource language. We aggregate six challenging conditional text generation tasks under the BanglaNLG benchmark, introducing a new dataset on dialogue generation in the process. Furthermore, using a clean corpus of 27.5 GB of Bangla data, we pretrain {`}BanglaT5{'}, a sequence-to-sequence Transformer language model for Bangla. BanglaT5 achieves state-of-the-art performance in all of these tasks, outperforming several multilingual models by up to 9{\%} absolute gain and 32{\%} relative gain. We are making the new dialogue dataset and the BanglaT5 model publicly available at https://github.com/csebuetnlp/BanglaNLG in the hope of advancing future research on Bangla NLG.",
}
```

Contributions

Thanks to @abhik1505040 and @Tahmid for adding this dataset.
