csebuetnlp/dailydialogue_bn

Name: csebuetnlp/dailydialogue_bn
Creator: csebuetnlp
Published: 2023-07-22 07:41:50
License: cc-by-nc-sa-4.0

Source: Hugging Face. Updated: 2023-07-22. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/csebuetnlp/dailydialogue_bn

Resource description:

---
annotations_creators:
- machine-generated
language_creators:
- found
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- extended
task_categories:
- conversational
- text-generation
- text2text-generation
language:
- bn
license:
- cc-by-nc-sa-4.0
---

# Dataset Card for `dailydialogue_bn`

## Table of Contents
- [Dataset Card for `dailydialogue_bn`](#dataset-card-for-dailydialogue_bn)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
    - [Usage](#usage)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Repository:** [https://github.com/csebuetnlp/BanglaNLG](https://github.com/csebuetnlp/BanglaNLG)
- **Paper:** [**"BanglaNLG and BanglaT5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in Bangla"**](https://aclanthology.org/2023.findings-eacl.54/)
- **Point of Contact:** [Tahmid Hasan](mailto:tahmidhasan@cse.buet.ac.bd)

### Dataset Summary

This is a multi-turn dialogue dataset for Bengali, curated by translating the original English DailyDialogue dataset with the state-of-the-art English-to-Bengali translation model introduced **[here](https://aclanthology.org/2020.emnlp-main.207/).**

### Supported Tasks and Leaderboards

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

### Languages

* `Bengali`

### Usage

```python
from datasets import load_dataset

dataset = load_dataset("csebuetnlp/dailydialogue_bn")
```

## Dataset Structure

### Data Instances

One example from the dataset is given below in JSON format. Each element of the `dialogue` feature represents a single turn of the conversation.

```
{
  "id": "130",
  "dialogue": [
    "তোমার জন্মদিনের জন্য তুমি কি করবে?",
    "আমি আমার বন্ধুদের সাথে পিকনিক করতে চাই, মা।",
    "বাড়িতে পার্টি হলে কেমন হয়? এভাবে আমরা একসাথে হয়ে উদযাপন করতে পারি।",
    "ঠিক আছে, মা। আমি আমার বন্ধুদের বাড়িতে আমন্ত্রণ জানাবো।"
  ]
}
```

### Data Fields

The data fields are as follows:

- `id`: a `string` feature.
- `dialogue`: a list of `string` features.

### Data Splits

| split        | count |
|--------------|-------|
| `train`      | 11118 |
| `validation` | 1000  |
| `test`       | 1000  |

## Dataset Creation

For the training set, we translated the complete [DailyDialogue](https://aclanthology.org/N18-1101/) dataset using the English-to-Bangla translation model introduced [here](https://aclanthology.org/2020.emnlp-main.207/). Because automatic translation can introduce errors, we used the [Language-Agnostic BERT Sentence Embeddings (LaBSE)](https://arxiv.org/abs/2007.01852) of the translations and original sentences to compute their similarity. A datapoint was accepted only if all of its constituent sentences had a similarity score over 0.7.

### Curation Rationale

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

### Source Data

[DailyDialogue](https://arxiv.org/abs/1606.05250)

#### Initial Data Collection and Normalization

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

#### Who are the source language producers?

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

### Annotations

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

#### Annotation process

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

#### Who are the annotators?

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

### Personal and Sensitive Information

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

## Considerations for Using the Data

### Social Impact of Dataset

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

### Discussion of Biases

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

### Other Known Limitations

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

## Additional Information

### Dataset Curators

[More information needed](https://github.com/csebuetnlp/BanglaNLG)

### Licensing Information

Contents of this repository are restricted to non-commercial research purposes only, under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright of the dataset contents belongs to the original copyright holders.

### Citation Information

If you use the dataset, please cite the following paper:

```
@inproceedings{bhattacharjee-etal-2023-banglanlg,
    title = "{B}angla{NLG} and {B}angla{T}5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in {B}angla",
    author = "Bhattacharjee, Abhik and Hasan, Tahmid and Ahmad, Wasi Uddin and Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.54",
    pages = "726--735",
    abstract = "This work presents {`}BanglaNLG,{'} a comprehensive benchmark for evaluating natural language generation (NLG) models in Bangla, a widely spoken yet low-resource language. We aggregate six challenging conditional text generation tasks under the BanglaNLG benchmark, introducing a new dataset on dialogue generation in the process. Furthermore, using a clean corpus of 27.5 GB of Bangla data, we pretrain {`}BanglaT5{'}, a sequence-to-sequence Transformer language model for Bangla. BanglaT5 achieves state-of-the-art performance in all of these tasks, outperforming several multilingual models by up to 9{\%} absolute gain and 32{\%} relative gain. We are making the new dialogue dataset and the BanglaT5 model publicly available at https://github.com/csebuetnlp/BanglaNLG in the hope of advancing future research on Bangla NLG.",
}
```

### Contributions

Thanks to [@abhik1505040](https://github.com/abhik1505040) and [@Tahmid](https://github.com/Tahmid04) for adding this dataset.

Provider:

csebuetnlp

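Summary of original information

# Dataset Card for `dailydialogue_bn`

## Dataset Description

### Dataset Summary

This is a multi-turn dialogue dataset for Bengali, produced by translating the original English DailyDialogue dataset with a state-of-the-art English-to-Bengali translation model.

### Supported Tasks and Leaderboards

More information needed

### Languages

Bengali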

### Usage

```python
from datasets import load_dataset

dataset = load_dataset("csebuetnlp/dailydialogue_bn")
```

## Dataset Structure

### Data Instances

One example from the dataset is shown below in JSON format. Each element of the `dialogue` feature represents a single turn of the conversation.

```json
{
  "id": "130",
  "dialogue": [
    "তোমার জন্মদিনের জন্য তুমি কি করবে?",
    "আমি আমার বন্ধুদের সাথে পিকনিক করতে চাই, মা।",
    "বাড়িতে পার্টি হলে কেমন হয়? এভাবে আমরা একসাথে হয়ে উদযাপন করতে পারি।",
    "ঠিক আছে, মা। আমি আমার বন্ধুদের বাড়িতে আমন্ত্রণ জানাবো।"
  ]
}
```
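Since each record stores a whole conversation as a list of turns, a dialogue-generation model usually needs it unrolled into (context, response) pairs. The sketch below shows one way to do that on the sample record. The function name `to_context_response_pairs` and the `" | "` separator are illustrative assumptions, not the format used by the BanglaNLG benchmark.

```python
from typing import Dict, List, Tuple

# Sample record copied from the dataset card.
record: Dict[str, object] = {
    "id": "130",
    "dialogue": [
        "তোমার জন্মদিনের জন্য তুমি কি করবে?",
        "আমি আমার বন্ধুদের সাথে পিকনিক করতে চাই, মা।",
        "বাড়িতে পার্টি হলে কেমন হয়? এভাবে আমরা একসাথে হয়ে উদযাপন করতে পারি।",
        "ঠিক আছে, মা। আমি আমার বন্ধুদের বাড়িতে আমন্ত্রণ জানাবো।",
    ],
}


def to_context_response_pairs(
    rec: Dict[str, object], sep: str = " | "
) -> List[Tuple[str, str]]:
    """Unroll one dialogue into (context, response) pairs.

    Hypothetical helper: the context is every turn before the response,
    joined with `sep` (an assumed separator, not a dataset convention).
    """
    turns: List[str] = list(rec["dialogue"])  # type: ignore[arg-type]
    pairs = []
    for i in range(1, len(turns)):
        context = sep.join(turns[:i])
        pairs.append((context, turns[i]))
    return pairs


pairs = to_context_response_pairs(record)
print(len(pairs))  # a dialogue with n turns yields n - 1 pairs
```

The same function can be applied to every record of a loaded split, for example through `dataset["train"]` after the `load_dataset` call shown in the Usage section.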

### Data Fields

The data fields are as follows:

- `id`: a `string` feature.
- `dialogue`: a list of `string` features.

### Data Splits

| split        | count |
|--------------|-------|
| `train`      | 11118 |
| `validation` | 1000  |
| `test`       | 1000  |
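As a quick sanity check on the split table, the short sketch below totals the counts and computes each split's share of the dataset. The counts come from the table; the variable names are only illustrative.

```python
# Split sizes as listed in the Data Splits table.
split_counts = {"train": 11118, "validation": 1000, "test": 1000}

total = sum(split_counts.values())  # 13118 dialogues overall
fractions = {name: count / total for name, count in split_counts.items()}

for name, count in split_counts.items():
    print(f"{name}: {count} dialogues ({fractions[name]:.1%} of the total)")
```

Comparing these counts with `len(dataset[split])` after loading is an easy way to confirm that a download is complete.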

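## Dataset Creation

The training set was obtained by translating the complete DailyDialogue dataset with an English-to-Bengali translation model. Because automatic translation can introduce errors, we used Language-Agnostic BERT Sentence Embeddings (LaBSE) to compute the similarity between each translation and its original sentence. A datapoint was accepted only if every one of its constituent sentences had a similarity score over 0.7.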
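The acceptance rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline. It assumes the similarity is cosine similarity between sentence embeddings, and it takes the embedding function as a parameter so that the example runs without downloading a model. In practice `embed` would wrap a LaBSE encoder. The names `accept_dialogue`, `toy_embed`, and the toy vectors are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def accept_dialogue(
    source_turns: List[str],
    translated_turns: List[str],
    embed: Callable[[str], Vector],
    threshold: float = 0.7,
) -> bool:
    """Accept a dialogue only if every turn pair scores over `threshold`."""
    if len(source_turns) != len(translated_turns):
        return False  # misaligned turns cannot be compared pairwise
    return all(
        cosine_similarity(embed(src), embed(tgt)) > threshold
        for src, tgt in zip(source_turns, translated_turns)
    )


# Toy stand-in for a real sentence encoder (assumption for illustration).
TOY_VECTORS: Dict[str, List[float]] = {
    "hello": [1.0, 0.0],
    "হ্যালো": [0.9, 0.1],
    "bye": [0.0, 1.0],
    "bad translation": [1.0, 0.0],
}


def toy_embed(sentence: str) -> Vector:
    return TOY_VECTORS[sentence]


print(accept_dialogue(["hello"], ["হ্যালো"], toy_embed))
print(accept_dialogue(["hello", "bye"], ["হ্যালো", "bad translation"], toy_embed))
```

Requiring every turn to pass, rather than averaging the scores, means that a single poorly translated sentence discards the whole dialogue. That keeps each retained conversation coherent from start to finish.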

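## Considerations for Using the Data

### Social Impact of Dataset

More information needed

### Discussion of Biases

More information needed

### Other Known Limitations

More information needed

## Additional Information

### Dataset Curators

More information needed

### Licensing Information

The contents of this repository are restricted to non-commercial research purposes under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.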

### Citation Information

If you use the dataset, please cite the following paper:

```
@inproceedings{bhattacharjee-etal-2023-banglanlg,
    title = "{B}angla{NLG} and {B}angla{T}5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in {B}angla",
    author = "Bhattacharjee, Abhik and Hasan, Tahmid and Ahmad, Wasi Uddin and Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.54",
    pages = "726--735",
    abstract = "This work presents {`}BanglaNLG,{'} a comprehensive benchmark for evaluating natural language generation (NLG) models in Bangla, a widely spoken yet low-resource language. We aggregate six challenging conditional text generation tasks under the BanglaNLG benchmark, introducing a new dataset on dialogue generation in the process. Furthermore, using a clean corpus of 27.5 GB of Bangla data, we pretrain {`}BanglaT5{'}, a sequence-to-sequence Transformer language model for Bangla. BanglaT5 achieves state-of-the-art performance in all of these tasks, outperforming several multilingual models by up to 9{\%} absolute gain and 32{\%} relative gain. We are making the new dialogue dataset and the BanglaT5 model publicly available at https://github.com/csebuetnlp/BanglaNLG in the hope of advancing future research on Bangla NLG.",
}
```

### Contributions

Thanks to @abhik1505040 and @Tahmid for adding this dataset.

The content above was collected and summarized by 遇见数据集.
