csebuetnlp/BanglaNMT
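Dataset Overview
Dataset Name
- Name: BanglaNMT
Dataset Summary
- Summary: Described by its authors as the largest Bengali-English machine translation (MT) dataset, curated using novel sentence alignment methods. This release is a filtered version of the original dataset that the authors used for NMT training.
Supported Tasks and Leaderboards
- Info: More information needed.
Languages
- Languages: Bengali, English
Usage Example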

```python
from datasets import load_dataset

dataset = load_dataset("csebuetnlp/BanglaNMT")
```

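Once loaded, each split can be walked as a sequence of sentence pairs. The following is a minimal sketch, not part of the dataset's own tooling: the helper name `iter_pairs` and its `src`/`tgt` parameters are assumptions made for illustration, and a one-element list holding the example instance from this card stands in for a real split so that the snippet runs without a download.

```python
from typing import Dict, Iterable, Iterator, Tuple


def iter_pairs(
    records: Iterable[Dict[str, str]], src: str = "bn", tgt: str = "en"
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from records with `bn`/`en` fields."""
    for record in records:
        yield record[src], record[tgt]


# Stand-in for one split: the single example instance given on this card.
sample_split = [
    {
        "bn": "বিমানবন্দরে যুক্তরাজ্যে নিযুক্ত বাংলাদেশ হাইকমিশনার সাঈদা মুনা তাসনীম ও লন্ডনে বাংলাদেশ মিশনের জ্যেষ্ঠ কর্মকর্তারা তাকে বিদায় জানান।",
        "en": "Bangladesh High Commissioner to the United Kingdom Saida Muna Tasneen and senior officials of Bangladesh Mission in London saw him off at the airport.",
    }
]

# Bengali-to-English is the default direction; swap src/tgt for the reverse.
bn_to_en = list(iter_pairs(sample_split))
en_to_bn = list(iter_pairs(sample_split, src="en", tgt="bn"))
print(bn_to_en[0][1])
```

With the real dataset, the same helper would presumably be called as `iter_pairs(dataset["train"])`, assuming each row is yielded as a dict with `bn` and `en` keys, as the card's field list suggests.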
Dataset Structure
Data Instances
- Example:

```json
{
  "bn": "বিমানবন্দরে যুক্তরাজ্যে নিযুক্ত বাংলাদেশ হাইকমিশনার সাঈদা মুনা তাসনীম ও লন্ডনে বাংলাদেশ মিশনের জ্যেষ্ঠ কর্মকর্তারা তাকে বিদায় জানান।",
  "en": "Bangladesh High Commissioner to the United Kingdom Saida Muna Tasneen and senior officials of Bangladesh Mission in London saw him off at the airport."
}
```

Data Fields
- Fields:
  - bn: a string feature holding the Bengali sentence.
  - en: a string feature holding the English translation.
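The field description above can be turned into a small sanity check before records are fed to a training pipeline. This is a sketch under stated assumptions: the helper name `is_valid_pair` is hypothetical, the rule that empty strings count as invalid is an added assumption rather than something the card states, and the short records below are made-up placeholders, not dataset rows.

```python
def is_valid_pair(record: dict) -> bool:
    """Return True if `bn` and `en` are both present as non-empty strings."""
    return all(
        isinstance(record.get(key), str) and record[key].strip() != ""
        for key in ("bn", "en")
    )


# Hypothetical placeholder records, used only to exercise the check.
complete = {"bn": "ধন্যবাদ", "en": "Thank you"}
missing_en = {"bn": "ধন্যবাদ"}
blank_en = {"bn": "ধন্যবাদ", "en": "   "}

for record in (complete, missing_en, blank_en):
    print(is_valid_pair(record))
```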
Data Splits
| Split | Count |
|---|---|
| train | 2379749 |
| validation | 597 |
| test | 1000 |
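To put the split sizes in proportion, the counts from the table can be summed and converted to fractions. This is only an arithmetic sketch over the numbers listed above; the names `SPLIT_COUNTS` and `split_fractions` are illustrative and do not come from the dataset or the `datasets` library.

```python
# Counts copied from the split table on this card.
SPLIT_COUNTS = {"train": 2379749, "validation": 597, "test": 1000}


def split_fractions(counts: dict) -> dict:
    """Map each split name to its share of the total number of sentence pairs."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_pairs = sum(SPLIT_COUNTS.values())
fractions = split_fractions(SPLIT_COUNTS)

print(f"total: {total_pairs} pairs")
for name, frac in fractions.items():
    print(f"{name}: {SPLIT_COUNTS[name]} pairs ({frac:.4%})")
```

The validation and test splits together hold 1597 pairs, so nearly all of the data sits in the training split.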
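Dataset Creation
Licensing Information
- License: The contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.
Citation Information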
- Citation:

```
@inproceedings{hasan-etal-2020-low,
    title = "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for {B}engali-{E}nglish Machine Translation",
    author = "Hasan, Tahmid and
      Bhattacharjee, Abhik and
      Samin, Kazi and
      Hasan, Masum and
      Basak, Madhusudan and
      Rahman, M. Sohel and
      Shahriyar, Rifat",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.207",
    doi = "10.18653/v1/2020.emnlp-main.207",
    pages = "2612--2623",
    abstract = "...",
}
```

Contributors
- Contributors: @abhik1505040, @Tahmid04



