Ichsan2895/OASST_Top1_Indonesian
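Dataset Overview
License
- The dataset is released under the CC BY-SA 4.0 license.
Languages
- The dataset contains Indonesian (id) and English (en).
Dataset Size
- The dataset size is between 1K and 10K entries.
Task Categories
- The dataset is suited to question-answering tasks.
Data Source
- The base dataset is OpenAssistant/oasst1.
- Only English-language, top-ranked entries were selected for processing.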
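The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's actual script: it assumes each message is a dict with the fields `message_id`, `parent_id`, `role`, `lang`, `rank`, and `text`, that `rank == 0` marks the top-ranked reply, and that only single prompt/reply pairs are kept. The function name and the output keys `instruction` and `output` are hypothetical.

```python
def select_top1_english_pairs(messages):
    """Pair each English prompt with its top-ranked English assistant reply.

    Assumptions (not confirmed by the dataset card): messages are dicts with
    the keys used below, and rank 0 denotes the best-ranked reply.
    """
    by_id = {m["message_id"]: m for m in messages}
    pairs = []
    for m in messages:
        # Keep only English assistant replies.
        if m.get("role") != "assistant" or m.get("lang") != "en":
            continue
        # Keep only the top-ranked reply (unranked messages have rank None).
        if m.get("rank") != 0:
            continue
        # The parent message must exist and also be in English.
        parent = by_id.get(m.get("parent_id"))
        if parent is None or parent.get("lang") != "en":
            continue
        pairs.append({"instruction": parent["text"], "output": m["text"]})
    return pairs
```

Messages in other languages and lower-ranked replies are dropped, so each retained prompt contributes at most one pair per top-ranked reply.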
Data Processing
- The selected data was translated into Indonesian with Marian NMT, using the pretrained model Helsinki-NLP/opus-mt-en-id.
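A translation pass of this kind might look like the sketch below, which loads Helsinki-NLP/opus-mt-en-id through the Hugging Face `transformers` Marian classes. The dataset card does not say which toolchain was actually used, so the choice of `transformers`, the helper names `batched` and `translate_en_to_id`, and the batch size are all assumptions made for illustration. Running it requires `transformers`, `sentencepiece`, and PyTorch, and downloads the model on first use.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_en_to_id(texts, model_name="Helsinki-NLP/opus-mt-en-id", batch_size=8):
    """Translate English strings to Indonesian with a pretrained Marian model."""
    # Imported lazily so that `batched` stays usable without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    translations = []
    for batch in batched(list(texts), batch_size):
        # Pad to the longest item in the batch; truncate over-long inputs.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**inputs)
        translations.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return translations
```

Truncation means very long messages lose their tail, so a real pipeline would likely split long texts into sentences or paragraphs before translating.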
Citation
@InProceedings{mariannmt,
  title = {Marian: Fast Neural Machine Translation in {C++}},
  author = {Junczys-Dowmunt, Marcin and Grundkiewicz, Roman and Dwojak, Tomasz and Hoang, Hieu and Heafield, Kenneth and Neckermann, Tom and Seide, Frank and Germann, Ulrich and Fikri Aji, Alham and Bogoychev, Nikolay and Martins, Andr{\'e} F. T. and Birch, Alexandra},
  booktitle = {Proceedings of ACL 2018, System Demonstrations},
  pages = {116--121},
  publisher = {Association for Computational Linguistics},
  year = {2018},
  month = {July},
  address = {Melbourne, Australia},
  url = {http://www.aclweb.org/anthology/P18-4020}
}
@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} --- {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
}



