SEACrowd/tydiqa_id
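Dataset Overview
- Name: tydiqa_id
- Source: collected from Wikipedia articles, with human-annotated question-answer pairs; the original TyDiQA dataset covers 11 languages.
- Language: Indonesian (ind)
- Data processing: IndoNLG uses the Indonesian data from the secondary Gold Passage task of the original TyDiQA dataset and randomly holds out 15% of the training data as the test set.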
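The 15% hold-out described in the data processing note can be illustrated with a minimal sketch. This is not the IndoNLG authors' actual script: the function name `split_train_test`, the seed, and the toy records are assumptions made for illustration only.

```python
import random


def split_train_test(examples, test_fraction=0.15, seed=0):
    """Randomly hold out a fraction of the examples as a test set.

    Hypothetical helper: the real IndoNLG split may use a different
    seed, rounding rule, or sampling procedure.
    """
    rng = random.Random(seed)              # local RNG keeps the split reproducible
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    n_test = round(len(examples) * test_fraction)
    test_idx = set(indices[:n_test])       # first n_test shuffled indices go to test
    train = [ex for i, ex in enumerate(examples) if i not in test_idx]
    test = [ex for i, ex in enumerate(examples) if i in test_idx]
    return train, test


# Toy stand-in for the training data (not real TyDiQA records).
examples = [{"id": i} for i in range(100)]
train, test = split_train_test(examples)
```

Fixing the seed makes the split repeatable, which matters when a derived test set is meant to be shared across experiments.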
Usage
- Install dependencies: run pip install nusacrowd.
- Load the dataset: use HuggingFace's load_dataset method.
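A minimal loading sketch follows, assuming the HuggingFace `datasets` package is installed and that the repository id matches the page title, `SEACrowd/tydiqa_id`. The wrapper name `load_tydiqa_id` is hypothetical, and depending on the installed `datasets` version the call may need extra arguments (for example a configuration name or `trust_remote_code=True`); none of that is confirmed by this card.

```python
def load_tydiqa_id(repo_id="SEACrowd/tydiqa_id"):
    """Load the dataset from the HuggingFace Hub.

    Hypothetical wrapper: repo_id is assumed from the page title.
    """
    # Imported lazily so that defining the function does not require
    # the `datasets` package to be installed.
    from datasets import load_dataset

    return load_dataset(repo_id)
```

Calling `load_tydiqa_id()` requires network access to the Hub; the names of the returned splits and columns should be inspected rather than assumed.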
Citation
@article{clark-etal-2020-tydi,
  title = "{T}y{D}i {QA}: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages",
  author = "Clark, Jonathan H. and Choi, Eunsol and Collins, Michael and Garrette, Dan and Kwiatkowski, Tom and Nikolaev, Vitaly and Palomaki, Jennimaria",
  journal = "Transactions of the Association for Computational Linguistics",
  volume = "8",
  year = "2020",
  address = "Cambridge, MA",
  publisher = "MIT Press",
  url = "https://aclanthology.org/2020.tacl-1.30",
  doi = "10.1162/tacl_a_00317",
  pages = "454--470"
}
@inproceedings{cahyawijaya-etal-2021-indonlg,
  title = "{I}ndo{NLG}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Generation",
  author = "Cahyawijaya, Samuel and Winata, Genta Indra and Wilie, Bryan and Vincentio, Karissa and Li, Xiaohong and Kuncoro, Adhiguna and Ruder, Sebastian and Lim, Zhi Yuan and Bahar, Syafri and Khodra, Masayu and Purwarianti, Ayu and Fung, Pascale",
  booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2021",
  address = "Online and Punta Cana, Dominican Republic",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.emnlp-main.699",
  doi = "10.18653/v1/2021.emnlp-main.699",
  pages = "8875--8898"
}
License
- Type: Creative Commons Attribution Share-Alike 4.0 International



