
SEACrowd/tydiqa_id

Source: Hugging Face. Updated 2023-09-26; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/SEACrowd/tydiqa_id
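Resource overview:
TyDiQA is a question answering dataset collected from Wikipedia articles, with human-annotated question-answer pairs in 11 languages. IndoNLG uses the Indonesian data from TyDiQA's secondary Gold Passage task and randomly holds out 15% of the training data as the test set.
Provider: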
SEACrowd
Summary of original information

tydiqa_id

Dataset overview

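  • Name: tydiqa_id
  • Source: collected from Wikipedia articles; human-annotated question-answer pairs covering 11 languages.
  • Language: Indonesian (ind)
  • Data processing: IndoNLG uses the Indonesian data from the secondary Gold Passage task of the original TyDiQA dataset and randomly samples 15% of the training data as the test set.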
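
The random 15% hold-out described above can be sketched as follows. This is only an illustration of the general technique: the function name `split_train_test`, the seed, and the toy records are assumptions, and the actual seed and procedure used by IndoNLG are not stated here.

```python
import random


def split_train_test(examples, test_fraction=0.15, seed=0):
    """Randomly hold out `test_fraction` of `examples` as a test set.

    Hypothetical helper: the seed and rounding rule are illustrative
    assumptions, not the procedure IndoNLG actually used.
    """
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    n_test = round(len(examples) * test_fraction)
    test_idx = set(indices[:n_test])
    # Preserve the original order inside each split.
    train = [ex for i, ex in enumerate(examples) if i not in test_idx]
    test = [ex for i, ex in enumerate(examples) if i in test_idx]
    return train, test


# Toy records standing in for question-answer pairs (not real data).
toy = [{"id": i} for i in range(100)]
train, test = split_train_test(toy)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which matters when the held-out portion serves as a shared test set.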

Usage

  • Install the dependency: run pip install nusacrowd
  • Load the dataset: use HuggingFace's load_dataset method.
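
The two steps above might look like the following in Python. This is a sketch under assumptions: the wrapper name `load_tydiqa_id` is hypothetical, the repository ID is taken from the page title, and the `datasets` package is assumed to be installed. Depending on the installed `datasets` version, script-based repositories may need extra configuration or may not be supported.

```python
# Repository ID as shown in the page title.
REPO_ID = "SEACrowd/tydiqa_id"


def load_tydiqa_id(split=None):
    """Load the dataset from the Hugging Face Hub (hypothetical wrapper).

    With split=None, load_dataset returns all available splits;
    pass e.g. split="train" to get a single split.
    """
    # Imported lazily so this module can be imported without the dependency.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

Calling `load_tydiqa_id()` requires network access and downloads the data on first use; the available split names are not listed on this page, so inspect the returned object to see them.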

Citation

@article{clark-etal-2020-tydi,
    title = "{T}y{D}i {QA}: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages",
    author = "Clark, Jonathan H. and Choi, Eunsol and Collins, Michael and Garrette, Dan and Kwiatkowski, Tom and Nikolaev, Vitaly and Palomaki, Jennimaria",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "8",
    year = "2020",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/2020.tacl-1.30",
    doi = "10.1162/tacl_a_00317",
    pages = "454--470",
}

@inproceedings{cahyawijaya-etal-2021-indonlg,
    title = "{I}ndo{NLG}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Generation",
    author = "Cahyawijaya, Samuel and Winata, Genta Indra and Wilie, Bryan and Vincentio, Karissa and Li, Xiaohong and Kuncoro, Adhiguna and Ruder, Sebastian and Lim, Zhi Yuan and Bahar, Syafri and Khodra, Masayu and Purwarianti, Ayu and Fung, Pascale",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.699",
    doi = "10.18653/v1/2021.emnlp-main.699",
    pages = "8875--8898"
}

License

  • Type: Creative Commons Attribution-ShareAlike 4.0 International