tydi_xor_rc
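Dataset Overview
Dataset Description
Dataset Summary
- Name: XORQA Reading Comprehension
- Type: multilingual reading comprehension dataset
- Source: combines the reading comprehension data from XORQA with the English data from XOR-AttriQA.
- Languages: English plus 7 other languages (Arabic, Bengali, Finnish, Japanese, Korean, Russian, Telugu).
- Task: question answering, specifically extractive QA.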
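Dataset Structure
- Dataset size: the training set contains 15445 examples and the validation set contains 3646 examples.
- Data format: can be loaded with the Hugging Face datasets library.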
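As a minimal sketch of loading the data with the datasets library: the Hub identifier "coastalcph/tydi_xor_rc" and the split names "train" and "validation" are assumptions that this card does not state, so adjust them to wherever the dataset is actually hosted. The helper name load_tydi_xor_rc is hypothetical. It compares the loaded split sizes against the counts listed above.

```python
# Split sizes as stated in this card.
EXPECTED_SIZES = {"train": 15445, "validation": 3646}

# ASSUMPTION: the Hub repository id is not given in this card.
DATASET_ID = "coastalcph/tydi_xor_rc"


def load_tydi_xor_rc(dataset_id: str = DATASET_ID):
    """Load the dataset and report whether split sizes match the card.

    Hypothetical helper; needs the `datasets` package and network access.
    """
    # Imported lazily so the module can be imported without `datasets`.
    from datasets import load_dataset

    ds = load_dataset(dataset_id)
    for split, expected in EXPECTED_SIZES.items():
        actual = ds[split].num_rows
        status = "ok" if actual == expected else "differs from the card"
        print(f"{split}: {actual} rows ({status})")
    return ds
```

Calling load_tydi_xor_rc() downloads the data on first use. A mismatch in the printed counts may simply mean the hosted version differs from the one described here.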
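Data Instances
- Column descriptions:
  - lang: language of the question
  - question: text of the question
  - context: English Wikipedia passage that may contain the answer
  - answerable: whether the question can be answered from the context
  - answer_start: start position of the answer in the context (-1 if the question is not answerable)
  - answer: the answer in English, a text span from the context (yes or no if the question is not answerable)
  - answer_inlang: the answer in the language of the question (available for some instances only)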
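The relationship between context, answer_start, and answer can be sketched as follows. The two records are made up for illustration and are not taken from the dataset. The helper span_is_consistent is hypothetical, and it assumes answer_start is a character offset into context, which the column description suggests but does not state explicitly.

```python
def span_is_consistent(example: dict) -> bool:
    """Check that `answer` is the substring of `context` at `answer_start`.

    ASSUMPTION: `answer_start` is a character offset. Unanswerable
    examples (answer_start == -1) have no span to verify.
    """
    start = example["answer_start"]
    if start == -1:
        return True
    answer = example["answer"]
    return example["context"][start:start + len(answer)] == answer


# Made-up records for illustration only (not real dataset rows).
answerable_example = {
    "lang": "fi",
    "question": "Mikä on Suomen pääkaupunki?",
    "context": "The capital of Finland is Helsinki.",
    "answer_start": 26,
    "answer": "Helsinki",
}
unanswerable_example = {
    "lang": "fi",
    "question": "Onko Helsinki Ruotsin pääkaupunki?",
    "context": "The capital of Finland is Helsinki.",
    "answer_start": -1,
    "answer": "no",
}

for record in (answerable_example, unanswerable_example):
    print(record["answer"], span_is_consistent(record))
```

A check like this can be useful before training an extractive model, because misaligned offsets would otherwise produce wrong span labels.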
Citation
- TyDi QA:
  @article{clark-etal-2020-tydi,
    title = "{T}y{D}i {QA}: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages",
    author = "Clark, Jonathan H. and Choi, Eunsol and Collins, Michael and Garrette, Dan and Kwiatkowski, Tom and Nikolaev, Vitaly and Palomaki, Jennimaria",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "8",
    year = "2020",
    pages = "454--470",
  }
- XOR QA:
  @inproceedings{asai-etal-2021-xor,
    title = "{XOR} {QA}: Cross-lingual Open-Retrieval Question Answering",
    author = "Asai, Akari and Kasai, Jungo and Clark, Jonathan and Lee, Kenton and Choi, Eunsol and Hajishirzi, Hannaneh",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    pages = "547--564",
  }
- XOR-AttriQA:
  @inproceedings{muller-etal-2023-evaluating,
    title = "Evaluating and Modeling Attribution for Cross-Lingual Question Answering",
    author = "Muller, Benjamin and Wieting, John and Clark, Jonathan and Kwiatkowski, Tom and Ruder, Sebastian and Soares, Livio and Aharoni, Roee and Herzig, Jonathan and Wang, Xinyi",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    pages = "144--157",
  }




