castorini/mr-tydi
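Dataset Overview
Mr. TyDi is a multilingual benchmark dataset built on TyDi, covering eleven typologically diverse languages. It is designed for monolingual retrieval, specifically to evaluate ranking with learned dense representations.
Dataset Structure
The dataset's only configuration is `language`. For each language, the data is divided into three splits: train, dev, and test. The negative examples in the training set are sampled from the top-30 BM25 run files for each language. In addition, the training data of all languages is merged under the `combined` configuration.
Training set example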
```json
{
  "query_id": "1",
  "query": "When was quantum field theory developed?",
  "positive_passages": [
    {
      "docid": "25267#12",
      "title": "Quantum field theory",
      "text": "Quantum field theory naturally began with the study of electromagnetic interactions, as the electromagnetic field was the only known classical field as of the 1920s."
    },
    ...
  ],
  "negative_passages": [
    {
      "docid": "346489#8",
      "title": "Local quantum field theory",
      "text": "More recently, the approach has been further implemented to include an algebraic version of quantum field ..."
    },
    ...
  ]
}
```
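One common way to consume a training record of this shape is to expand it into (query, positive, negative) text triples for contrastive training of a dense retriever. The following is a minimal sketch under stated assumptions: the helper name `iter_triples` is hypothetical, the sample record is a shortened copy of the training example with one passage per list, and pairing every positive with every negative is an illustrative choice rather than the procedure used by the dataset authors.

```python
from typing import Dict, Iterator, Tuple

# A record shaped like the training example (texts shortened for illustration).
record = {
    "query_id": "1",
    "query": "When was quantum field theory developed?",
    "positive_passages": [
        {
            "docid": "25267#12",
            "title": "Quantum field theory",
            "text": "Quantum field theory naturally began with the study of electromagnetic interactions ...",
        },
    ],
    "negative_passages": [
        {
            "docid": "346489#8",
            "title": "Local quantum field theory",
            "text": "More recently, the approach has been further implemented ...",
        },
    ],
}


def iter_triples(rec: Dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (query, positive_text, negative_text) for every positive/negative pair.

    Hypothetical helper: pairing all positives with all negatives is an assumption.
    """
    for pos in rec["positive_passages"]:
        for neg in rec["negative_passages"]:
            yield rec["query"], pos["text"], neg["text"]


triples = list(iter_triples(record))
print(len(triples))
```

With the full dataset, the same helper could be applied to each row of the train split; in practice the negatives are often subsampled per query instead of being fully enumerated.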
Dev and test set example
```json
{
  "query_id": "0",
  "query": "Is Creole a pidgin of French?",
  "positive_passages": [
    {
      "docid": "3716905#1",
      "title": "",
      "text": ""
    },
    ...
  ]
}
```
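In the dev and test example, `title` and `text` are empty, so the `docid` values are what identify the relevant passages. A ranked list from a retriever can therefore be scored by comparing its docids against the positives. The sketch below computes a reciprocal rank for one query. The function name `reciprocal_rank`, the cutoff `k=100`, and the `ranked` list are assumptions made for illustration; they do not come from the dataset card.

```python
from typing import List


def reciprocal_rank(ranked_docids: List[str], positive_docids: List[str], k: int = 100) -> float:
    """Return 1/rank of the first relevant docid within the top k, or 0.0 if none is found."""
    positives = set(positive_docids)
    for rank, docid in enumerate(ranked_docids[:k], start=1):
        if docid in positives:
            return 1.0 / rank
    return 0.0


# A record shaped like the dev/test example, trimmed to a single positive passage.
record = {
    "query_id": "0",
    "query": "Is Creole a pidgin of French?",
    "positive_passages": [{"docid": "3716905#1", "title": "", "text": ""}],
}
positives = [p["docid"] for p in record["positive_passages"]]

# Hypothetical retriever output: the relevant passage appears at rank 2.
ranked = ["111#0", "3716905#1", "222#5"]
rr = reciprocal_rank(ranked, positives)
print(rr)  # 0.5
```

Averaging this value over all queries in a split gives a mean reciprocal rank for the run.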
Loading the dataset
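An example of loading the dataset:

```python
from datasets import load_dataset

language = "english"

# Load the train, dev, and test sets together
dataset = load_dataset("castorini/mr-tydi", language)

# Or load a specific set, passing its name via the `split` keyword
set_name = "train"
dataset = load_dataset("castorini/mr-tydi", language, split=set_name)
```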
Note that the `combined` option has only the train set.
Citation Information
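Because `combined` offers only a train set, it can help to validate a configuration and set name before requesting them. The following is a small sketch. The helper names `splits_for` and `check_split` are hypothetical and not part of the `datasets` library, and the rule they encode is only what this card states: three sets per language, train only for `combined`.

```python
from typing import List


def splits_for(config: str) -> List[str]:
    """Return the set names this card describes for a given configuration."""
    if config == "combined":
        return ["train"]
    return ["train", "dev", "test"]


def check_split(config: str, set_name: str) -> None:
    """Raise ValueError when the requested set is not available for the configuration."""
    available = splits_for(config)
    if set_name not in available:
        raise ValueError(
            f"configuration {config!r} has no {set_name!r} set; available: {available}"
        )


check_split("english", "dev")  # passes silently
print(splits_for("combined"))  # ['train']
```

Calling `check_split("combined", "dev")` raises a `ValueError` before any download is attempted.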
```plaintext
@article{mrtydi,
    title={{Mr. TyDi}: A Multi-lingual Benchmark for Dense Retrieval},
    author={Xinyu Zhang and Xueguang Ma and Peng Shi and Jimmy Lin},
    year={2021},
    journal={arXiv:2108.08787},
}
```



