castorini/mr-tydi

Name: castorini/mr-tydi
Creator: castorini
Published: 2022-10-12 20:25:19
License: apache-2.0

Hugging Face · Updated 2022-10-12 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/castorini/mr-tydi

Resource description:

---
language:
- ar
- bn
- en
- fi
- id
- ja
- ko
- ru
- sw
- te
- th
multilinguality:
- multilingual
task_categories:
- text-retrieval
license: apache-2.0
---

# Dataset Summary

Mr. TyDi is a multi-lingual benchmark dataset built on TyDi, covering eleven typologically diverse languages. It is designed for monolingual retrieval, specifically to evaluate ranking with learned dense representations.

This dataset stores the queries, judgements, and example training data of Mr. TyDi. To access the corpus, please refer to [castorini/mr-tydi-corpus](https://huggingface.co/datasets/castorini/mr-tydi-corpus).

# Dataset Structure

The only configuration here is the `language`. For each language, there are three splits: `train`, `dev`, and `test`. The negative examples in the training set are sampled from the top-30 BM25 runfiles for each language. In addition, the **training** data for all languages is combined under the `combined` configuration.

An example from the `train` set looks as follows:

```
{
  'query_id': '1',
  'query': 'When was quantum field theory developed?',
  'positive_passages': [
    {
      'docid': '25267#12',
      'title': 'Quantum field theory',
      'text': 'Quantum field theory naturally began with the study of electromagnetic interactions, as the electromagnetic field was the only known classical field as of the 1920s.'
    },
    ...
  ],
  'negative_passages': [
    {
      'docid': '346489#8',
      'title': 'Local quantum field theory',
      'text': 'More recently, the approach has been further implemented to include an algebraic version of quantum field ...'
    },
    ...
  ],
}
```

An example from the `dev` and `test` sets looks as follows. Only the docid of each positive passage is provided, to save space, and no candidate passages are provided at this point. Note that to perform retrieval, this dataset needs to be used together with [castorini/mr-tydi-corpus](https://huggingface.co/datasets/castorini/mr-tydi-corpus).

```
{
  'query_id': '0',
  'query': 'Is Creole a pidgin of French?',
  'positive_passages': [
    {
      'docid': '3716905#1',
      'title': '',
      'text': ''
    },
    ...
  ]
}
```

# Load Dataset

An example of loading the dataset:

```
from datasets import load_dataset

language = 'english'

# to load all train, dev and test sets
dataset = load_dataset('castorini/mr-tydi', language)

# or to load a specific set (the split must be passed by keyword):
set_name = 'train'
dataset = load_dataset('castorini/mr-tydi', language, split=set_name)
```

Note that the 'combined' option has only the 'train' set.

# Citation Information

```
@article{mrtydi,
  title={{Mr. TyDi}: A Multi-lingual Benchmark for Dense Retrieval},
  author={Xinyu Zhang and Xueguang Ma and Peng Shi and Jimmy Lin},
  year={2021},
  journal={arXiv:2108.08787},
}
```

Provider:

castorini

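Summary of original information

Dataset overview

Mr. TyDi is a multilingual benchmark dataset built on TyDi, covering eleven typologically diverse languages. It is designed for monolingual retrieval, specifically to evaluate ranking with learned dense representations.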

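Dataset structure

The only configuration of the dataset is language. For each language, the dataset has three splits: train, dev, and test. The negative examples in the training set are sampled from the top-30 BM25 runfiles for each language. In addition, the training data for all languages is merged under the combined configuration.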
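The negative-sampling step described above can be illustrated with a minimal sketch. The source only says that negatives come from the top-30 BM25 runfiles; the function name `sample_negatives`, the number of negatives per query, the fixed seed, the exclusion of known positives, and the `doc…` identifiers are all assumptions made for illustration, not the authors' actual procedure.

```python
import random
from typing import List, Sequence, Set


def sample_negatives(
    ranked_docids: Sequence[str],
    positive_docids: Set[str],
    top_k: int = 30,
    num_negatives: int = 1,
    seed: int = 0,
) -> List[str]:
    """Sample negatives from the top-k entries of one query's ranked list."""
    # Keep only the top-k retrieved docids, then drop the known positives
    # (assumption: a judged-relevant passage should never serve as a negative).
    candidates = [d for d in ranked_docids[:top_k] if d not in positive_docids]
    rng = random.Random(seed)
    # Sample without replacement; return fewer if too few candidates remain.
    return rng.sample(candidates, min(num_negatives, len(candidates)))


# Hypothetical BM25 ranking for one query: 40 docids, best first.
ranking = [f"doc{i}" for i in range(40)]
positives = {"doc2", "doc35"}
negatives = sample_negatives(ranking, positives, num_negatives=5)
```

Passages ranked highly by BM25 but not judged relevant tend to be lexically close to the query, which is why a top-k cutoff is a common source of hard negatives.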

Training set example

```json
{
  "query_id": "1",
  "query": "When was quantum field theory developed?",
  "positive_passages": [
    {
      "docid": "25267#12",
      "title": "Quantum field theory",
      "text": "Quantum field theory naturally began with the study of electromagnetic interactions, as the electromagnetic field was the only known classical field as of the 1920s."
    },
    ...
  ],
  "negative_passages": [
    {
      "docid": "346489#8",
      "title": "Local quantum field theory",
      "text": "More recently, the approach has been further implemented to include an algebraic version of quantum field ..."
    },
    ...
  ]
}
```
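One common way to consume a training record of this shape is to expand it into (query, positive, negative) text triples for contrastive training. The following is a minimal sketch under that assumption; `iter_triples` and its `max_negatives` parameter are hypothetical names, and the sample record simply repeats the fields shown above with the elided entries omitted.

```python
from typing import Any, Dict, Iterator, Optional, Tuple


def iter_triples(
    example: Dict[str, Any], max_negatives: Optional[int] = None
) -> Iterator[Tuple[str, str, str]]:
    """Yield (query, positive_text, negative_text) for one training record."""
    negatives = example["negative_passages"]
    if max_negatives is not None:
        # Optionally cap how many negatives are paired with each positive.
        negatives = negatives[:max_negatives]
    for positive in example["positive_passages"]:
        for negative in negatives:
            yield example["query"], positive["text"], negative["text"]


# Record copied from the example above; the "..." entries are left out.
train_example = {
    "query_id": "1",
    "query": "When was quantum field theory developed?",
    "positive_passages": [
        {
            "docid": "25267#12",
            "title": "Quantum field theory",
            "text": "Quantum field theory naturally began with the study of "
            "electromagnetic interactions, as the electromagnetic field was "
            "the only known classical field as of the 1920s.",
        }
    ],
    "negative_passages": [
        {
            "docid": "346489#8",
            "title": "Local quantum field theory",
            "text": "More recently, the approach has been further implemented "
            "to include an algebraic version of quantum field ...",
        }
    ],
}

triples = list(iter_triples(train_example))
```

Each positive is paired with every retained negative, so the number of triples per record is the product of the two list lengths.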

dev and test set example

```json
{
  "query_id": "0",
  "query": "Is Creole a pidgin of French?",
  "positive_passages": [
    {
      "docid": "3716905#1",
      "title": "",
      "text": ""
    },
    ...
  ]
}
```
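Because dev and test records carry only the docid of each positive passage, the title and text have to be looked up in the separate corpus. Below is a minimal sketch of that join, assuming corpus records expose `docid`, `title`, and `text` fields like the passages above. The function name `fill_positive_passages` and the stand-in corpus contents are hypothetical placeholders, not real corpus data.

```python
from typing import Any, Dict, List


def fill_positive_passages(
    example: Dict[str, Any], corpus_by_docid: Dict[str, Dict[str, str]]
) -> Dict[str, Any]:
    """Return a copy of `example` with positive passages filled from the corpus."""
    filled: List[Dict[str, str]] = []
    for passage in example["positive_passages"]:
        record = corpus_by_docid.get(passage["docid"])
        if record is None:
            # Unknown docid: keep the original (empty) entry unchanged.
            filled.append(dict(passage))
        else:
            filled.append(
                {
                    "docid": passage["docid"],
                    "title": record["title"],
                    "text": record["text"],
                }
            )
    return {**example, "positive_passages": filled}


# Stand-in corpus with placeholder strings (not actual corpus content).
corpus_records = [
    {"docid": "3716905#1", "title": "<placeholder title>", "text": "<placeholder text>"},
]
corpus_by_docid = {record["docid"]: record for record in corpus_records}

dev_example = {
    "query_id": "0",
    "query": "Is Creole a pidgin of French?",
    "positive_passages": [{"docid": "3716905#1", "title": "", "text": ""}],
}
resolved = fill_positive_passages(dev_example, corpus_by_docid)
```

For the full corpus, a dictionary keyed by docid may be too large for memory; an on-disk index would be a reasonable substitute for `corpus_by_docid`.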

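Loading the dataset

Example of loading the dataset:

```python
from datasets import load_dataset

language = 'english'

# Load all train, dev and test sets
dataset = load_dataset('castorini/mr-tydi', language)

# Or load a specific set (the split must be passed by keyword)
set_name = 'train'
dataset = load_dataset('castorini/mr-tydi', language, split=set_name)
```

Note that the combined option has only the train set.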
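Since the combined configuration has only a train split, it can help to validate the configuration and split pair before requesting it. The sketch below is one way to do that; `available_splits`, `validate_split`, and `load_split` are hypothetical helper names, and `load_split` has not been tested against any particular `datasets` version, which may require additional arguments.

```python
from typing import Tuple

PER_LANGUAGE_SPLITS: Tuple[str, ...] = ("train", "dev", "test")


def available_splits(config: str) -> Tuple[str, ...]:
    """Splits offered by a configuration, per the dataset description."""
    # 'combined' merges the training data of all languages and has no dev/test.
    return ("train",) if config == "combined" else PER_LANGUAGE_SPLITS


def validate_split(config: str, split: str) -> None:
    """Raise ValueError if `split` does not exist for `config`."""
    if split not in available_splits(config):
        raise ValueError(
            f"config {config!r} has no {split!r} split; "
            f"choose from {available_splits(config)}"
        )


def load_split(config: str, split: str):
    """Validate, then load one split (requires the `datasets` package)."""
    validate_split(config, split)
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("castorini/mr-tydi", config, split=split)
```

For example, `validate_split("combined", "dev")` raises a `ValueError` before any download is attempted, whereas `validate_split("english", "dev")` returns silently.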

Citation information

```plaintext
@article{mrtydi,
  title={{Mr. TyDi}: A Multi-lingual Benchmark for Dense Retrieval},
  author={Xinyu Zhang and Xueguang Ma and Peng Shi and Jimmy Lin},
  year={2021},
  journal={arXiv:2108.08787},
}
```
