
orgrctera/beir_msmarco

Source: Hugging Face. Updated 2026-03-20; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/orgrctera/beir_msmarco
Description:
---
language:
- en
license: other
tags:
- retrieval
- text-retrieval
- beir
- ms-marco
- passage-ranking
- information-retrieval
- benchmark
pretty_name: BEIR MS MARCO (retrieval)
size_categories:
- "100K<n<1M"
task_categories:
- text-retrieval
---

# MS MARCO (BEIR): Passage retrieval

## Dataset description

**MS MARCO** (MicroSoft MAchine Reading COmprehension) is a large-scale resource built from real **Bing** search queries, web documents, and human-authored answers. It has become a standard benchmark for **neural information retrieval** and **passage ranking** under abundant training data: hundreds of thousands of queries with human relevance signals, aligned with realistic web search behavior.

**BEIR** (Benchmarking IR) is a heterogeneous benchmark for **zero-shot** evaluation of retrieval models across many domains and tasks. The BEIR release repackages **MS MARCO passage ranking** (queries, the official passage corpus identifiers, and **qrels**) in a unified format so that MS MARCO can be evaluated alongside 15+ other datasets with the same tooling and metrics.

This repository (`orgrctera/beir_msmarco`) provides **train**, **dev**, and **test** splits in **Parquet** form for CTERA-style retrieval evaluation. Each row is one **query** with **relevance judgments** pointing at **MS MARCO passage IDs** (the same string IDs used in the official `collection.tsv` and BEIR distributions).

### Splits in this dataset

| Split | Rows (this repo) | Role |
|--------|------------------|------|
| `train` | 502,939 | Large-scale supervision (typically one or more labeled passage IDs per query; scores follow MS MARCO / BEIR conventions). |
| `dev` | 6,980 | Standard **BEIR MS MARCO dev** evaluation set (queries with qrels; aligns with common BEIR leaderboard setups). |
| `test` | 43 | Small held-out evaluation with **candidate lists** and **graded** relevance scores (0–3) in `expected_output` for reranking-style or multi-candidate evaluation. |

**Corpus:** MS MARCO uses a fixed corpus of ~8.8M short **passages**. Full retrieval experiments require joining passage IDs to text via the official [MS MARCO passage collection](https://microsoft.github.io/msmarco/Datasets.html) or a BEIR mirror (e.g. [`BeIR/msmarco`](https://huggingface.co/datasets/BeIR/msmarco) on the Hub).

## Task: retrieval (MS MARCO passage ranking, BEIR version)

The task is **ad hoc passage retrieval** (passage ranking):

1. **Input:** a natural-language **query** (information need from real search logs).
2. **Output:** a ranked list of **passage IDs** from the MS MARCO collection, or scores over the full index, such that **relevant** passages, according to the official qrels, receive high rank.

Downstream metrics are standard IR metrics used in MS MARCO and BEIR (e.g. **MRR@10**, **nDCG@10**, **Recall@k**), depending on the evaluation script and whether you run **full retrieval** or **reranking** over a candidate set.

> **Note:** This Hub dataset stores **queries and labels** (passage IDs + scores). You still need the **passage corpus** keyed by the same IDs for indexing, embedding, or BM25 baselines.

## Data format (this repository)

Each record includes:

| Field | Description |
|--------|-------------|
| `id` | UUID for this example row. |
| `input` | The **query** text. |
| `expected_output` | JSON string: list of objects `{"id": "<passage-pid>", "score": <relevance>}`. Passage IDs are **MS MARCO passage pids** (strings). **Train/dev** often use binary relevance (`1` = relevant); **test** rows in this repo include graded scores (0–3) over a candidate list. |
| `metadata.query_id` | Original MS MARCO / BEIR query identifier (string). |
| `metadata.split` | Split name: `train`, `dev`, or `test`. |

## Examples

### Example 1 (dev, single relevant passage)

```json
{
  "id": "c61c30e4-d7dc-464a-b168-2f85a11f896c",
  "input": "how many years did william bradford serve as governor of plymouth colony?",
  "expected_output": "[{\"id\": \"7067032\", \"score\": 1}]",
  "metadata.query_id": "300674",
  "metadata.split": "dev"
}
```

### Example 2 (train, query with one labeled passage)

```json
{
  "id": "0e8bc96c-daa7-467d-b28a-180893c6946c",
  "input": "temperature sorrento italy september",
  "expected_output": "[{\"id\": \"14410\", \"score\": 1}]",
  "metadata.query_id": "512836",
  "metadata.split": "train"
}
```

## References

### MS MARCO (original dataset)

**Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang.** *MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.* arXiv:1611.09268, 2016.

**Abstract (short):** MS MARCO introduces a large-scale machine reading comprehension dataset built from real Bing queries, with human-generated answers and millions of passages from retrieved web documents. The work defines multiple tasks of varying difficulty, including **passage ranking**, establishing MS MARCO as a benchmark for realistic, large-scale QA and IR research.

- Paper: [arXiv:1611.09268](https://arxiv.org/abs/1611.09268)

```bibtex
@article{bajaj2016ms,
  title={Ms marco: A human generated machine reading comprehension dataset},
  author={Bajaj, Payal and Campos, Daniel and Craswell, Nick and Deng, Li and Gao, Jianfeng and Liu, Xiaodong and Majumder, Rangan and McNamara, Andrew and Mitra, Bhaskar and Nguyen, Tri and others},
  journal={arXiv preprint arXiv:1611.09268},
  year={2016}
}
```

### MS MARCO ranking evaluation (leaderboards & validity)

**Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin.** *MS MARCO: Benchmarking Ranking Models in the Large-Data Regime.* arXiv:2105.04021, 2021.

**Abstract (from arXiv):** *“Evaluation efforts such as TREC, CLEF, NTCIR and FIRE, alongside public leaderboard such as MS MARCO, are intended to encourage research and track our progress, addressing big questions in our field. However, the goal is not simply to identify which run is "best", achieving the top score. The goal is to move the field forward by developing new robust techniques, that work in many different settings, and are adopted in research and practice. This paper uses the MS MARCO and TREC Deep Learning Track as our case study, comparing it to the case of TREC ad hoc ranking in the 1990s. We show how the design of the evaluation effort can encourage or discourage certain outcomes, and raising questions about internal and external validity of results. We provide some analysis of certain pitfalls, and a statement of best practices for avoiding such pitfalls. We summarize the progress of the effort so far, and describe our desired end state of "robust usefulness", along with steps that might be required to get us there.”*

- Paper: [arXiv:2105.04021](https://arxiv.org/abs/2105.04021)

### BEIR benchmark (MS MARCO as a subset)

**Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, Iryna Gurevych.** *BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.* NeurIPS 2021 (Datasets and Benchmarks Track).

**Abstract (from arXiv):** *“Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to facilitate researchers to broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems including lexical, sparse, dense, late-interaction and re-ranking architectures on the BEIR benchmark. Our results show BM25 is a robust baseline and re-ranking and late-interaction-based models on average achieve the best zero-shot performances, however, at high computational costs. In contrast, dense and sparse-retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities.”*

- Paper: [arXiv:2104.08663](https://arxiv.org/abs/2104.08663), [OpenReview](https://openreview.net/forum?id=wCu6T5xFjeJ); code and data: [BEIR on GitHub](https://github.com/beir-cellar/beir).

```bibtex
@article{thakur2021beir,
  title={BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
  author={Thakur, Nandan and Reimers, Nils and R{\"u}ckl{\'e}, Andreas and Srivastava, Abhishek and Gurevych, Iryna},
  journal={arXiv preprint arXiv:2104.08663},
  year={2021}
}
```

### Official resources

- [MS MARCO: Datasets for Document and Passage Ranking](https://microsoft.github.io/msmarco/Datasets.html): corpus, qrels, terms of use.
- [MSMARCO-Passage-Ranking (GitHub)](https://github.com/microsoft/MSMARCO-Passage-Ranking)

### Related Hugging Face mirrors (BEIR layout)

- [`BeIR/msmarco`](https://huggingface.co/datasets/BeIR/msmarco): corpus / queries / qrels in classic BEIR form.
- [`irds/beir_msmarco_dev`](https://huggingface.co/datasets/irds/beir_msmarco_dev): ir-datasets packaging of BEIR MS MARCO dev.

## License and terms

The **underlying MS MARCO data** is subject to Microsoft’s **terms for non-commercial research** as published on the [official MS MARCO Datasets page](https://microsoft.github.io/msmarco/Datasets.html).
Review **Terms and Conditions** before use in products or redistribution. This card marks **`license: other`** to reflect MS MARCO’s upstream terms (not a single SPDX code). When publishing results, cite **MS MARCO** and **BEIR** as appropriate.

## Citation

If you use **MS MARCO**, cite Bajaj et al. (2016). If you use the **BEIR** benchmark formulation, cite Thakur et al. (2021). BibTeX for BEIR is also available in the [official BEIR repository](https://github.com/beir-cellar/beir).

---

*Dataset card maintained for the `orgrctera/beir_msmarco` Hub repository.*
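
As an illustration of the data format and metrics described in this card, the following is a minimal sketch of decoding the `expected_output` JSON string into qrels and scoring a ranked list with MRR@10 and nDCG@10. The two rows are copied from the examples in this card. Everything else is an assumption made for illustration: the helper names (`parse_qrels`, `reciprocal_rank_at_k`, `ndcg_at_k`), the `runs` dictionary, and the non-relevant passage IDs in it are hypothetical, and the nDCG variant shown (linear gain, log2 discount) may differ in detail from the evaluation script you use.

```python
import json
import math

# Two rows copied from the examples in this card (fields trimmed to those used here).
rows = [
    {
        "input": "how many years did william bradford serve as governor of plymouth colony?",
        "expected_output": "[{\"id\": \"7067032\", \"score\": 1}]",
        "metadata.query_id": "300674",
    },
    {
        "input": "temperature sorrento italy september",
        "expected_output": "[{\"id\": \"14410\", \"score\": 1}]",
        "metadata.query_id": "512836",
    },
]


def parse_qrels(row):
    """Decode the JSON string in `expected_output` into {passage_id: score}."""
    return {item["id"]: item["score"] for item in json.loads(row["expected_output"])}


def reciprocal_rank_at_k(ranked_ids, qrels, k=10):
    """Return 1/rank of the first passage with score > 0 in the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if qrels.get(pid, 0) > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k with linear gain and log2 discount (one common convention)."""
    dcg = sum(qrels.get(pid, 0) / math.log2(i + 2) for i, pid in enumerate(ranked_ids[:k]))
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical system output: ranked passage IDs per query ID.
# IDs other than the labeled ones are made up for illustration.
runs = {
    "300674": ["7067032", "111", "222"],  # relevant passage at rank 1
    "512836": ["333", "14410", "444"],    # relevant passage at rank 2
}

qrels_by_query = {row["metadata.query_id"]: parse_qrels(row) for row in rows}
mrr = sum(reciprocal_rank_at_k(runs[q], qrels_by_query[q]) for q in runs) / len(runs)
ndcg = sum(ndcg_at_k(runs[q], qrels_by_query[q]) for q in runs) / len(runs)
print(f"MRR@10 = {mrr:.3f}, nDCG@10 = {ndcg:.3f}")
```

In practice the rows would come from the Parquet splits rather than a hand-written list, and the ranked IDs would come from a retriever indexed over the separate passage corpus; for reported numbers, an established evaluation tool is a safer choice than hand-rolled metrics.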
Provider:
orgrctera