orgrctera/msmarco_document_ranking
Hugging Face · Updated 2026-03-20 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/orgrctera/msmarco_document_ranking
Resource description:
---
license: other
task_categories:
- text-retrieval
language:
- en
tags:
- information-retrieval
- document-ranking
- msmarco
- benchmark
- retrieval
pretty_name: MS MARCO Document Ranking (CTERA format)
size_categories:
- "100K<n<1M"
---
# MS MARCO Document Ranking
## Dataset description and background
**MS MARCO** (Microsoft MAchine Reading COmprehension) is a large-scale collection originally introduced for machine reading comprehension and question answering. It has since become a standard benchmark for **information retrieval** in the large-training-data regime: hundreds of thousands of queries with human relevance signals, drawn from real web search behavior.
The **document ranking** track uses a **document-level corpus** derived from the same ecosystem as MS MARCO passage ranking. The official resource describes a corpus of roughly **3.2M documents**, **hundreds of thousands of training queries**, **development** and **leaderboard test** query sets, and **TREC-style qrels** (query–relevance judgments) for training and development. For training, **passage-level labels are mapped to document IDs**: a document that contains a judged-relevant passage is itself treated as relevant. This assumption supports transfer between passage-focused and document-focused retrieval research.
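The passage-to-document label mapping described above can be sketched as follows. This is a minimal illustration, not the official conversion: the `passage_to_doc` lookup and all ids in the example are hypothetical, and this dataset does not ship passage-level data.

```python
from collections import defaultdict


def passage_qrels_to_doc_qrels(passage_qrels, passage_to_doc):
    """Lift passage-level relevance labels to document level.

    passage_qrels: dict mapping query id -> iterable of relevant passage ids.
    passage_to_doc: dict mapping passage id -> id of the containing document
                    (hypothetical lookup, assumed to be available separately).
    """
    doc_qrels = defaultdict(set)
    for qid, passage_ids in passage_qrels.items():
        for pid in passage_ids:
            doc_id = passage_to_doc.get(pid)
            if doc_id is not None:  # skip passages with no known source document
                doc_qrels[qid].add(doc_id)
    # Sets collapse several relevant passages from one document into one label.
    return {qid: sorted(docs) for qid, docs in doc_qrels.items()}


# Toy ids, invented for illustration only.
passage_qrels = {"q1": ["p10", "p11"], "q2": ["p20"]}
passage_to_doc = {"p10": "D1", "p11": "D1", "p20": "D7"}
doc_qrels = passage_qrels_to_doc_qrels(passage_qrels, passage_to_doc)
print(doc_qrels)
```

Two relevant passages from the same document yield a single document-level label, which is why document qrels are typically sparser than the passage qrels they derive from.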
This Hugging Face dataset is a **CTERA-packaged view** of that task: each row pairs a **natural-language query** with **structured labels/metadata** suitable for retrieval and RAG benchmarking (see [Data fields](#data-fields) below).
**Official MS MARCO ranking resources (corpus, qrels, leaderboards):** [MS MARCO — Datasets for Document and Passage Ranking](https://microsoft.github.io/msmarco/Datasets.html)
## Task: document ranking / retrieval
This dataset supports **ad-hoc document retrieval / ranking**: given a **query**, a system should **rank documents** from a collection by **relevance**. In research settings this is often split into:
- **Full ranking (retrieval):** score or retrieve from the **full document collection** (official submissions allow a bounded number of documents per query, e.g. up to **100** in the MS MARCO document ranking setup).
- **Top‑k reranking:** rerank a fixed candidate list (e.g. **top‑100** candidates from a first-stage retriever)—a common production pattern of “retrieve, then rerank.”
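The two settings above can be combined into a "retrieve, then rerank" pipeline. The sketch below is a toy version under stated assumptions: the term-overlap first stage and the `density_score` reranker are simple stand-ins invented for illustration (a real system would typically use something like BM25 and a learned reranker), and the corpus is made up.

```python
def tokenize(text):
    """Lowercase whitespace tokenization; deliberately naive."""
    return text.lower().split()


def first_stage(query, corpus, k):
    """Cheap candidate generation: rank documents by number of shared terms."""
    q_terms = set(tokenize(query))
    scored = []
    for doc_id, text in corpus.items():
        overlap = len(q_terms & set(tokenize(text)))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Score descending, doc id ascending so ties are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def density_score(query, text):
    """Stand-in reranker: fraction of document tokens that are query terms."""
    q_terms = set(tokenize(query))
    tokens = tokenize(text)
    return sum(t in q_terms for t in tokens) / len(tokens) if tokens else 0.0


def rerank(query, candidates, corpus, score_fn, max_results=100):
    """Rescore a fixed candidate list and keep at most max_results documents."""
    scored = [(score_fn(query, corpus[doc_id]), doc_id) for doc_id in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:max_results]]


# Invented toy corpus; the D-prefixed ids only mimic the dataset's id style.
corpus = {
    "D1": "manhattan project history and the immediate impact of its success",
    "D2": "project management tips",
    "D3": "weather in manhattan today",
    "D4": "the manhattan project",
}
query = "impact of the manhattan project"
candidates = first_stage(query, corpus, k=100)
ranking = rerank(query, candidates, corpus, density_score)
print(candidates, ranking)
```

The `max_results=100` default mirrors the per-query cap mentioned for official submissions. Swapping `density_score` for a stronger scorer leaves the rest of the pipeline unchanged.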
**Evaluation** on the MS MARCO document ranking leaderboard is reported with **MRR@100** (Mean Reciprocal Rank over the top 100 results); MRR@10 is the metric of the passage ranking leaderboard. Standard TREC-style analyses are reported alongside where applicable. (Consult the official leaderboard and TREC Deep Learning track materials for the exact metric definitions used in a given campaign.)
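As a rough illustration of the reciprocal-rank metric, the sketch below computes MRR with a cutoff `k` left as a parameter, so it can be set to whatever the target leaderboard specifies. It assumes a run is a mapping from query id to a ranked list of document ids and averages over every query that has judgments. The official evaluation script may handle edge cases differently, and the run shown here is invented.

```python
def reciprocal_rank(ranked_doc_ids, relevant_doc_ids, k):
    """1/rank of the first relevant document within the top k, else 0."""
    relevant = set(relevant_doc_ids)
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mrr_at_k(run, qrels, k):
    """Mean reciprocal rank over all judged queries.

    run:   dict mapping query id -> ranked list of doc ids (best first).
    qrels: dict mapping query id -> list of relevant doc ids.
    Queries missing from the run contribute a reciprocal rank of 0.
    """
    if not qrels:
        return 0.0
    total = sum(reciprocal_rank(run.get(qid, []), rel, k) for qid, rel in qrels.items())
    return total / len(qrels)


# Judgments taken from the sample rows on this card; the run is a made-up example.
qrels = {"174249": ["D1987644"], "1185869": ["D59219"]}
run = {
    "174249": ["D5", "D1987644", "D9"],  # relevant document at rank 2
    "1185869": ["D1", "D2", "D3"],       # relevant document not retrieved
}
score = mrr_at_k(run, qrels, k=10)
print(score)
```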
## Data fields
Parquet splits expose three columns:
| Column | Description |
|--------|-------------|
| `input` | The query text (string). |
| `expected_output` | JSON string with relevance information; **format differs by split** (see examples). |
| `metadata` | JSON string with identifiers and benchmark tags (`benchmark_name`, `split`, `query_id`, etc.). |
**Splits:** `train`, `dev`, and `test` are provided. The **test** split may use **held-out** or **empty** labels in `expected_output` for blind evaluation workflows—check the sample below.
## Examples
**Train** (`expected_output` includes query id and judged relevant document id(s) as JSON):
```json
{
"input": ")what was the immediate impact of the success of the manhattan project?",
"expected_output": "{\"qid\": \"1185869\", \"qrels\": [\"D59219\"]}",
"metadata": "{\"query_id\": \"1185869\", \"split\": \"train\", \"benchmark_name\": \"msmarco_document_ranking\", \"benchmark_type\": \"base_rag\", \"sub_benchmark\": \"document_ranking\"}"
}
```
**Dev** (`expected_output` is a JSON-encoded list of relevant document ids):
```json
{
"input": "does xpress bet charge to deposit money in your account",
"expected_output": "[\"D1987644\"]",
"metadata": "{\"query_id\": \"174249\", \"split\": \"dev\", \"benchmark_name\": \"msmarco_document_ranking\", \"benchmark_type\": \"base_rag\", \"sub_benchmark\": \"document_ranking\"}"
}
```
**Test** (labels may be withheld; example shows an empty list):
```json
{
"input": "how to display how.close you are to.cell.tower",
"expected_output": "[]",
"metadata": "{\"query_id\": \"355339\", \"split\": \"test\", \"benchmark_name\": \"msmarco_document_ranking\", \"benchmark_type\": \"base_rag\", \"sub_benchmark\": \"document_ranking\"}"
}
```
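Because `expected_output` is a JSON string whose shape differs by split, a small helper can normalize it to a plain list of document ids. The sketch below assumes only the two shapes shown in the samples above (an object with a `qrels` key, or a bare list). The helper names are illustrative, and the `metadata` strings are shortened copies of the samples.

```python
import json


def relevant_doc_ids(row):
    """Return the relevant document ids encoded in a row's expected_output."""
    payload = json.loads(row["expected_output"])
    if isinstance(payload, dict):  # train-style: {"qid": ..., "qrels": [...]}
        return list(payload.get("qrels", []))
    if isinstance(payload, list):  # dev/test-style: [...], possibly empty
        return list(payload)
    raise ValueError(f"unexpected expected_output payload: {payload!r}")


def query_id(row):
    """Return the query id stored in the row's JSON-encoded metadata."""
    return json.loads(row["metadata"])["query_id"]


train_row = {
    "input": ")what was the immediate impact of the success of the manhattan project?",
    "expected_output": '{"qid": "1185869", "qrels": ["D59219"]}',
    "metadata": '{"query_id": "1185869", "split": "train"}',
}
dev_row = {
    "input": "does xpress bet charge to deposit money in your account",
    "expected_output": '["D1987644"]',
    "metadata": '{"query_id": "174249", "split": "dev"}',
}
test_row = {
    "input": "how to display how.close you are to.cell.tower",
    "expected_output": "[]",
    "metadata": '{"query_id": "355339", "split": "test"}',
}

for row in (train_row, dev_row, test_row):
    print(query_id(row), relevant_doc_ids(row))
```

Rows obtained through the Hugging Face `datasets` library are dict-like, so the same helpers should apply there, assuming the column names match the table in the Data fields section.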
## References
### Foundational MS MARCO paper (cite when using MS MARCO-derived data)
**Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang.** *MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.* arXiv:1611.09268, 2016.
**Abstract (short):** MS MARCO introduces a large-scale machine reading comprehension dataset built from real Bing queries, with human-generated answers and millions of passages from retrieved web documents. The work defines multiple tasks of varying difficulty, including passage ranking—establishing MS MARCO as a benchmark for realistic, large-scale QA and IR research.
- Paper: [arXiv:1611.09268](https://arxiv.org/abs/1611.09268)
```bibtex
@article{bajaj2016ms,
title={{MS MARCO}: A Human Generated {MAchine} Reading {COmprehension} Dataset},
author={Bajaj, Payal and Campos, Daniel and Craswell, Nick and Deng, Li and Gao, Jianfeng and Liu, Xiaodong and Majumder, Rangan and McNamara, Andrew and Mitra, Bhaskar and Nguyen, Tri and others},
journal={arXiv preprint arXiv:1611.09268},
year={2016}
}
```
### Official dataset and leaderboard documentation
- [MS MARCO — Datasets for Document and Passage Ranking Leaderboards](https://microsoft.github.io/msmarco/Datasets.html) — corpus files, qrels, submission formats, and task description.
- [MS MARCO — Submission / evaluation](https://microsoft.github.io/msmarco/Submission.html) — ranking submission conventions.
- [TREC Deep Learning Track](https://microsoft.github.io/msmarco/TREC-Deep-Learning) — blind evaluation and community benchmarks related to MS MARCO ranking tasks.
### Related code and corpora (Microsoft)
- [microsoft/MSMARCO-Document-Ranking](https://github.com/microsoft/MSMARCO-Document-Ranking) — pointers and tooling around the document ranking collection.
## Terms and licensing
The **underlying MS MARCO data** is subject to Microsoft’s **terms for non-commercial research** as published on the official MS MARCO site; review the **Terms and Conditions** on [the official datasets page](https://microsoft.github.io/msmarco/Datasets.html) before using the data in products or redistributing it.
This Hugging Face dataset card describes the **CTERA-formatted** release; when publishing results, cite **MS MARCO** appropriately and follow the original **license / usage** constraints for the source corpus and judgments.
## Acknowledgments
Dataset packaging for this repository is maintained by **CTERA**. **MS MARCO** is provided by **Microsoft** and the broader IR community; see the official site for credits and contact information.
Provider:
orgrctera



