orgrctera/msmarco_passage_ranking
Source: Hugging Face. Updated 2026-03-20; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/orgrctera/msmarco_passage_ranking
---
language:
- en
license: other
tags:
- retrieval
- passage-ranking
- information-retrieval
- ms-marco
- benchmark
pretty_name: MS MARCO Passage Ranking (CTERA RAG benchmark)
size_categories:
- "10K<n<100K"
---
# MS MARCO Passage Ranking
## Dataset description
[MS MARCO](https://microsoft.github.io/msmarco/) (**M**icro**S**oft **MA**chine **R**eading **CO**mprehension) is a large-scale collection built for machine reading comprehension and information retrieval research. The original release introduced more than one million real user questions sampled from Bing search logs, paired with passages drawn from web documents, and human-authored answers where applicable.
The **passage ranking** track uses a fixed corpus of short text passages and asks systems to identify which passages are likely to *contain an answer* to a natural-language query. It is one of the most widely used benchmarks for training and evaluating **dense retrieval**, **late interaction**, and **reranking** models in open-domain QA and neural IR.
This Hugging Face dataset (`orgrctera/msmarco_passage_ranking`) repackages a **dev** subset of the MS MARCO passage-ranking task into a simple tabular format for **CTERA AI RAG** evaluation (`benchmark_type: base_rag`). Each row is one query; labels reference **passage IDs** (`pid`) from the official MS MARCO passage collection, not the passage text inline.
For the authoritative corpus files, qrels, and leaderboard definitions, see the [MS MARCO ranking datasets page](https://microsoft.github.io/msmarco/Datasets.html) and the [MSMARCO-Passage-Ranking](https://github.com/microsoft/MSMARCO-Passage-Ranking) GitHub repository.
## Task: passage ranking / retrieval
**Passage ranking** is an information retrieval task:
- **Input:** a query string (often an information need phrased as a question or short phrase).
- **Output:** a ranking of passages from a large collection by estimated relevance—specifically, the likelihood that a passage *contains information sufficient to answer* the query.
- **Typical setup:** models score query–passage pairs or build indexable embeddings over the passage collection; **full retrieval** ranks the entire corpus, while **reranking** reorders a smaller candidate list (e.g., top-1000 BM25 hits).
Standard leaderboard metrics for MS MARCO passage ranking include **MRR@10**: the reciprocal rank of the first relevant passage within the top 10 results (0 if none appears there), averaged over all queries. The original release provides human relevance judgments (qrels) for training, and many pipelines also train on **triples** (query, positive passage, negative passage) for contrastive learning.
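As a rough illustration of the metric, the sketch below computes MRR@10 from ranked lists of passage IDs. The function names and the `run`/`qrels` dictionaries are assumptions made for this example and are not part of the official MS MARCO evaluation script; the non-relevant pids are made up.

```python
from typing import Dict, List, Set


def reciprocal_rank_at_k(ranked_pids: List[str], relevant: Set[str], k: int = 10) -> float:
    """Return 1/rank of the first relevant passage in the top k, or 0.0 if none."""
    for rank, pid in enumerate(ranked_pids[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def mrr_at_k(run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Average the reciprocal rank over every query in qrels.

    Queries missing from `run` contribute 0.0 (an assumption of this sketch).
    """
    if not qrels:
        return 0.0
    total = sum(
        reciprocal_rank_at_k(run.get(qid, []), relevant, k)
        for qid, relevant in qrels.items()
    )
    return total / len(qrels)


# Toy data: the query IDs and relevant pids come from this card's examples;
# every other pid is a made-up placeholder.
qrels = {"1048578": {"7187234"}, "1048579": {"7187227"}}
run = {
    "1048578": ["111", "7187234", "222"],  # relevant passage at rank 2
    "1048579": ["333", "444", "555"],      # relevant passage not retrieved
}
print(mrr_at_k(run, qrels))  # (1/2 + 0) / 2 = 0.25
```

Treating unanswered queries as 0.0 keeps the denominator fixed at the number of judged queries, so systems that skip hard queries are not rewarded.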
**Important:** This Hub dataset stores **expected passage IDs** as the supervision signal. Full RAG or retrieval experiments still require joining `pid` to passage text via the official `collection.tsv` (or another mirror of the MS MARCO passage corpus).
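The join described above can be sketched as follows, assuming `collection.tsv` is a tab-separated file with one `pid<TAB>passage` pair per line. The helper name `load_passages` is hypothetical, and the demo writes a tiny stand-in file with placeholder text rather than real MS MARCO passages.

```python
import os
import tempfile
from typing import Dict, Iterable


def load_passages(collection_path: str, wanted_pids: Iterable[str]) -> Dict[str, str]:
    """Stream a pid<TAB>passage file and keep only the requested passages."""
    wanted = set(wanted_pids)
    found: Dict[str, str] = {}
    if not wanted:
        return found
    with open(collection_path, encoding="utf-8") as handle:
        for line in handle:
            pid, sep, text = line.rstrip("\n").partition("\t")
            if not sep:
                continue  # skip lines without a tab separator
            if pid in wanted:
                found[pid] = text
                if len(found) == len(wanted):
                    break  # stop early once every pid is resolved
    return found


# Demo on a two-line stand-in for collection.tsv (placeholder passage text).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "collection.tsv")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("7187227\tplaceholder passage A\n")
        handle.write("7187234\tplaceholder passage B\n")
    passages = load_passages(path, ["7187234"])

print(passages)  # {'7187234': 'placeholder passage B'}
```

Streaming the file and filtering by pid avoids holding the full passage collection in memory when only the labelled passages are needed.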
## Data format (this repository)
The data is stored as Parquet files with three string columns:
| Column | Description |
|--------|-------------|
| `input` | The query text shown to the system. |
| `expected_output` | JSON array of relevant **passage IDs** as strings, e.g. `["7187234"]`. |
| `metadata` | JSON object with fields such as `query_id`, `split`, `benchmark_name`, `benchmark_type`, and `sub_benchmark`. |
The available splits and their row counts are determined by the Parquet files published under `data/` in this dataset repository (for example, the dev split).
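Because `expected_output` and `metadata` are JSON encoded as strings, each row needs to be decoded before use. The following is a minimal sketch using only the standard library; `parse_row` is a hypothetical helper, and the row literal mirrors values shown on this card with a shortened `metadata` object.

```python
import json
from typing import Any, Dict, List, Tuple


def parse_row(row: Dict[str, str]) -> Tuple[str, List[str], Dict[str, Any]]:
    """Decode one row into (query text, relevant pids, metadata dict)."""
    pids = json.loads(row["expected_output"])
    metadata = json.loads(row["metadata"])
    # Normalise pids to strings so they compare cleanly with collection keys.
    return row["input"], [str(pid) for pid in pids], metadata


# A row shaped like the ones in this repository (metadata shortened for brevity).
row = {
    "input": "what is pcnt",
    "expected_output": '["7187227"]',
    "metadata": '{"query_id": "1048579", "split": "dev"}',
}
query, pids, metadata = parse_row(row)
print(query, pids, metadata["query_id"])  # what is pcnt ['7187227'] 1048579
```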
## Examples
**Example 1**
- `input`: `cost of endless pools/swim spa`
- `expected_output`: `["7187234"]`
- `metadata` (illustrative): `{"query_id": "1048578", "split": "dev", "benchmark_name": "msmarco_passage_ranking", "benchmark_type": "base_rag", "sub_benchmark": "passage_ranking"}`
**Example 2**
- `input`: `what is pcnt`
- `expected_output`: `["7187227"]`
- `metadata` (illustrative): `{"query_id": "1048579", "split": "dev", "benchmark_name": "msmarco_passage_ranking", "benchmark_type": "base_rag", "sub_benchmark": "passage_ranking"}`
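Rows like the two above can be collected into a qrels-style mapping for evaluation tools. The sketch below builds a `query_id -> {pid}` dictionary and renders it in the whitespace-separated TREC qrels layout (`qid iteration docid relevance`). Both function names are assumptions for illustration, and the `metadata` strings are shortened to the fields used here.

```python
import json
from typing import Dict, Iterable, List, Set


def rows_to_qrels(rows: Iterable[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Map each query_id to the set of passage IDs labelled relevant."""
    qrels: Dict[str, Set[str]] = {}
    for row in rows:
        qid = str(json.loads(row["metadata"])["query_id"])
        pids = json.loads(row["expected_output"])
        qrels.setdefault(qid, set()).update(str(pid) for pid in pids)
    return qrels


def qrels_to_trec_lines(qrels: Dict[str, Set[str]]) -> List[str]:
    """Render qrels as 'qid 0 pid 1' lines, sorted for reproducible output."""
    return [
        f"{qid} 0 {pid} 1"
        for qid in sorted(qrels)
        for pid in sorted(qrels[qid])
    ]


rows = [
    {
        "input": "cost of endless pools/swim spa",
        "expected_output": '["7187234"]',
        "metadata": '{"query_id": "1048578", "split": "dev"}',
    },
    {
        "input": "what is pcnt",
        "expected_output": '["7187227"]',
        "metadata": '{"query_id": "1048579", "split": "dev"}',
    },
]
qrels_lines = qrels_to_trec_lines(rows_to_qrels(rows))
print("\n".join(qrels_lines))
```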
## References and further reading
### MS MARCO (original dataset)
**MS MARCO: A Human Generated MAchine Reading COmprehension Dataset** — Bajaj et al., 2016. [arXiv:1611.09268](https://arxiv.org/abs/1611.09268)
> Abstract (abridged): We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises of 1,010,916 anonymized questions—sampled from Bing's search query logs—each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages—extracted from 3,563,535 web documents retrieved by Bing—that provide the information necessary for curating the natural language answers. … Using this dataset, we propose three different tasks with varying levels of difficulty: … (iii) rank a set of retrieved passages given a question.
BibTeX (from the [MS MARCO citation page](https://microsoft.github.io/msmarco/Datasets.html)):
```bibtex
@article{bajaj2016ms,
title={Ms marco: A human generated machine reading comprehension dataset},
author={Bajaj, Payal and Campos, Daniel and Craswell, Nick and Deng, Li and Gao, Jianfeng and Liu, Xiaodong and Majumder, Rangan and McNamara, Andrew and Mitra, Bhaskar and Nguyen, Tri and others},
journal={arXiv preprint arXiv:1611.09268},
year={2016}
}
```
### Passage ranking / neural baselines
**An Updated Duet Model for Passage Re-ranking** — Mitra & Craswell, 2019. [arXiv:1903.07666](https://arxiv.org/abs/1903.07666)
> Abstract: We propose several small modifications to Duet—a deep neural ranking model—and evaluate the updated model on the MS MARCO passage ranking task. We report significant improvements from the proposed changes based on an ablation study.
### Official resources
- [MS MARCO — Datasets for document and passage ranking leaderboards](https://microsoft.github.io/msmarco/Datasets.html)
- [TREC Deep Learning Track](https://microsoft.github.io/msmarco/TREC-Deep-Learning) (blind evaluation using MS MARCO-style ranking tasks)
- [MSMARCO-Passage-Ranking (GitHub)](https://github.com/microsoft/MSMARCO-Passage-Ranking)
## License and terms
The underlying MS MARCO data is subject to the [terms and conditions](https://microsoft.github.io/msmarco/Datasets.html#terms-and-conditions) stated by Microsoft (non-commercial research use; see the official site for details). When publishing work that uses MS MARCO or derivatives, cite the MS MARCO paper above and comply with the original license and usage restrictions.
## Citation
If you use this Hugging Face dataset, cite **MS MARCO** (Bajaj et al., 2016) and acknowledge this derivative packaging as appropriate for your publication.
Provider:
orgrctera



