
orgrctera/msmarco_document_ranking

Hugging Face | Updated 2026-03-20 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/orgrctera/msmarco_document_ranking
Resource description:
---
license: other
task_categories:
- text-retrieval
language:
- en
tags:
- information-retrieval
- document-ranking
- msmarco
- benchmark
- retrieval
pretty_name: MS MARCO Document Ranking (CTERA format)
size_categories:
- "100K<n<1M"
---

# MS MARCO Document Ranking

## Dataset description and background

**MS MARCO** (MicroSoft MAchine Reading COmprehension) is a large-scale collection originally introduced for machine reading comprehension and question answering. Over time it has become a standard benchmark for **information retrieval** under abundant training data: hundreds of thousands of queries with human relevance signals, aligned with real web search behavior.

The **document ranking** track uses a **document-level corpus** derived from the same ecosystem as MS MARCO passage ranking. The official resource describes a corpus on the order of **~3.2M documents**, with **training queries on the order of hundreds of thousands**, **development** and **leaderboard test** query sets, and **TREC-style qrels** (query–relevance judgments) for training and development. For training, **passage-level labels are mapped to document IDs** under the assumption that a document containing a judged-relevant passage is treated as a relevant document, which supports transfer between passage-focused and document-focused retrieval research.

This Hugging Face dataset is a **CTERA-packaged view** of that task: each row pairs a **natural-language query** with **structured labels/metadata** suitable for retrieval and RAG benchmarking (see [Data fields](#data-fields) below).

**Official MS MARCO ranking resources (corpus, qrels, leaderboards):** [MS MARCO: Datasets for Document and Passage Ranking](https://microsoft.github.io/msmarco/Datasets.html)

## Task: document ranking / retrieval

This dataset supports **ad-hoc document retrieval / ranking**: given a **query**, a system should **rank documents** from a collection by **relevance**.
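As a rough illustration of the task definition above, the sketch below ranks a tiny in-memory collection for one query. The term-overlap scorer, the `overlap_score` and `rank_documents` helpers, and the toy documents with their ids are all assumptions made for illustration; none of them come from MS MARCO or from this dataset, and a real system would substitute a proper retrieval model for the scorer.

```python
from collections import Counter


def overlap_score(query: str, document: str) -> int:
    """Count how often the query's distinct terms occur in the document (toy signal)."""
    doc_terms = Counter(document.lower().split())
    return sum(doc_terms[term] for term in set(query.lower().split()))


def rank_documents(query: str, collection: dict, k: int = 100) -> list:
    """Return up to k document ids, highest score first (ties broken by id)."""
    scored = [(overlap_score(query, text), doc_id) for doc_id, text in collection.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy collection: ids and texts are invented for this example.
toy_collection = {
    "D_a": "deposit money into your betting account with no charge",
    "D_b": "weather forecast for the weekend",
    "D_c": "how to open an account",
}

ranking = rank_documents("charge to deposit money in your account", toy_collection, k=2)
print(ranking)
```

The `k` parameter caps the length of the returned list, which is how a bounded per-query submission would be produced from a full ranking.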
In research settings the ranking task is often split into:

- **Full ranking (retrieval):** score or retrieve from the **full document collection** (official submissions allow a bounded number of documents per query, e.g. up to **100** in the MS MARCO document ranking setup).
- **Top-k reranking:** rerank a fixed candidate list (e.g. **top-100** candidates from a first-stage retriever), a common production pattern of “retrieve, then rerank.”

**Evaluation** in the MS MARCO ranking leaderboards is typically reported with **MRR@10** (Mean Reciprocal Rank at rank 10) for the document ranking task, alongside standard TREC-style analyses where applicable. (Consult the official leaderboard and TREC Deep Learning track materials for the exact metric definitions used in a given campaign.)

## Data fields

Parquet splits expose three columns:

| Column | Description |
|--------|-------------|
| `input` | The query text (string). |
| `expected_output` | JSON string with relevance information; **format differs by split** (see examples). |
| `metadata` | JSON string with identifiers and benchmark tags (`benchmark_name`, `split`, `query_id`, etc.). |

**Splits:** `train`, `dev`, and `test` are provided. The **test** split may use **held-out** or **empty** labels in `expected_output` for blind evaluation workflows; check the sample below.
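Because `expected_output` changes shape between splits, evaluation code usually has to normalize it before scoring. The following is a minimal sketch, assuming only the two shapes that appear in this card's examples (an object with a `qrels` list for train, a plain list for dev and test). The helper names `relevant_doc_ids`, `reciprocal_rank`, and `mrr_at_k` are hypothetical, the `run` rankings are invented, and skipping rows without labels is an assumption of this sketch rather than the behavior of the official scoring script.

```python
import json


def relevant_doc_ids(expected_output: str) -> set:
    """Normalize the split-dependent `expected_output` string to a set of document ids."""
    parsed = json.loads(expected_output)
    if isinstance(parsed, dict):  # train-style: {"qid": ..., "qrels": [...]}
        return set(parsed.get("qrels", []))
    return set(parsed)  # dev/test-style: [...] (possibly empty)


def reciprocal_rank(ranked_ids, relevant, k=10):
    """Return 1/rank of the first relevant id within the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mrr_at_k(rows, run, k=10):
    """Average reciprocal rank over rows that carry at least one label."""
    scores = []
    for row in rows:
        relevant = relevant_doc_ids(row["expected_output"])
        if not relevant:  # assumption: skip rows whose labels are withheld
            continue
        query_id = json.loads(row["metadata"])["query_id"]
        scores.append(reciprocal_rank(run.get(query_id, []), relevant, k))
    return sum(scores) / len(scores) if scores else 0.0


# Rows abbreviated from this card's examples (metadata trimmed to `query_id`).
rows = [
    {"expected_output": '{"qid": "1185869", "qrels": ["D59219"]}',
     "metadata": '{"query_id": "1185869"}'},
    {"expected_output": '["D1987644"]', "metadata": '{"query_id": "174249"}'},
    {"expected_output": "[]", "metadata": '{"query_id": "355339"}'},
]

# Hypothetical system output: query id -> ranked document ids ("D100", "D5" are invented).
run = {
    "1185869": ["D100", "D59219"],
    "174249": ["D1987644"],
    "355339": ["D5"],
}

score = mrr_at_k(rows, run, k=10)
print(score)
```

Keeping the normalization in one function means the scoring loop does not need to know which split a row came from.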
## Examples

**Train** (`expected_output` includes query id and judged relevant document id(s) as JSON):

```json
{
  "input": ")what was the immediate impact of the success of the manhattan project?",
  "expected_output": "{\"qid\": \"1185869\", \"qrels\": [\"D59219\"]}",
  "metadata": "{\"query_id\": \"1185869\", \"split\": \"train\", \"benchmark_name\": \"msmarco_document_ranking\", \"benchmark_type\": \"base_rag\", \"sub_benchmark\": \"document_ranking\"}"
}
```

**Dev** (`expected_output` is a JSON-encoded list of relevant document ids):

```json
{
  "input": "does xpress bet charge to deposit money in your account",
  "expected_output": "[\"D1987644\"]",
  "metadata": "{\"query_id\": \"174249\", \"split\": \"dev\", \"benchmark_name\": \"msmarco_document_ranking\", \"benchmark_type\": \"base_rag\", \"sub_benchmark\": \"document_ranking\"}"
}
```

**Test** (labels may be withheld; example shows an empty list):

```json
{
  "input": "how to display how.close you are to.cell.tower",
  "expected_output": "[]",
  "metadata": "{\"query_id\": \"355339\", \"split\": \"test\", \"benchmark_name\": \"msmarco_document_ranking\", \"benchmark_type\": \"base_rag\", \"sub_benchmark\": \"document_ranking\"}"
}
```

## References

### Foundational MS MARCO paper (cite when using MS MARCO-derived data)

**Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang.** *MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.* arXiv:1611.09268, 2016.

**Abstract (short):** MS MARCO introduces a large-scale machine reading comprehension dataset built from real Bing queries, with human-generated answers and millions of passages from retrieved web documents. The work defines multiple tasks of varying difficulty, including passage ranking, establishing MS MARCO as a benchmark for realistic, large-scale QA and IR research.
- Paper: [arXiv:1611.09268](https://arxiv.org/abs/1611.09268)

```bibtex
@article{bajaj2016ms,
  title={Ms marco: A human generated machine reading comprehension dataset},
  author={Bajaj, Payal and Campos, Daniel and Craswell, Nick and Deng, Li and Gao, Jianfeng and Liu, Xiaodong and Majumder, Rangan and McNamara, Andrew and Mitra, Bhaskar and Nguyen, Tri and others},
  journal={arXiv preprint arXiv:1611.09268},
  year={2016}
}
```

### Official dataset and leaderboard documentation

- [MS MARCO: Datasets for Document and Passage Ranking Leaderboards](https://microsoft.github.io/msmarco/Datasets.html): corpus files, qrels, submission formats, and task description.
- [MS MARCO: Submission / evaluation](https://microsoft.github.io/msmarco/Submission.html): ranking submission conventions.
- [TREC Deep Learning Track](https://microsoft.github.io/msmarco/TREC-Deep-Learning): blind evaluation and community benchmarks related to MS MARCO ranking tasks.

### Related code and corpora (Microsoft)

- [microsoft/MSMARCO-Document-Ranking](https://github.com/microsoft/MSMARCO-Document-Ranking): pointers and tooling around the document ranking collection.

## Terms and licensing

The **underlying MS MARCO data** is subject to Microsoft’s **terms for non-commercial research** as published on the official MS MARCO site; review the **Terms and Conditions** on [the official datasets page](https://microsoft.github.io/msmarco/Datasets.html) before use in products or redistributions.

This Hugging Face dataset card describes the **CTERA-formatted** release; when publishing results, cite **MS MARCO** appropriately and follow the original **license / usage** constraints for the source corpus and judgments.

## Acknowledgments

Dataset packaging for this repository is maintained by **CTERA**. **MS MARCO** is provided by **Microsoft** and the broader IR community; see the official site for credits and contact information.
Provider: orgrctera