
msmarco-corpus

ModelScope community, updated 2025-11-07, indexed 2025-01-11
Download link:
https://modelscope.cn/datasets/sentence-transformers/msmarco-corpus
Resource overview:
# MS MARCO Corpus

This dataset allows for a convenient mapping from MS MARCO query/passage ID to the query/passage text. This passage corpus was downloaded from https://msmarco.z22.web.core.windows.net/msmarcoranking/collection.tar.gz, and the queries from https://msmarco.blob.core.windows.net/msmarcoranking/queries.tar.gz (via Wayback Machine).

## Usage

This dataset was designed to allow you to perform the following:

```python
from datasets import load_dataset

query_dataset = load_dataset("sentence-transformers/msmarco-corpus", "query", split="train")
qid_to_query = dict(zip(query_dataset["qid"], query_dataset["text"]))
print(qid_to_query[571018])
# => "what are the liberal arts?"

passage_dataset = load_dataset("sentence-transformers/msmarco-corpus", "passage", split="train")
pid_to_passage = dict(zip(passage_dataset["pid"], passage_dataset["text"]))
print(pid_to_passage[7349777])
# => "liberal arts. 1. the academic course of instruction at a college intended to provide general knowledge and comprising the arts, humanities, natural sciences, and social sciences, as opposed to professional or technical subjects."
```

## Related Datasets

This dataset is used for the query and passage texts in the following datasets containing MS MARCO with mined hard negatives.

* [msmarco-bm25](https://huggingface.co/datasets/sentence-transformers/msmarco-bm25)
* [msmarco-msmarco-distilbert-base-tas-b](https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-tas-b)
* [msmarco-msmarco-distilbert-base-v3](https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-v3)
* [msmarco-msmarco-MiniLM-L-6-v3](https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-MiniLM-L-6-v3)
* [msmarco-distilbert-margin-mse-cls-dot-v2](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-cls-dot-v2)
* [msmarco-distilbert-margin-mse-cls-dot-v1](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-cls-dot-v1)
* [msmarco-distilbert-margin-mse-mean-dot-v1](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-mean-dot-v1)
* [msmarco-mpnet-margin-mse-mean-v1](https://huggingface.co/datasets/sentence-transformers/msmarco-mpnet-margin-mse-mean-v1)
* [msmarco-co-condenser-margin-mse-cls-v1](https://huggingface.co/datasets/sentence-transformers/msmarco-co-condenser-margin-mse-cls-v1)
* [msmarco-distilbert-margin-mse-mnrl-mean-v1](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-mnrl-mean-v1)
* [msmarco-distilbert-margin-mse-sym-mnrl-mean-v1](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-sym-mnrl-mean-v1)
* [msmarco-distilbert-margin-mse-sym-mnrl-mean-v2](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-sym-mnrl-mean-v2)
* [msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1](https://huggingface.co/datasets/sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1)
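
The ID-to-text mappings can be applied to rows of a hard-negatives dataset that stores only IDs. The following is a minimal sketch of that lookup, not the schema of any dataset listed above: the field names `query_id`, `positive_id`, and `negative_ids`, the helper `resolve_row`, and the toy IDs and texts are all hypothetical and chosen for illustration. In practice the two dictionaries would be `qid_to_query` and `pid_to_passage` built from this corpus, and the field names would need to match the actual columns of the dataset in use.

```python
from typing import Dict, List, Mapping, Union

Row = Mapping[str, Union[int, List[int]]]


def resolve_row(
    row: Row,
    qid_to_query: Mapping[int, str],
    pid_to_passage: Mapping[int, str],
) -> Dict[str, Union[str, List[str]]]:
    """Replace the IDs in one row with their texts.

    The field names used here are hypothetical; adapt them to the
    columns of the hard-negatives dataset being resolved.
    """
    return {
        "query": qid_to_query[row["query_id"]],
        "positive": pid_to_passage[row["positive_id"]],
        "negatives": [pid_to_passage[pid] for pid in row["negative_ids"]],
    }


# Toy stand-ins for the mappings built from the corpus (made-up values).
toy_qid_to_query = {1: "toy query"}
toy_pid_to_passage = {10: "toy positive", 11: "toy negative a", 12: "toy negative b"}

# A toy row holding only IDs, as a hard-negatives dataset might.
toy_row = {"query_id": 1, "positive_id": 10, "negative_ids": [11, 12]}

resolved = resolve_row(toy_row, toy_qid_to_query, toy_pid_to_passage)
print(resolved)
```

Building the dictionaries once and reusing them across rows keeps each lookup constant-time, at the cost of holding the full corpus text in memory.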

Provider: maas
Created: 2025-01-06