
sentence-transformers/msmarco-corpus

Source: Hugging Face. Updated 2024-05-06; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/sentence-transformers/msmarco-corpus
Resource description:
---
language:
- en
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
pretty_name: MS MARCO corpus
dataset_info:
- config_name: passage
  features:
  - name: pid
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3089201649
    num_examples: 8841823
  download_size: 1688656108
  dataset_size: 3089201649
- config_name: query
  features:
  - name: qid
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 48033044
    num_examples: 1010916
  download_size: 34858846
  dataset_size: 48033044
configs:
- config_name: passage
  data_files:
  - split: train
    path: passage/train-*
- config_name: query
  data_files:
  - split: train
    path: queries/train-*
---

# MS MARCO Corpus

This dataset allows for a convenient mapping from MS MARCO query/passage ID to the query/passage text. This passage corpus was downloaded from https://msmarco.z22.web.core.windows.net/msmarcoranking/collection.tar.gz, and the queries from https://msmarco.blob.core.windows.net/msmarcoranking/queries.tar.gz (via Wayback Machine).

## Usage

This dataset was designed to allow you to perform the following:

```python
from datasets import load_dataset

query_dataset = load_dataset("sentence-transformers/msmarco-corpus", "query", split="train")
qid_to_query = dict(zip(query_dataset["qid"], query_dataset["text"]))
print(qid_to_query[571018])
# => "what are the liberal arts?"

passage_dataset = load_dataset("sentence-transformers/msmarco-corpus", "passage", split="train")
pid_to_passage = dict(zip(passage_dataset["pid"], passage_dataset["text"]))
print(pid_to_passage[7349777])
# => "liberal arts. 1. the academic course of instruction at a college intended to provide general knowledge and comprising the arts, humanities, natural sciences, and social sciences, as opposed to professional or technical subjects."
```

## Related Datasets

This dataset is used for the query and passage texts in the following datasets containing MS MARCO with mined hard negatives.

* [msmarco-bm25](https://huggingface.co/datasets/sentence-transformers/msmarco-bm25)
* [msmarco-msmarco-distilbert-base-tas-b](https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-tas-b)
* [msmarco-msmarco-distilbert-base-v3](https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-v3)
* [msmarco-msmarco-MiniLM-L-6-v3](https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-MiniLM-L-6-v3)
* [msmarco-distilbert-margin-mse-cls-dot-v2](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-cls-dot-v2)
* [msmarco-distilbert-margin-mse-cls-dot-v1](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-cls-dot-v1)
* [msmarco-distilbert-margin-mse-mean-dot-v1](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-mean-dot-v1)
* [msmarco-mpnet-margin-mse-mean-v1](https://huggingface.co/datasets/sentence-transformers/msmarco-mpnet-margin-mse-mean-v1)
* [msmarco-co-condenser-margin-mse-cls-v1](https://huggingface.co/datasets/sentence-transformers/msmarco-co-condenser-margin-mse-cls-v1)
* [msmarco-distilbert-margin-mse-mnrl-mean-v1](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-mnrl-mean-v1)
* [msmarco-distilbert-margin-mse-sym-mnrl-mean-v1](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-sym-mnrl-mean-v1)
* [msmarco-distilbert-margin-mse-sym-mnrl-mean-v2](https://huggingface.co/datasets/sentence-transformers/msmarco-distilbert-margin-mse-sym-mnrl-mean-v2)
* [msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1](https://huggingface.co/datasets/sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1)
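
As a follow-up to the Usage section, the ID-to-text lookup can be wrapped in small helpers so that ID pairs from another dataset are resolved to text in one call. This is only a sketch under stated assumptions: `build_id_to_text` and `resolve_pair` are hypothetical names, not part of this dataset or of the `datasets` library, and the toy rows below stand in for the real `query` and `passage` configs so the snippet runs without a download.

```python
from typing import Dict, Iterable, Mapping, Tuple


def build_id_to_text(rows: Iterable[Mapping], id_key: str) -> Dict[int, str]:
    """Map each row's integer ID (e.g. ``qid`` or ``pid``) to its ``text`` field."""
    return {int(row[id_key]): row["text"] for row in rows}


def resolve_pair(
    qid: int,
    pid: int,
    qid_to_query: Mapping[int, str],
    pid_to_passage: Mapping[int, str],
) -> Tuple[str, str]:
    """Return the (query text, passage text) for one (qid, pid) pair."""
    return qid_to_query[qid], pid_to_passage[pid]


# Toy stand-ins for the real configs (assumption: rows are dict-like with the
# column names listed in the dataset card). The passage text is truncated here
# for brevity.
toy_queries = [{"qid": 571018, "text": "what are the liberal arts?"}]
toy_passages = [{"pid": 7349777, "text": "liberal arts. 1. the academic course of instruction at a college"}]

qid_to_query = build_id_to_text(toy_queries, "qid")
pid_to_passage = build_id_to_text(toy_passages, "pid")

query_text, passage_text = resolve_pair(571018, 7349777, qid_to_query, pid_to_passage)
print(query_text)
print(passage_text)
```

Building the full passage mapping keeps roughly 8.8 million strings in memory, so it may be worth building it once and reusing it across lookups.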
Provider:
sentence-transformers
Summary of original information

Dataset overview

Basic information

  • Name: MS MARCO Corpus
  • Language: English
  • Multilinguality: monolingual
  • Size: 1M<n<10M

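Dataset structure

Config name: passage

  • Features:
    • pid: integer (int64)
    • text: string
  • Splits:
    • train:
      • Size in bytes: 3089201649
      • Number of examples: 8841823
      • Download size (bytes): 1688656108
      • Dataset size (bytes): 3089201649

Config name: query

  • Features:
    • qid: integer (int64)
    • text: string
  • Splits:
    • train:
      • Size in bytes: 48033044
      • Number of examples: 1010916
      • Download size (bytes): 34858846
      • Dataset size (bytes): 48033044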
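
The raw byte counts above are easier to compare once converted to larger units. The following sketch derives sizes in GiB, the average bytes per example, and the download-to-dataset size ratio from the listed figures. The `summarize` helper and the `CONFIGS` dictionary are illustrative names, not part of the dataset.

```python
# Figures copied from the dataset metadata listed above.
CONFIGS = {
    "passage": {
        "num_bytes": 3_089_201_649,
        "num_examples": 8_841_823,
        "download_size": 1_688_656_108,
    },
    "query": {
        "num_bytes": 48_033_044,
        "num_examples": 1_010_916,
        "download_size": 34_858_846,
    },
}

GIB = 2**30  # bytes per gibibyte


def summarize(stats: dict) -> dict:
    """Derive human-readable figures from one config's raw byte counts."""
    return {
        "dataset_gib": stats["num_bytes"] / GIB,
        "download_gib": stats["download_size"] / GIB,
        "avg_bytes_per_example": stats["num_bytes"] / stats["num_examples"],
        "download_ratio": stats["download_size"] / stats["num_bytes"],
    }


summaries = {name: summarize(stats) for name, stats in CONFIGS.items()}

for name, s in summaries.items():
    print(
        f"{name}: {s['dataset_gib']:.2f} GiB on disk, "
        f"{s['download_gib']:.2f} GiB download, "
        f"{s['avg_bytes_per_example']:.1f} bytes/example, "
        f"download/dataset ratio {s['download_ratio']:.2f}"
    )
```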

Data files

  • Config name: passage
    • Split: train
    • Path: passage/train-*
  • Config name: query
    • Split: train
    • Path: queries/train-*
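
Note that the `query` config reads from a `queries/` directory, so the config name and the path prefix differ. The sketch below illustrates how the two path patterns above partition files between configs. It uses Python's standard `fnmatch` purely for illustration, which is not necessarily how the `datasets` library resolves patterns, and the shard file names are hypothetical examples rather than the repository's actual file names.

```python
from fnmatch import fnmatch
from typing import Optional

# Path patterns copied from the data-files listing above.
DATA_FILES = {
    "passage": "passage/train-*",
    "query": "queries/train-*",
}


def config_for(path: str) -> Optional[str]:
    """Return the config whose pattern matches ``path``, or None if none does."""
    for config, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return config
    return None


# Hypothetical shard names, used only to exercise the patterns.
for candidate in (
    "passage/train-00000-of-00007.parquet",
    "queries/train-00000-of-00001.parquet",
    "query/train-00000-of-00001.parquet",
):
    print(candidate, "->", config_for(candidate))
```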