msmarco-hard-negatives

Name: msmarco-hard-negatives
Creator: maas
Published: 2025-12-05 04:10:39
License: No description available

ModelScope community. Updated: 2025-12-05. Indexed: 2025-06-14.

Download link:

https://modelscope.cn/datasets/sentence-transformers/msmarco-hard-negatives


Resource description:

# MS MARCO Passages Hard Negatives

> [!NOTE]
> This repository contains raw datasets, all of which have also been formatted for easy training in the [MS MARCO Mined Triplets](https://huggingface.co/collections/sentence-transformers/ms-marco-mined-triplets-6644d6f1ff58c5103fe65f23) collection. We recommend looking there first.

[MS MARCO](https://microsoft.github.io/msmarco/) is a large-scale information retrieval corpus created from real user search queries on the Bing search engine.

This dataset repository contains files that are helpful for training bi-encoder models, e.g. using [sentence-transformers](https://www.sbert.net).

## Training Code

An example of how these files can be used to train bi-encoders is available here: [SBERT.net - MS MARCO - MarginMSE](https://www.sbert.net/examples/training/ms_marco/README.html#marginmse)

## cross-encoder-ms-marco-MiniLM-L-6-v2-scores.pkl.gz

This is a pickled dictionary in the format: `scores[qid][pid] -> cross_encoder_score`

It contains 160 million cross-encoder scores for (query, paragraph) pairs, computed with the [cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) model.

## msmarco-hard-negatives.jsonl.gz

This is a JSONL file: each line is a JSON object in the following format:

```
{"qid": 867436, "pos": [5238393], "neg": {"bm25": [...], ...}}
```

`qid` is the query ID from MS MARCO, and `pos` is a list of paragraph IDs for positive passages. `neg` is a dictionary of hard negatives mined with different (mainly dense retrieval) systems.

It contains hard negatives mined from BM25 (using ElasticSearch) and the following dense models:

```
msmarco-distilbert-base-tas-b
msmarco-distilbert-base-v3
msmarco-MiniLM-L-6-v3
distilbert-margin_mse-cls-dot-v2
distilbert-margin_mse-cls-dot-v1
distilbert-margin_mse-mean-dot-v1
mpnet-margin_mse-mean-v1
co-condenser-margin_mse-cls-v1
distilbert-margin_mse-mnrl-mean-v1
distilbert-margin_mse-sym_mnrl-mean-v1
distilbert-margin_mse-sym_mnrl-mean-v2
co-condenser-margin_mse-sym_mnrl-mean-v1
```

From each system, the 50 most similar paragraphs were mined for each query.
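The two files are typically combined: each hard negative is paired with a positive passage, and the difference between their cross-encoder scores serves as the target margin for MarginMSE-style training. The following is a minimal sketch of that pairing, not the official training script. The function names `iter_hard_negatives` and `margin_examples` are hypothetical, the toy IDs and scores are made up for illustration, and it assumes that each entry under `neg` is a flat list of passage IDs (the format example above elides this with `[...]`).

```python
import gzip
import io
import json
import pickle


def iter_hard_negatives(fileobj):
    """Yield one dict per non-empty line of a gzip-compressed JSONL stream."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)


def margin_examples(record, scores, systems=None):
    """Build (qid, pos_pid, neg_pid, margin) tuples for one record.

    margin = score(query, positive) - score(query, negative).
    Assumes record["neg"][system] is a list of passage IDs.
    Pairs without a cross-encoder score are skipped, and a negative
    returned by several systems is used only once.
    """
    qid = record["qid"]
    qscores = scores.get(qid, {})
    positives = [pid for pid in record["pos"] if pid in qscores]
    seen = set(record["pos"])  # never treat a positive as a negative
    examples = []
    for system, neg_pids in record["neg"].items():
        if systems is not None and system not in systems:
            continue
        for neg_pid in neg_pids:
            if neg_pid in seen or neg_pid not in qscores:
                continue
            seen.add(neg_pid)
            for pos_pid in positives:
                margin = qscores[pos_pid] - qscores[neg_pid]
                examples.append((qid, pos_pid, neg_pid, margin))
    return examples


# Tiny in-memory stand-ins for the two files (toy values, not real data).
toy_jsonl = gzip.compress(
    b'{"qid": 867436, "pos": [5238393], '
    b'"neg": {"bm25": [11, 12], "msmarco-distilbert-base-tas-b": [12, 13]}}\n'
)
toy_scores = gzip.compress(
    pickle.dumps({867436: {5238393: 9.5, 11: 1.5, 12: -2.0}})
)

# Load the pickled scores[qid][pid] dictionary.
with gzip.open(io.BytesIO(toy_scores), "rb") as f:
    scores = pickle.load(f)

records = list(iter_hard_negatives(io.BytesIO(toy_jsonl)))
examples = margin_examples(records[0], scores)
print(examples)
```

With the real files, the `io.BytesIO(...)` objects would be replaced by local paths to the downloaded `.jsonl.gz` and `.pkl.gz` files. Because `pickle.load` can execute arbitrary code, it should only be used on files from a trusted source. The optional `systems` argument restricts negatives to a subset of the retrieval systems listed above.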


Provider:

maas

Created:

2025-01-06
