msmarco-hard-negatives
ModelScope Community. Updated 2025-12-05; indexed 2025-06-14.
Download link:
https://modelscope.cn/datasets/sentence-transformers/msmarco-hard-negatives
Resource description:
# MS MARCO Passages Hard Negatives
> [!NOTE]
> This repository contains raw datasets, all of which have also been formatted for easy training in the [MS MARCO Mined Triplets](https://huggingface.co/collections/sentence-transformers/ms-marco-mined-triplets-6644d6f1ff58c5103fe65f23) collection. We recommend looking there first.
[MS MARCO](https://microsoft.github.io/msmarco/) is a large-scale information retrieval corpus built from real user search queries issued to the Bing search engine.
This dataset repository contains files that are helpful for training bi-encoder models, e.g. with [sentence-transformers](https://www.sbert.net).
## Training Code
An example of how these files can be used to train bi-encoders is available here: [SBERT.net - MS MARCO - MarginMSE](https://www.sbert.net/examples/training/ms_marco/README.html#marginmse)
## cross-encoder-ms-marco-MiniLM-L-6-v2-scores.pkl.gz
This is a pickled dictionary in the format: `scores[qid][pid] -> cross_encoder_score`
It contains 160 million cross-encoder scores for (query, paragraph) pairs, computed with the [cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) model.
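As a minimal sketch of how the `scores[qid][pid]` layout described above could be read, the snippet below loads the gzipped pickle with the standard library. The helper names `load_ce_scores` and `lookup_score` are hypothetical, and the path is whatever location you downloaded the file to.

```python
import gzip
import pickle


def load_ce_scores(path):
    """Load the gzipped pickle into a dict: scores[qid][pid] -> score."""
    # Only unpickle files from a source you trust: pickle can run code on load.
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


def lookup_score(scores, qid, pid, default=None):
    """Return the cross-encoder score for (qid, pid), or `default` if absent."""
    return scores.get(qid, {}).get(pid, default)
```

With 160 million entries, the full dictionary is likely to need several gigabytes of memory once loaded, so it may be worth loading it once per process and sharing it.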
## msmarco-hard-negatives.jsonl.gz
This is a JSON Lines (jsonl) file: each line is a JSON object with the following format:
```
{"qid": 867436, "pos": [5238393], "neg": {"bm25": [...], ...}}
```
`qid` is the query ID from MS MARCO, and `pos` is a list of paragraph IDs for positive passages. `neg` is a dictionary that holds the hard negatives mined with different (mainly dense retrieval) systems, keyed by system name.
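The record format above can be read line by line without decompressing the whole file first. The following is a sketch using only the standard library; `iter_hard_negatives` and `systems_in` are hypothetical helper names, and it assumes each value in `neg` is a list of paragraph IDs.

```python
import gzip
import json


def iter_hard_negatives(path):
    """Yield one dict per line of the gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)


def systems_in(record):
    """Sorted names of the retrieval systems that contributed negatives."""
    return sorted(record["neg"].keys())
```

Because the function is a generator, it can be combined with `itertools.islice` to inspect only the first few records.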
It contains hard negatives mined from BM25 (using ElasticSearch) and the following dense models:
```
msmarco-distilbert-base-tas-b
msmarco-distilbert-base-v3
msmarco-MiniLM-L-6-v3
distilbert-margin_mse-cls-dot-v2
distilbert-margin_mse-cls-dot-v1
distilbert-margin_mse-mean-dot-v1
mpnet-margin_mse-mean-v1
co-condenser-margin_mse-cls-v1
distilbert-margin_mse-mnrl-mean-v1
distilbert-margin_mse-sym_mnrl-mean-v1
distilbert-margin_mse-sym_mnrl-mean-v2
co-condenser-margin_mse-sym_mnrl-mean-v1
```
For each query, the 50 most similar paragraphs were mined from each system.
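One way the two files might be combined for MarginMSE-style training is sketched below: negatives from several systems are merged and deduplicated, then each (positive, negative) pair is labeled with the difference of its cross-encoder scores. This is an illustration under assumptions, not the procedure from the linked training script. The function names, the `per_system` cap, and the choice to skip pairs missing from the score file are all hypothetical.

```python
def collect_negatives(record, systems=None, per_system=50):
    """Merge negatives from the chosen systems, keeping first-seen order."""
    positives = set(record["pos"])
    chosen = systems if systems is not None else list(record["neg"])
    seen, merged = set(), []
    for name in chosen:
        for pid in record["neg"].get(name, [])[:per_system]:
            # Drop duplicates across systems and anything labeled positive.
            if pid not in positives and pid not in seen:
                seen.add(pid)
                merged.append(pid)
    return merged


def margin_examples(record, scores, negatives):
    """Build (qid, pos_pid, neg_pid, margin) tuples with teacher-score margins."""
    qid = record["qid"]
    q_scores = scores.get(qid, {})
    examples = []
    for pos in record["pos"]:
        for neg in negatives:
            # Assumption: pairs without both scores are skipped, not imputed.
            if pos in q_scores and neg in q_scores:
                examples.append((qid, pos, neg, q_scores[pos] - q_scores[neg]))
    return examples
```

Here `record` is one parsed line of `msmarco-hard-negatives.jsonl.gz` and `scores` is the dictionary from `cross-encoder-ms-marco-MiniLM-L-6-v2-scores.pkl.gz`. Restricting `systems` to a subset is one way to control how hard the sampled negatives are.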
Provider: maas
Created: 2025-01-06



