msmarco-hard-negatives

ModelScope community · updated 2025-12-05 · indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/sentence-transformers/msmarco-hard-negatives
Resource description:
# MS MARCO Passages Hard Negatives

> [!NOTE]
> This repository contains raw datasets, all of which have also been formatted for easy training in the [MS MARCO Mined Triplets](https://huggingface.co/collections/sentence-transformers/ms-marco-mined-triplets-6644d6f1ff58c5103fe65f23) collection. We recommend looking there first.

[MS MARCO](https://microsoft.github.io/msmarco/) is a large-scale information retrieval corpus built from real user search queries issued to the Bing search engine.

This dataset repository contains files that are helpful for training bi-encoder models, e.g. using [sentence-transformers](https://www.sbert.net).

## Training Code

An example of how these files can be used to train bi-encoders is available here: [SBERT.net - MS MARCO - MarginMSE](https://www.sbert.net/examples/training/ms_marco/README.html#marginmse)

## cross-encoder-ms-marco-MiniLM-L-6-v2-scores.pkl.gz

This is a pickled dictionary in the format: `scores[qid][pid] -> cross_encoder_score`

It contains 160 million cross-encoder scores for (query, paragraph) pairs, computed with the [cross-encoder/ms-marco-MiniLM-L-6-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) model.

## msmarco-hard-negatives.jsonl.gz

This is a JSONL file: each line is a JSON object in the following format:

```
{"qid": 867436, "pos": [5238393], "neg": {"bm25": [...], ...}}
```

`qid` is the query ID from MS MARCO, `pos` is a list of paragraph IDs for positive passages, and `neg` is a dictionary of hard negatives mined with different (mainly dense retrieval) systems, keyed by system name.

It contains hard negatives mined from BM25 (using ElasticSearch) and the following dense models:

```
msmarco-distilbert-base-tas-b
msmarco-distilbert-base-v3
msmarco-MiniLM-L-6-v3
distilbert-margin_mse-cls-dot-v2
distilbert-margin_mse-cls-dot-v1
distilbert-margin_mse-mean-dot-v1
mpnet-margin_mse-mean-v1
co-condenser-margin_mse-cls-v1
distilbert-margin_mse-mnrl-mean-v1
distilbert-margin_mse-sym_mnrl-mean-v1
distilbert-margin_mse-sym_mnrl-mean-v2
co-condenser-margin_mse-sym_mnrl-mean-v1
```

From each system, the 50 most similar paragraphs were mined for each query.
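
As a rough illustration of how the two files fit together, the sketch below reads the JSONL records, loads the pickled score dictionary, and keeps only those mined negatives that the cross-encoder scores clearly below the positives. The helper names (`read_hard_negatives`, `load_scores`, `select_negatives`), the `margin` and `per_system` parameters, and the filtering rule itself are assumptions made for this example, not part of the dataset or of sentence-transformers. The sketch also assumes that `qid` and `pid` keys in the score dictionary are integers, as in the JSON example above.

```python
import gzip
import json
import pickle
from typing import Dict, Iterator, List


def read_hard_negatives(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of the gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fin:
        for line in fin:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_scores(path: str) -> Dict[int, Dict[int, float]]:
    """Load the gzipped pickle holding scores[qid][pid].

    Unpickling can execute arbitrary code, so only load files you trust.
    """
    with gzip.open(path, "rb") as fin:
        return pickle.load(fin)


def select_negatives(
    record: dict,
    scores: Dict[int, Dict[int, float]],
    margin: float = 3.0,   # hypothetical default, tune for your setup
    per_system: int = 10,  # hypothetical cap on negatives kept per system
) -> List[int]:
    """Keep negatives scored at least `margin` below the weakest positive.

    This filtering rule is an assumption for illustration; it aims to drop
    mined passages that are likely unlabeled positives.
    """
    query_scores = scores.get(record["qid"], {})
    pos_scores = [query_scores[pid] for pid in record["pos"] if pid in query_scores]
    if not pos_scores:
        return []  # no scored positive, so no threshold can be derived

    threshold = min(pos_scores) - margin
    seen = set(record["pos"])  # never return a positive or a duplicate
    selected: List[int] = []
    for system, pids in record["neg"].items():
        kept = 0
        for pid in pids:
            if kept >= per_system:
                break
            if pid in seen or pid not in query_scores:
                continue
            if query_scores[pid] <= threshold:
                selected.append(pid)
                seen.add(pid)
                kept += 1
    return selected


# Tiny in-memory example with made-up IDs and scores.
record = {
    "qid": 1,
    "pos": [10],
    "neg": {"bm25": [11, 12, 10], "msmarco-distilbert-base-v3": [12, 13]},
}
scores = {1: {10: 9.0, 11: 8.5, 12: 2.0, 13: -1.0}}
print(select_negatives(record, scores))  # [12, 13]
```

With the real files, the same functions would be called as `load_scores("cross-encoder-ms-marco-MiniLM-L-6-v2-scores.pkl.gz")` and `read_hard_negatives("msmarco-hard-negatives.jsonl.gz")`. Note that the score dictionary holds 160 million entries, so loading it needs a correspondingly large amount of memory.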

Provider: maas
Created: 2025-01-06