msmarco-hard-negatives-llm-scores
Source: ModelScope community. Updated 2025-12-05; indexed 2025-08-16.
Download link:
https://modelscope.cn/datasets/opensearch-project/msmarco-hard-negatives-llm-scores
Resource description:
# Dataset Card for **MS MARCO Hard Negatives LLM Scores (OpenSearch)**
## Dataset Summary
This dataset is derived from the **MS MARCO** train split ([Hugging Face](https://huggingface.co/datasets/mteb/msmarco)) and provides **hard-negative mining** annotations for training retrieval systems. For each query in the source split, we retrieve the **top-100 candidate documents** with the [opensearch-project/opensearch-neural-sparse-encoding-doc-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v1) model and attach **re-ranking scores** from bi-encoder and cross-encoder teachers: [opensearch-project/opensearch-neural-sparse-encoding-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1), [Alibaba-NLP/gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5), [BAAI/bge-en-icl](https://huggingface.co/BAAI/bge-en-icl), [cross-encoder/ms-marco-MiniLM-L12-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L12-v2), [BAAI/bge-reranker-v2-minicpm-layerwise](https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise), and [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight).
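Scores from different teachers usually live on different scales, so they often need to be normalized before they can be combined into a single distillation target. The sketch below shows one common approach (per-query min-max normalization followed by averaging) on toy data. It is not necessarily the procedure used to build or consume this dataset, and the teacher names `teacher_a` and `teacher_b` are hypothetical placeholders, not actual column names.

```python
from typing import Dict, List


def min_max(scores: List[float]) -> List[float]:
    """Rescale one teacher's scores for a single query to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # all candidates tied: this teacher gives no ranking signal
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_scores(teacher_scores: Dict[str, List[float]]) -> List[float]:
    """Average the min-max normalized scores of all teachers, per candidate."""
    normalized = [min_max(s) for s in teacher_scores.values()]
    return [sum(column) / len(normalized) for column in zip(*normalized)]


# Toy record: three candidate documents scored by two hypothetical teachers.
toy = {
    "teacher_a": [12.0, 8.0, 4.0],  # e.g. unbounded dot-product scores
    "teacher_b": [1.0, 0.0, 0.75],  # e.g. scores already close to [0, 1]
}
combined = ensemble_scores(toy)

# Candidate indices ordered from highest to lowest combined score.
ranking = sorted(range(len(combined)), key=combined.__getitem__, reverse=True)
print(combined, ranking)
```

Normalizing per query keeps a teacher with large raw values from dominating the average; other schemes (z-scores, rank fusion, weighted sums) are equally plausible choices.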
> ⚠️ **Licensing/Usage:** Because this dataset is derived from MS MARCO, please review Microsoft’s terms before using this dataset. ([Microsoft GitHub](https://microsoft.github.io/msmarco/Datasets.html), [GitHub](https://github.com/microsoft/msmarco))
---
## How to Load
```python
import datasets
ds = datasets.load_dataset("opensearch-project/msmarco-hard-negatives-llm-scores", split="train")
```
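This card does not list the dataset's columns, so it can help to inspect a record before writing training code. The helper below is a minimal sketch that summarizes the fields of any dict-like record. With the real dataset you would pass `ds[0]`; here a stand-in record is used so the snippet runs offline. Only `query` and `docs` are taken from this card; `some_teacher_scores` is a made-up placeholder.

```python
from typing import Any, Dict


def describe_record(record: Dict[str, Any]) -> Dict[str, str]:
    """Map each field name to a short description of its type and size."""
    summary = {}
    for name, value in record.items():
        if isinstance(value, (list, tuple)):
            summary[name] = f"{type(value).__name__} of {len(value)} items"
        else:
            summary[name] = type(value).__name__
    return summary


# Stand-in record (hypothetical values); replace with `ds[0]` after loading.
example = {
    "query": "q1",
    "docs": ["d1", "d2", "d3"],
    "some_teacher_scores": [0.9, 0.4, 0.1],  # placeholder column name
}
for field, description in describe_record(example).items():
    print(f"{field}: {description}")
```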
---
## Training example
Related training example: **opensearch-sparse-model-tuning-sample**. ([GitHub](https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample))
To convert the dataset to the text-only format used for training with the sample repository:
```python
import datasets

# 1) Load datasets
msmarco_hard_negatives = datasets.load_dataset(
    "opensearch-project/msmarco-hard-negatives-llm-scores", split="train"
)
msmarco_queries = datasets.load_dataset("BeIR/msmarco", "queries")["queries"]
msmarco_corpus = datasets.load_dataset("BeIR/msmarco", "corpus")["corpus"]


# 2) Fix occasional text encoding issues
def transform_str(s):
    try:
        s = s.encode("latin1").decode("utf-8")
        return s
    except Exception:
        return s


msmarco_corpus = msmarco_corpus.map(
    lambda x: {"text": transform_str(x["text"])}, num_proc=30
)

# 3) Build convenient lookup tables
id_to_text = {_id: text for _id, text in zip(msmarco_corpus["_id"], msmarco_corpus["text"])}
qid_to_text = {_id: text for _id, text in zip(msmarco_queries["_id"], msmarco_queries["text"])}

# 4) Replace IDs with raw texts to get a text-only dataset
msmarco_hard_negatives = msmarco_hard_negatives.map(
    lambda x: {
        "query": qid_to_text[x["query"]],
        "docs": [id_to_text[doc] for doc in x["docs"]],
    },
    num_proc=30,
)

# 5) Save to disk (directory will contain the text-only view)
msmarco_hard_negatives.save_to_disk("data/msmarco_ft_llm_scores")
```
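Step 2 above relies on a standard trick for repairing text whose UTF-8 bytes were mistakenly decoded as Latin-1. The sketch below reproduces the same round trip on a small, deliberately corrupted string; the sample strings are illustrative and not taken from the corpus.

```python
def transform_str(s: str) -> str:
    """Undo UTF-8 text that was mis-decoded as Latin-1; otherwise return s unchanged."""
    try:
        return s.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or the bytes are not valid UTF-8:
        # the string was not corrupted in this particular way.
        return s


# Simulate the corruption: UTF-8 bytes read back as Latin-1.
garbled = "café".encode("utf-8").decode("latin1")

print(repr(garbled))                 # the corrupted form
print(repr(transform_str(garbled)))  # repaired
print(repr(transform_str("café")))   # already correct text is left alone
```

The repair is a heuristic: a string that legitimately contains Latin-1 characters forming a valid UTF-8 byte sequence would also be rewritten, which is rare in English passages but worth keeping in mind.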
---
## Citation
If you use this dataset, **please cite**:
[Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers](https://arxiv.org/abs/2411.04403)
```
@misc{geng2024competitivesearchrelevanceinferencefree,
title={Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers},
author={Zhichao Geng and Dongyu Ru and Yang Yang},
year={2024},
eprint={2411.04403},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2411.04403},
}
```
## Related Papers
- [Exploring $\ell_0$ Sparsification for Inference-free Sparse Retrievers](https://arxiv.org/abs/2504.14839)
## License
This project is licensed under the [Apache v2.0 License](https://github.com/opensearch-project/neural-search/blob/main/LICENSE).
---
## Copyright
Copyright OpenSearch Contributors. See [NOTICE](https://github.com/opensearch-project/neural-search/blob/main/NOTICE) for details.
Provider: maas
Created: 2025-08-13
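Background overview:
This dataset is based on the MS MARCO train split and is intended for training retrieval systems through hard-negative mining. For each query it provides the top-100 candidate documents, together with re-ranking scores from multiple teacher models.
This summary was collected and generated by 遇见数据集.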



