msmarco-hard-negatives-llm-scores

Name: msmarco-hard-negatives-llm-scores
Creator: maas
Published: 2025-12-05 16:46:00
License: No description available

ModelScope community · updated 2025-12-05 · indexed 2025-08-16

Download link:

https://modelscope.cn/datasets/opensearch-project/msmarco-hard-negatives-llm-scores


Resource description:

# Dataset Card for **MS MARCO Hard Negatives LLM Scores (OpenSearch)**

## Dataset Summary

This dataset is derived from the **MS MARCO** train split ([Hugging Face](https://huggingface.co/datasets/mteb/msmarco)) and provides **hard-negative mining** annotations for training retrieval systems. For each query in the source split, we retrieve the **top-100 candidate documents** using [opensearch-project/opensearch-neural-sparse-encoding-doc-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v1) and attach **re-ranking scores** from the following bi-encoder and cross-encoder teachers:

- [opensearch-project/opensearch-neural-sparse-encoding-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1)
- [Alibaba-NLP/gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5)
- [BAAI/bge-en-icl](https://huggingface.co/BAAI/bge-en-icl)
- [cross-encoder/ms-marco-MiniLM-L12-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L12-v2)
- [BAAI/bge-reranker-v2-minicpm-layerwise](https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise)
- [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight)

> ⚠️ **Licensing/Usage:** Because this dataset is derived from MS MARCO, please review Microsoft’s terms before using it. ([Microsoft GitHub](https://microsoft.github.io/msmarco/Datasets.html), [GitHub](https://github.com/microsoft/msmarco))

---

## How to Load

```python
import datasets

ds = datasets.load_dataset("opensearch-project/msmarco-hard-negatives-llm-scores", split="train")
```

---

## Training example

Related training example: **opensearch-sparse-model-tuning-sample** ([GitHub](https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample)).

To convert the dataset to the text-only format used for training in the sample repo:

```python
import datasets

# 1) Load datasets
msmarco_hard_negatives = datasets.load_dataset(
    "opensearch-project/msmarco-hard-negatives-llm-scores", split="train"
)
msmarco_queries = datasets.load_dataset("BeIR/msmarco", "queries")["queries"]
msmarco_corpus = datasets.load_dataset("BeIR/msmarco", "corpus")["corpus"]


# 2) Fix occasional text encoding issues
#    (UTF-8 text that was decoded as Latin-1; other strings are returned unchanged)
def transform_str(s):
    try:
        s = s.encode("latin1").decode("utf-8")
        return s
    except Exception:
        return s


# num_proc is the number of worker processes; adjust it to your machine
msmarco_corpus = msmarco_corpus.map(
    lambda x: {"text": transform_str(x["text"])}, num_proc=30
)

# 3) Build convenient lookup tables
id_to_text = {_id: text for _id, text in zip(msmarco_corpus["_id"], msmarco_corpus["text"])}
qid_to_text = {_id: text for _id, text in zip(msmarco_queries["_id"], msmarco_queries["text"])}

# 4) Replace IDs with raw texts to get a text-only dataset
msmarco_hard_negatives = msmarco_hard_negatives.map(
    lambda x: {
        "query": qid_to_text[x["query"]],
        "docs": [id_to_text[doc] for doc in x["docs"]],
    },
    num_proc=30,
)

# 5) Save to disk (directory will contain the text-only view)
msmarco_hard_negatives.save_to_disk("data/msmarco_ft_llm_scores")
```

---

## Citation

If you use this dataset, **please cite**: [Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers](https://arxiv.org/abs/2411.04403)

```
@misc{geng2024competitivesearchrelevanceinferencefree,
  title={Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers},
  author={Zhichao Geng and Dongyu Ru and Yang Yang},
  year={2024},
  eprint={2411.04403},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2411.04403},
}
```

## Related Papers

- [Exploring $\ell_0$ Sparsification for Inference-free Sparse Retrievers](https://arxiv.org/abs/2504.14839)

## License

This project is licensed under the [Apache v2.0 License](https://github.com/opensearch-project/neural-search/blob/main/LICENSE).

---

## Copyright

Copyright OpenSearch Contributors. See [NOTICE](https://github.com/opensearch-project/neural-search/blob/main/NOTICE) for details.
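
The `transform_str` step in the conversion script appears to target text that was UTF-8 encoded but later decoded as Latin-1 (so-called mojibake). As a minimal, self-contained sketch of that round-trip, the snippet below runs the same logic on a few made-up strings. The samples are illustrative only and are not taken from the corpus, and the narrower exception types are a choice made for this sketch.

```python
def transform_str(s: str) -> str:
    """Undo UTF-8-read-as-Latin-1 mojibake; return the input unchanged on failure."""
    try:
        # Re-encode to the original bytes, then decode them as UTF-8.
        return s.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes
        # are not valid UTF-8, so the string was not mojibake of this kind.
        return s


# Illustrative inputs (hypothetical, not from MS MARCO):
samples = {
    "garbled": "cafÃ©",            # UTF-8 bytes of "café" misread as Latin-1
    "ascii": "plain ASCII text",   # round-trips to itself
    "already_ok": "café",          # lone 0xE9 byte is invalid UTF-8, left as-is
    "non_latin": "日本語",          # cannot be encoded as Latin-1, left as-is
}

repaired = {name: transform_str(text) for name, text in samples.items()}
for name, text in repaired.items():
    print(f"{name}: {text}")
```

Only the first sample changes; the fallback branch is what makes it safe to map the function over the whole corpus, since text that is already correct passes through untouched.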

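
The dataset summary says each candidate document carries re-ranking scores from several teachers. Teachers of different kinds usually emit scores on different scales, so one plausible way to use them together as a distillation target is to normalize each teacher's scores per query and then average them. The sketch below is an assumption about usage, not a description of how the dataset authors combine scores. The teacher names and score values are invented for illustration, and the real score columns should be looked up with `ds.column_names` after loading.

```python
from statistics import mean


def min_max(scores):
    """Rescale one teacher's scores for a single query into [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All candidates tied: no ranking signal from this teacher.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble(teacher_scores):
    """Average per-teacher normalized scores, candidate by candidate."""
    normalized = [min_max(scores) for scores in teacher_scores.values()]
    return [mean(column) for column in zip(*normalized)]


# Hypothetical scores for three candidate documents of one query.
example = {
    "teacher_a": [12.0, 8.0, 4.0],   # made-up values on one scale
    "teacher_b": [3.0, -1.0, 2.0],   # made-up values on another scale
}

combined = ensemble(example)
# Candidate indices ordered from highest to lowest combined score.
ranking = sorted(range(len(combined)), key=combined.__getitem__, reverse=True)
print(combined, ranking)
```

Per-query min-max scaling is only one option; z-score normalization or rank-based fusion would slot into the same structure by swapping out `min_max`.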

Provider:

maas

Created:

2025-08-13
