msmarco-hard-negatives-llm-scores

ModelScope community · Updated 2025-12-05 · Indexed 2025-08-16
Download link:
https://modelscope.cn/datasets/opensearch-project/msmarco-hard-negatives-llm-scores
Description:
# Dataset Card for **MS MARCO Hard Negatives LLM Scores (OpenSearch)**

## Dataset Summary

This dataset is derived from the **MS MARCO** train split ([Hugging Face](https://huggingface.co/datasets/mteb/msmarco)) and provides **hard-negative mining** annotations to train retrieval systems. For each query from the source split, we retrieve the **top-100 candidate documents** using the [opensearch-project/opensearch-neural-sparse-encoding-doc-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v1) and attach **re-ranking scores** from bi-encoder teachers and cross-encoder teachers:

- [opensearch-project/opensearch-neural-sparse-encoding-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1)
- [Alibaba-NLP/gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5)
- [BAAI/bge-en-icl](https://huggingface.co/BAAI/bge-en-icl)
- [cross-encoder/ms-marco-MiniLM-L12-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L12-v2)
- [BAAI/bge-reranker-v2-minicpm-layerwise](https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise)
- [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight)

> ⚠️ **Licensing/Usage:** Because this dataset is derived from MS MARCO, please review Microsoft’s terms before using this dataset. ([Microsoft GitHub](https://microsoft.github.io/msmarco/Datasets.html), [GitHub](https://github.com/microsoft/msmarco))

---

## How to Load

```python
import datasets

ds = datasets.load_dataset("opensearch-project/msmarco-hard-negatives-llm-scores", split="train")
```

---

## Training example

Related training example: **opensearch-sparse-model-tuning-sample**. ([GitHub](https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample))

To convert the dataset to text-only format for sample repo training:

```python
import datasets

# 1) Load datasets
msmarco_hard_negatives = datasets.load_dataset(
    "opensearch-project/msmarco-hard-negatives-llm-scores", split="train"
)
msmarco_queries = datasets.load_dataset("BeIR/msmarco", "queries")["queries"]
msmarco_corpus = datasets.load_dataset("BeIR/msmarco", "corpus")["corpus"]

# 2) Fix occasional text encoding issues
def transform_str(s):
    try:
        s = s.encode("latin1").decode("utf-8")
        return s
    except Exception:
        return s

msmarco_corpus = msmarco_corpus.map(
    lambda x: {"text": transform_str(x["text"])}, num_proc=30
)

# 3) Build convenient lookup tables
id_to_text = {_id: text for _id, text in zip(msmarco_corpus["_id"], msmarco_corpus["text"])}
qid_to_text = {_id: text for _id, text in zip(msmarco_queries["_id"], msmarco_queries["text"])}

# 4) Replace IDs with raw texts to get a text-only dataset
msmarco_hard_negatives = msmarco_hard_negatives.map(
    lambda x: {
        "query": qid_to_text[x["query"]],
        "docs": [id_to_text[doc] for doc in x["docs"]],
    },
    num_proc=30,
)

# 5) Save to disk (directory will contain the text-only view)
msmarco_hard_negatives.save_to_disk("data/msmarco_ft_llm_scores")
```

---

## Citation

If you use this dataset, **please cite**:

[Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers](https://arxiv.org/abs/2411.04403)

```
@misc{geng2024competitivesearchrelevanceinferencefree,
      title={Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers},
      author={Zhichao Geng and Dongyu Ru and Yang Yang},
      year={2024},
      eprint={2411.04403},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2411.04403},
}
```

## Related Papers

- [Exploring $\ell_0$ Sparsification for Inference-free Sparse Retrievers](https://arxiv.org/abs/2504.14839)

## License

This project is licensed under the [Apache v2.0 License](https://github.com/opensearch-project/neural-search/blob/main/LICENSE).

---

## Copyright

Copyright OpenSearch Contributors. See [NOTICE](https://github.com/opensearch-project/neural-search/blob/main/NOTICE) for details.
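
## Note on the encoding fix

Step 2 of the conversion script in the training example repairs mojibake, that is, text whose UTF-8 bytes were decoded as Latin-1 somewhere upstream. The sketch below is a copy of `transform_str` with a narrower `except` clause, run on a few made-up strings (the sample inputs are illustrative and not taken from the corpus). It is meant only to show which inputs the round trip changes and which it leaves alone.

```python
def transform_str(s: str) -> str:
    """Undo UTF-8 text that was mis-decoded as Latin-1.

    If the round trip fails, the input is returned unchanged.
    """
    try:
        return s.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s


# Illustrative inputs (assumptions, not drawn from MS MARCO).
samples = {
    "mojibake": "cafÃ©",     # "café" whose UTF-8 bytes were read as Latin-1
    "already_ok": "café",    # correct text: the UTF-8 decode fails, so it is kept
    "plain_ascii": "hello",  # ASCII is identical in both encodings
    "non_latin1": "日本語",   # cannot be encoded as Latin-1, so it is kept
}

for name, text in samples.items():
    print(f"{name}: {text!r} -> {transform_str(text)!r}")
```

Only the first sample is rewritten; the other three come back unchanged, which is why the function can be mapped over the whole corpus without damaging text that is already correct.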

Provider:
maas
Created:
2025-08-13
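Background and challenges
Background overview
This dataset is built on the MS MARCO train split and is intended for training retrieval systems through hard-negative mining. For each query it provides the top-100 candidate documents, together with re-ranking scores from several teacher models.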
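Because each candidate carries scores from several teachers, and different teachers generally score on different scales, one common way to use such scores for distillation is to normalize each teacher's scores per query and then average them. The following is a minimal sketch under stated assumptions: the teacher names and score values are made up, the dictionary layout is hypothetical (check the dataset's actual column names before adapting it), and min-max averaging is just one possible combination rule, not necessarily the one used by the dataset authors.

```python
from typing import Dict, List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale one teacher's scores for a single query into [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All candidates tie for this teacher; it carries no ranking signal.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_scores(teacher_scores: Dict[str, List[float]]) -> List[float]:
    """Average per-teacher normalized scores, one value per candidate."""
    normalized = [min_max_normalize(s) for s in teacher_scores.values()]
    n_docs = len(normalized[0])
    if any(len(s) != n_docs for s in normalized):
        raise ValueError("every teacher must score the same candidates")
    return [sum(column) / len(normalized) for column in zip(*normalized)]


# Hypothetical scores for three candidates of one query.
example = {
    "teacher_a": [12.0, 8.0, 4.0],
    "teacher_b": [3.0, -1.0, 2.0],
}

combined = ensemble_scores(example)
# Candidate indices ordered from most to least relevant under the ensemble.
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(combined)
print(ranking)
```

Normalizing per query rather than globally keeps a teacher with a wide raw score range from dominating the average.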