lumees/ms-marco-tr-hard-negatives

Name: lumees/ms-marco-tr-hard-negatives
Creator: lumees
Published: 2025-11-27 13:29:05
License: MIT (per the dataset card metadata)

Hugging Face · Updated 2025-11-27 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/lumees/ms-marco-tr-hard-negatives

Resource description:

---
language:
- tr
tags:
- search
- information-retrieval
- sentence-transformers
- msmarco
- hard-negatives
license: mit
task_categories:
- text-retrieval
source_datasets:
- parsak/msmarco-tr
size_categories:
- 100K<n<1M
---

# MS MARCO TR - Hard Negatives Dataset

## Dataset Description

* **Source Repository:** Derived from [parsak/msmarco-tr](https://huggingface.co/datasets/parsak/msmarco-tr).
* **Language:** Turkish (`tr`)
* **Task:** Semantic Search, Dense Retrieval, Embedding Training
* **Size:** ~500k - 1M training triplets (depending on processed queries)
* **Authors:** Lumees AI, Hasan Kurşun, Kerem Berkay Yanık
* **Year:** 2025
* **Website:** [lumees.io](https://lumees.io)

### Dataset Summary

This dataset contains **hard negatives** mined for the Turkish MS MARCO dataset. It is designed for training or fine-tuning sentence embedding models (e.g., SBERT) for Turkish information retrieval tasks.

[Figure: vector space diagram showing a query, its positive, a hard negative, and a random negative]

Unlike standard random negatives, these "hard" negatives are passages that have **high vector similarity** to the query but are **not the labeled correct answer**. Training on this data pushes the model to learn subtle semantic distinctions, which is intended to improve retrieval performance.

### Creation Process

1. **Source Data:** Training queries and passages were taken from the `parsak/msmarco-tr` dataset (machine-translated MS MARCO).
2. **Mining Model:** The [emrecan/bert-base-turkish-cased-mean-nli-stsb-tr](https://huggingface.co/emrecan/bert-base-turkish-cased-mean-nli-stsb-tr) model was used.
3. **Method:**
   * **Encoding:** All queries and passages were converted into dense vectors using the mining model.
   * **Retrieval:** For each query, the 100 most similar passages were retrieved using **Faiss** (inner product).
   * **Filtering:** The true positive (correct answer) was removed from the results.
   * **Safety Threshold:** Passages with a similarity score higher than **0.98** were discarded to reduce "false negatives" (correct answers accidentally labeled as negative).
   * **Selection:** From the remaining candidates, the **10** highest-scoring passages were selected as hard negatives.

---

## Data Structure

### Data Examples

Each line in the `.jsonl` file is a valid JSON object representing a training example.

```json
{
  "query": "Manhattan projesinin başarısının hemen etkisi neydi?",
  "pos": "Manhattan Projesi ve atom bombası, İkinci Dünya Savaşı'nın sona ermesine yardımcı oldu...",
  "negatives": [
    "Manhattan Projesi, II. Dünya Savaşı sırasında ilk atom bombasını geliştirmek için...",
    "Proje, nükleer silah üretimi üzerine odaklanmıştı...",
    "..."
  ],
  "scores": [
    0.874,
    0.852,
    "..."
  ]
}
```

### Data Fields

* **`query`** (string): The search query.
* **`pos`** (string): The true positive passage (correct answer).
* **`negatives`** (list of strings): A list of 10 passages that are semantically close to the query but incorrect, sorted by similarity (highest to lowest).
* **`scores`** (list of floats): Cosine similarity scores corresponding to the passages in the `negatives` list. Useful for margin-based filtering or weighted loss functions during training.

---

## Usage Guide

### Loading the Dataset (Python)

You can load this dataset with the Hugging Face `datasets` library or by reading the JSON Lines file directly.

```python
from datasets import load_dataset

# Load from the Hugging Face Hub
ds = load_dataset("lumees/ms-marco-tr-hard-negatives", split="train")

# Alternatively, load from a local file
ds = load_dataset("json", data_files="msmarco_tr_hard_negatives_final.jsonl", split="train")

print(ds[0])
```

### Training with Sentence Transformers

This dataset is intended for contrastive loss functions such as `MultipleNegativesRankingLoss` or `InfoNCE`.
```python
from sentence_transformers import InputExample

# `ds` is the dataset loaded in the previous section.
train_examples = []
for row in ds:
    # Structure: [Query, Positive, Negative1, Negative2, ...]
    texts = [row["query"], row["pos"]] + row["negatives"]
    train_examples.append(InputExample(texts=texts))

# Note: Ensure the loss function you are using supports multiple negatives per example.
```

---

## Limitations & Bias

1. **Translation Errors:** The original `parsak/msmarco-tr` dataset was created via machine translation from English. Some Turkish expressions may therefore be unnatural or contain translation errors.
2. **False Negatives:** Despite the `0.98` similarity filter, some passages selected as "negatives" may actually be correct answers that were not labeled in the original dataset.
3. **Model Bias:** The negatives were mined using the `emrecan/bert-base-turkish-cased-mean-nli-stsb-tr` model. The dataset reflects the biases and semantic understanding of this model.

## Citation

If you use this dataset, please cite Lumees AI, the original MS MARCO authors, and the Turkish translation source as follows:

```bibtex
@misc{lumees_msmarco_hn_2025,
  author = {Lumees AI and Kurşun, Hasan and Yanık, Kerem Berkay},
  title = {MS MARCO TR - Hard Negatives Dataset},
  year = {2025},
  howpublished = {\url{https://lumees.io}},
}

@article{bajaj2016ms,
  title={MS MARCO: A Human Generated Machine Reading Comprehension Dataset},
  author={Bajaj, Payal and Campos, Daniel and Craswell, Nick and Deng, Li and Gao, Jianfeng and Liu, Xiaodong and Majumder, Rangan and McNamara, Andrew and Mitra, Bhaskar and Nguyen, Tri and others},
  journal={arXiv preprint arXiv:1611.09268},
  year={2016}
}

@misc{parsak_msmarco_tr,
  author = {Parsak},
  title = {MS MARCO Turkish Translation},
  year = {2023},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/datasets/parsak/msmarco-tr}}
}
```
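
## Appendix: Sketch of the Mining Steps

The steps listed under Creation Process (retrieve the top 100 passages, drop the labeled positive, discard scores above 0.98, keep the 10 best) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it substitutes a brute-force NumPy matrix product for Faiss, it assumes vectors are L2-normalized so that inner product equals cosine similarity, and the function name `mine_hard_negatives` and all of its parameters are hypothetical.

```python
import numpy as np


def mine_hard_negatives(query_vecs, passage_vecs, positive_ids,
                        top_k=100, max_score=0.98, num_negatives=10):
    """Return, for each query, a list of (passage_id, score) hard negatives."""
    # Assumption: normalize to unit length so inner product == cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)

    # Brute-force stand-in for a Faiss inner-product search.
    scores = q @ p.T

    results = []
    for qi, pos_id in enumerate(positive_ids):
        # Retrieval: the top_k most similar passages, best first.
        order = np.argsort(-scores[qi])[:top_k]
        kept = [
            (int(pid), float(scores[qi, pid]))
            for pid in order
            # Filtering: drop the labeled positive.
            # Safety threshold: drop scores above max_score.
            if pid != pos_id and scores[qi, pid] <= max_score
        ]
        # Selection: keep the highest-scoring remaining candidates.
        results.append(kept[:num_negatives])
    return results


# Toy data: one query and five passages in a 2-D embedding space.
passages = np.array([
    [1.0, 0.0],   # 0: labeled positive for the query
    [1.0, 0.01],  # 1: near-duplicate, removed by the 0.98 threshold
    [0.8, 0.6],   # 2: hard negative
    [0.6, 0.8],   # 3: weaker negative
    [0.0, 1.0],   # 4: unrelated passage
])
queries = np.array([[1.0, 0.0]])

result = mine_hard_negatives(queries, passages, positive_ids=[0], num_negatives=2)
print(result[0])
```

On the toy data, passage 0 is excluded as the positive and passage 1 is excluded by the threshold, so the two selected negatives are passages 2 and 3. At the scale of the full dataset, an approximate or batched index such as Faiss would be needed, since the dense score matrix used here would not fit in memory.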

Provider:

lumees