lumees/ms-marco-tr-hard-negatives
Hugging Face · Updated 2025-11-27 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/lumees/ms-marco-tr-hard-negatives
Resource description:
---
language:
- tr
tags:
- search
- information-retrieval
- sentence-transformers
- msmarco
- hard-negatives
license: mit
task_categories:
- text-retrieval
source_datasets:
- parsak/msmarco-tr
size_categories:
- 100K<n<1M
---
# MS MARCO TR - Hard Negatives Dataset
## Dataset Description
* **Source Repository:** Derived from [parsak/msmarco-tr](https://huggingface.co/datasets/parsak/msmarco-tr).
* **Language:** Turkish (`tr`)
* **Task:** Semantic Search, Dense Retrieval, Embedding Training
* **Size:** ~500k to 1M training triplets (depending on the number of processed queries)
* **Authors:** Lumees AI, Hasan Kurşun, Kerem Berkay Yanık
* **Year:** 2025
* **Website:** [lumees.io](https://lumees.io)
### Dataset Summary
This dataset contains **Hard Negatives** specifically mined for the Turkish MS MARCO dataset. It is designed for training or fine-tuning sentence embedding models (e.g., SBERT) for Turkish Information Retrieval tasks.
[Image: vector-space diagram showing a query, its positive, a hard negative, and a random negative]
Unlike standard random negatives, these "hard" negatives are passages that have **high semantic similarity** (high vector similarity) to the query but are **not the correct answer**. Training on them pushes the model to learn subtle semantic distinctions, which typically improves retrieval performance over training with random negatives alone.
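To illustrate why hard negatives give a stronger training signal, here is a minimal sketch of an InfoNCE-style loss for a single query. The similarity values are made up for illustration (they are not taken from the dataset), and `scale=20.0` is an assumption borrowed from the default of `MultipleNegativesRankingLoss` in Sentence Transformers.

```python
import numpy as np

def info_nce_loss(pos_sim, neg_sims, scale=20.0):
    """Cross-entropy of the positive against all candidates (InfoNCE-style)."""
    logits = scale * np.concatenate(([pos_sim], np.asarray(neg_sims, dtype=float)))
    logits -= logits.max()  # stabilise the softmax numerically
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_pos)

# Made-up cosine similarities for one query (illustrative only).
pos_sim = 0.90
random_negs = [0.10, 0.05, 0.12]  # unrelated passages
hard_negs = [0.87, 0.85, 0.84]    # close to the query, but wrong

loss_random = info_nce_loss(pos_sim, random_negs)
loss_hard = info_nce_loss(pos_sim, hard_negs)
print(f"random negatives: {loss_random:.4f}")
print(f"hard negatives:   {loss_hard:.4f}")
```

With random negatives the loss is already close to zero, so the model receives almost no gradient. The hard negatives keep the loss well above zero, which is the signal that drives the finer distinctions.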
### Creation Process
1. **Source Data:** Training queries and passages were taken from the `parsak/msmarco-tr` dataset (machine-translated MS MARCO).
2. **Mining Model:** The [emrecan/bert-base-turkish-cased-mean-nli-stsb-tr](https://huggingface.co/emrecan/bert-base-turkish-cased-mean-nli-stsb-tr) model was used.
3. **Method:**
    * **Encoding:** All queries and passages were converted into dense vectors using the mining model.
    * **Retrieval:** For each query, the top 100 similar passages were retrieved using **Faiss** (Inner Product).
    * **Filtering:** The true positive (correct answer) was removed from the results.
    * **Safety Threshold:** Passages with a similarity score higher than **0.98** were discarded to prevent "False Negatives" (correct answers accidentally labeled as negative).
    * **Selection:** From the remaining candidates, the top **10** passages with the highest scores were selected as Hard Negatives.
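The retrieval, filtering, threshold, and selection steps above can be sketched as follows. This is not the authors' mining script: it swaps the Faiss inner-product index for a brute-force NumPy matrix product so that it runs without extra dependencies, and the function name `mine_hard_negatives`, its parameters, and the toy vectors are all hypothetical. The toy vectors are unit-normalised so that inner product equals cosine similarity; whether the original pipeline normalised its embeddings is not stated.

```python
import numpy as np

def mine_hard_negatives(query_vecs, passage_vecs, positive_ids,
                        top_k=100, max_score=0.98, num_negatives=10):
    """Return, per query, a list of (passage_index, score) hard negatives."""
    # Inner product between every query and every passage (brute force).
    scores = query_vecs @ passage_vecs.T
    results = []
    for q_idx, row in enumerate(scores):
        # Retrieval: indices of the top_k highest-scoring passages, best first.
        candidates = np.argsort(-row)[:top_k]
        kept = []
        for p_idx in candidates:
            if p_idx == positive_ids[q_idx]:
                continue  # Filtering: drop the labelled positive
            if row[p_idx] > max_score:
                continue  # Safety threshold: likely an unlabelled correct answer
            kept.append((int(p_idx), float(row[p_idx])))
            if len(kept) == num_negatives:
                break  # Selection: candidates are already sorted by score
        results.append(kept)
    return results

# Toy data: 6 random unit vectors as passages, 2 queries near passages 0 and 3.
rng = np.random.default_rng(0)
passages = rng.normal(size=(6, 8))
passages /= np.linalg.norm(passages, axis=1, keepdims=True)
queries = passages[[0, 3]] + 0.05 * rng.normal(size=(2, 8))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

mined = mine_hard_negatives(queries, passages, positive_ids=[0, 3],
                            top_k=5, num_negatives=2)
for q_idx, negatives in enumerate(mined):
    print(q_idx, negatives)
```

At the scale of the real corpus, a brute-force product would not fit in memory, which is presumably why an approximate or batched index such as Faiss was used.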
---
## Data Structure
### Data Examples
Each line in the `.jsonl` file is a valid JSON object representing a training example.
```json
{
  "query": "Manhattan projesinin başarısının hemen etkisi neydi?",
  "pos": "Manhattan Projesi ve atom bombası, İkinci Dünya Savaşı'nın sona ermesine yardımcı oldu...",
  "negatives": [
    "Manhattan Projesi, II. Dünya Savaşı sırasında ilk atom bombasını geliştirmek için...",
    "Proje, nükleer silah üretimi üzerine odaklanmıştı...",
    "..."
  ],
  "scores": [
    0.874,
    0.852,
    "..."
  ]
}
```
### Data Fields
* **`query`** (string): The search query.
* **`pos`** (string): The true positive passage (correct answer).
* **`negatives`** (list of strings): A list of 10 passages that are semantically close to the query but incorrect. Sorted by similarity (highest to lowest).
* **`scores`** (list of floats): Cosine similarity scores corresponding to the passages in the `negatives` list. Useful for margin-based filtering or weighted loss functions during training.
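As one way to use the `scores` field, the sketch below drops negatives whose mining score exceeds a stricter cutoff than the built-in 0.98, which can reduce the risk of training on unlabelled correct answers. The helper name `filter_negatives` and the `max_score=0.9` cutoff are hypothetical choices for illustration, not values recommended by the dataset authors, and the example row uses placeholder strings.

```python
def filter_negatives(example, max_score=0.9):
    """Keep only negatives whose mining score is at most max_score."""
    kept = [
        (neg, score)
        for neg, score in zip(example["negatives"], example["scores"])
        if score <= max_score
    ]
    return {
        **example,
        "negatives": [neg for neg, _ in kept],
        "scores": [score for _, score in kept],
    }

# Placeholder row with the same fields as the dataset.
example = {
    "query": "q",
    "pos": "p",
    "negatives": ["n1", "n2", "n3"],
    "scores": [0.95, 0.87, 0.80],
}
filtered = filter_negatives(example, max_score=0.9)
print(filtered["negatives"], filtered["scores"])
```

Because the function takes and returns a plain dict, it should also work as the callable passed to `Dataset.map` from the `datasets` library.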
-----
## Usage Guide
### Loading the Dataset (Python)
You can load this dataset using the Hugging Face `datasets` library or standard JSON line reading methods.
```python
from datasets import load_dataset

# Option 1: load from the Hugging Face Hub
ds = load_dataset("lumees/ms-marco-tr-hard-negatives", split="train")

# Option 2: load from a local file instead
# ds = load_dataset("json", data_files="msmarco_tr_hard_negatives_final.jsonl", split="train")

print(ds[0])
```
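For the "standard JSON line reading" route, a minimal sketch using only the Python standard library might look like this. The helper name `read_jsonl` is hypothetical, and the demo writes a throwaway file with one made-up record rather than touching the real dataset file.

```python
import json
import tempfile
from pathlib import Path

def read_jsonl(path):
    """Yield one training example (a dict) per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demo with a throwaway file holding one made-up record.
record = {
    "query": "örnek sorgu",
    "pos": "doğru pasaj",
    "negatives": ["yanlış pasaj"],
    "scores": [0.87],
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    path.write_text(json.dumps(record, ensure_ascii=False) + "\n", encoding="utf-8")
    rows = list(read_jsonl(path))

print(rows[0]["query"])
```

Reading with an explicit `encoding="utf-8"` matters here because the Turkish text contains non-ASCII characters.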
### Training with Sentence Transformers
This dataset is well suited to contrastive losses such as `MultipleNegativesRankingLoss` (an InfoNCE-style loss).
```python
from sentence_transformers import InputExample

train_examples = []
for row in ds:
    # Structure: [Query, Positive, Negative1, Negative2, ...]
    texts = [row['query'], row['pos']] + row['negatives']
    train_examples.append(InputExample(texts=texts))

# Note: Ensure the loss function you are using supports multiple negatives per example.
```
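If the loss you use accepts only a single negative per example, one option is to expand each row into separate (query, positive, negative) triplets. The sketch below assumes only the field names documented above; the helper `to_triplets` and its `max_negatives` parameter are hypothetical, and the demo row uses placeholder strings.

```python
def to_triplets(row, max_negatives=None):
    """Expand one dataset row into (query, positive, negative) tuples."""
    # Negatives are sorted hardest first, so slicing keeps the hardest ones.
    negatives = row["negatives"][:max_negatives]
    return [(row["query"], row["pos"], neg) for neg in negatives]

# Placeholder row with the same fields as the dataset.
row = {
    "query": "q",
    "pos": "p",
    "negatives": ["n1", "n2", "n3"],
    "scores": [0.91, 0.87, 0.80],
}
all_triplets = to_triplets(row)
top_two = to_triplets(row, max_negatives=2)
print(top_two)
```

Limiting `max_negatives` trades training-set size against how many of the mined negatives each query contributes.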
-----
## Limitations & Bias
1. **Translation Errors:** The original `parsak/msmarco-tr` dataset was created via machine translation from English. Therefore, some Turkish expressions may not be natural or may contain translation errors.
2. **False Negatives:** Despite the `0.98` similarity filter, there is a possibility that some passages selected as "negatives" are actually correct answers that were not labeled in the original dataset.
3. **Model Bias:** The negatives were mined using the `emrecan/bert-base-turkish-cased-mean-nli-stsb-tr` model. The dataset therefore reflects the biases and semantic understanding of this mining model.
## Citation
If you use this dataset, please cite Lumees AI, the original MS MARCO authors, and the Turkish translation source as follows:
```bibtex
@misc{lumees_msmarco_hn_2025,
author = {Lumees AI and Kurşun, Hasan and Yanık, Kerem Berkay},
title = {MS MARCO TR - Hard Negatives Dataset},
year = {2025},
howpublished = {\url{https://lumees.io}},
}
@article{bajaj2016ms,
title={MS MARCO: A Human Generated Machine Reading Comprehension Dataset},
author={Bajaj, Payal and Campos, Daniel and Craswell, Nick and Deng, Li and Gao, Jianfeng and Liu, Xiaodong and Majumder, Rangan and McNamara, Andrew and Mitra, Bhaskar and Nguyen, Tri and others},
journal={arXiv preprint arXiv:1611.09268},
year={2016}
}
@misc{parsak_msmarco_tr,
author = {Parsak},
title = {MS MARCO Turkish Translation},
year = {2023},
publisher = {Hugging Face},
journal = {Hugging Face Hub},
howpublished = {\url{https://huggingface.co/datasets/parsak/msmarco-tr}}
}
```
Provider:
lumees



