LocalDoc/msmarco-az-reranked
Source: Hugging Face. Updated 2026-03-05. Indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/LocalDoc/msmarco-az-reranked
Description:
---
language:
- az
- en
license: ms-pl
task_categories:
- text-retrieval
tags:
- retrieval
- azerbaijani
- information-retrieval
- hard-negatives
- reranker
- ms-marco
- dense-retrieval
- colbert
- bi-encoder
- translated
size_categories:
- 1M<n<10M
configs:
- config_name: corpus
  data_files:
  - split: train
    path: corpus/train-*
- config_name: queries
  data_files:
  - split: train
    path: queries/train-*
- config_name: triplets
  data_files:
  - split: train
    path: triplets/train-*
---
# MS MARCO Azerbaijani — Reranked Retrieval Training Dataset
A large-scale passage retrieval training dataset in Azerbaijani, built by translating a 3.2M-triplet subset of the [MS MARCO](https://microsoft.github.io/msmarco/) passage ranking dataset and rescoring every query-passage pair with a multilingual cross-encoder reranker.
## Overview
| | Count |
|---|---|
| Passages | 8,473,865 |
| Queries | ~800,000 |
| Triplets | ~3,200,000 |
| Negatives per triplet | up to 31 |
| Total pairs scored | 41,746,530 |
## Dataset Configs
The dataset consists of three configs:
### `corpus`
The full translated passage collection.
| Column | Type | Description |
|---|---|---|
| `pid` | int | Passage ID (original MS MARCO pid) |
| `passage` | string | Passage text translated to Azerbaijani |
### `queries`
Translated queries.
| Column | Type | Description |
|---|---|---|
| `qid` | int | Query ID (original MS MARCO qid) |
| `query` | string | Query text translated to Azerbaijani |
### `triplets`
Training triplets with both original MS MARCO scores and reranker scores computed on the Azerbaijani translations.
| Column | Type | Description |
|---|---|---|
| `qid` | int | Query ID (links to `queries`) |
| `pos_pid` | int | Positive passage ID (links to `corpus`) |
| `pos_score_original` | float | Original MS MARCO cross-encoder score (English) |
| `pos_score_reranker` | float | Reranker score on Azerbaijani translation |
| `neg_count` | int | Number of valid negatives for this triplet |
| `neg_{k}_pid` | int | Passage ID of the k-th hard negative |
| `neg_{k}_score_original` | float | Original MS MARCO score of the k-th negative |
| `neg_{k}_score_reranker` | float | Reranker score of the k-th negative (Azerbaijani) |
Negatives are sorted by `neg_{k}_score_reranker` in descending order (hardest first). Columns run from `neg_1_*` to `neg_31_*`; `neg_count` gives the number of valid negatives in a row.
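Because the negatives are stored as wide columns, it can be convenient to collect them into a list before further processing. The following is a minimal sketch, not part of the dataset tooling: the helper name `unpack_negatives` and the example row are hypothetical, and it assumes that the populated negatives occupy the first `neg_count` slots.

```python
from typing import Any, Dict, List

MAX_NEGATIVES = 31  # columns run from neg_1_* to neg_31_*


def unpack_negatives(row: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the populated neg_{k}_* columns of one triplet row into a list."""
    # Assumption: the first `neg_count` slots are the populated ones.
    count = min(row["neg_count"], MAX_NEGATIVES)
    negatives = []
    for k in range(1, count + 1):
        negatives.append(
            {
                "rank": k,
                "pid": row[f"neg_{k}_pid"],
                "score_original": row[f"neg_{k}_score_original"],
                "score_reranker": row[f"neg_{k}_score_reranker"],
            }
        )
    return negatives


# Hypothetical row with two populated negatives (all values are made up).
example_row = {
    "qid": 1,
    "pos_pid": 10,
    "pos_score_original": 9.0,
    "pos_score_reranker": 6.0,
    "neg_count": 2,
    "neg_1_pid": 11,
    "neg_1_score_original": 8.0,
    "neg_1_score_reranker": 5.0,
    "neg_2_pid": 12,
    "neg_2_score_original": 7.5,
    "neg_2_score_reranker": 1.5,
}

negatives = unpack_negatives(example_row)
print([n["pid"] for n in negatives])
```

The same function should work on a row returned by `datasets`, since such rows are plain dictionaries keyed by column name.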
## Construction Pipeline
1. **Sampling**: 3.2M triplets were sampled from the MS MARCO `examples.json` using reservoir sampling, with 31 negatives selected per query
2. **Translation**: All queries and passages were translated from English to Azerbaijani
3. **Reranking**: Every query-passage pair (positive + all negatives) was scored with [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) on the Azerbaijani translations (~14 hours, 41.7M pairs scored)
4. **Output**: Triplets with dual scores (original English + Azerbaijani reranker) to enable flexible filtering during training
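The sampling code itself is not included in this card. For readers unfamiliar with the technique named in step 1, here is a sketch of classic reservoir sampling (Algorithm R), which draws a fixed-size uniform sample from a stream in a single pass. The function name `reservoir_sample` and its parameters are chosen here for illustration; this is not the authors' script.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return k items drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# Stand-in for a large file of examples: sample 5 items from 10,000.
sample = reservoir_sample(range(10_000), k=5, seed=0)
print(sample)
```

The appeal of this approach for a large source file is that memory use depends only on the sample size, not on the size of the input.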
## Why Reranker Scores?
The original MS MARCO scores were computed on English text. After translation, the semantic relationship between a query and a passage can shift: some negatives move closer to the positive, and some positives become weaker. The reranker scores are computed on the Azerbaijani text, so they reflect what a model will actually see during training.
This also enables **false negative filtering**: a negative with `neg_{k}_score_reranker > threshold * pos_score_reranker` (for example, `threshold = 0.95`) is likely a correct answer that MS MARCO did not annotate. Such negatives can be filtered out during training to avoid noisy supervision.
## Usage
```python
from datasets import load_dataset
corpus = load_dataset("LocalDoc/msmarco-az-reranked", "corpus")["train"]
queries = load_dataset("LocalDoc/msmarco-az-reranked", "queries")["train"]
triplets = load_dataset("LocalDoc/msmarco-az-reranked", "triplets")["train"]
# Build lookups
passage_lookup = {row["pid"]: row["passage"] for row in corpus}
query_lookup = {row["qid"]: row["query"] for row in queries}
# Inspect a triplet
t = triplets[0]
print(f"Query: {query_lookup[t['qid']]}")
print(f"Positive [reranker={t['pos_score_reranker']:.4f}]: {passage_lookup[t['pos_pid']][:200]}")
for k in range(1, 4):
    neg_pid = t[f"neg_{k}_pid"]
    neg_score = t[f"neg_{k}_score_reranker"]
    if neg_pid:
        print(f"Neg-{k} [reranker={neg_score:.4f}]: {passage_lookup[neg_pid][:200]}")
```
### Training with False Negative Filtering
```python
# Filter out false negatives where negative score > 95% of positive score
FN_THRESHOLD = 0.95
t = triplets[0]
pos_score = t["pos_score_reranker"]
cutoff = FN_THRESHOLD * pos_score
clean_negs = []
for k in range(1, 32):
    neg_pid = t[f"neg_{k}_pid"]
    neg_score = t[f"neg_{k}_score_reranker"]
    if neg_pid and neg_score < cutoff:
        clean_negs.append((neg_pid, neg_score))
print(f"Original negatives: {t['neg_count']}")
print(f"After FN filtering: {len(clean_negs)}")
```
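To go from a filtered triplet to text that a trainer can consume, the filter can be wrapped in a function that also resolves IDs to text. The sketch below is one possible shape, under stated assumptions: `to_training_example` and `max_negatives` are hypothetical names, the lookups and row are toy data, and the populated negatives are assumed to occupy the first `neg_count` slots. The multiplicative cutoff presumes a positive `pos_score_reranker`; the card does not say whether non-positive scores occur, so this sketch simply skips such rows, which is a choice made here rather than the authors' rule.

```python
from typing import Any, Dict, Optional


def to_training_example(
    row: Dict[str, Any],
    query_lookup: Dict[int, str],
    passage_lookup: Dict[int, str],
    ratio: float = 0.95,
    max_negatives: int = 7,
) -> Optional[Dict[str, Any]]:
    """Build a text-level example, dropping likely false negatives."""
    pos_score = row["pos_score_reranker"]
    if pos_score <= 0:
        # ratio * pos_score is not a meaningful cutoff for non-positive scores.
        return None
    cutoff = ratio * pos_score
    negatives = []
    # Negatives are sorted hardest first, so this keeps the hardest survivors.
    for k in range(1, row["neg_count"] + 1):
        if row[f"neg_{k}_score_reranker"] < cutoff:
            negatives.append(passage_lookup[row[f"neg_{k}_pid"]])
        if len(negatives) == max_negatives:
            break
    if not negatives:
        return None
    return {
        "query": query_lookup[row["qid"]],
        "positive": passage_lookup[row["pos_pid"]],
        "negatives": negatives,
    }


# Toy data (made up) standing in for the real lookups and a triplet row.
toy_queries = {1: "toy query"}
toy_passages = {10: "positive", 11: "passage A", 12: "passage B", 13: "passage C"}
toy_row = {
    "qid": 1,
    "pos_pid": 10,
    "pos_score_reranker": 6.0,
    "neg_count": 3,
    "neg_1_pid": 11,
    "neg_1_score_reranker": 7.0,  # above the positive: likely false negative
    "neg_2_pid": 12,
    "neg_2_score_reranker": 5.8,  # above 0.95 * 6.0 = 5.7: also dropped
    "neg_3_pid": 13,
    "neg_3_score_reranker": 2.0,  # kept
}

example = to_training_example(toy_row, toy_queries, toy_passages)
print(example)
```

Returning `None` for unusable rows keeps the function easy to combine with a simple filter step over the whole `triplets` split.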
## Example Output
```
Query: Dişi aslanlar nə qədər doğurur
Positive [original=10.41, reranker=5.64]:
Dişi şir normalda hər 18-26 aydan bir doğur. Təxminən 100-119 günlük
hamiləlik dövründən sonra bir-altı bala doğur. Lakin, balaların sayı
adətən üç və ya dörd olur və hər birinin çəkisi təxminən 3 funt olur.
Neg-1 [original=9.26, reranker=7.41]: ← false negative (reranker > positive)
Dişi aslanlar adətən hər iki ildən bir bala doğurlar. Dişilər hamilə
və ya əmizdirən deyillərsə, ildə bir neçə dəfə cütləşməyə hazırdırlar.
Neg-2 [original=9.35, reranker=5.41]:
Pride-ın dişi hissəsi bütün yetkinlik həyatlarını birlikdə yaşayır,
lakin erkəklər gəlib-gedir. Dişi aslanın hamiləliyi təxminən dörd ay
davam edir.
Neg-3 [original=3.27, reranker=2.77]: ← true negative
At: Dişilərin hamiləliyi adətən 11-12 ay çəkir. Dəniz aslanı: Dəniz
şirləri də balalarını 11-12 aylıq hamiləlik dövründən sonra dünyaya
gətirirlər.
```
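As a worked check, the `FN_THRESHOLD = 0.95` rule from the filtering snippet can be applied to the reranker scores shown in this example. The short script below only reuses the numbers printed above; the variable names are illustrative.

```python
FN_THRESHOLD = 0.95

# Reranker scores copied from the example output above.
pos_score = 5.64
neg_scores = {"Neg-1": 7.41, "Neg-2": 5.41, "Neg-3": 2.77}

cutoff = FN_THRESHOLD * pos_score  # 0.95 * 5.64 = 5.358

kept = [name for name, score in neg_scores.items() if score < cutoff]
dropped = [name for name, score in neg_scores.items() if score >= cutoff]

print(f"cutoff = {cutoff:.3f}")
print(f"kept: {kept}")
print(f"dropped: {dropped}")
```

With this threshold, Neg-2 (5.41) is dropped along with Neg-1, because it also exceeds the cutoff of 5.358, and only Neg-3 remains. A lower or higher threshold would change which borderline negatives survive.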
## Limitations
- Passages and queries are machine-translated; translation artifacts (lexical mismatch, semantic drift) may affect quality
- Reranker scores are from a multilingual model that may underperform on Azerbaijani compared to English
- Original MS MARCO annotations are incomplete — some "negatives" are actually relevant (false negatives)
## Contact
For questions or issues, please contact LocalDoc at v.resad.89@gmail.com.