Vladimirlv/ru-promptriever-dataset
Source: Hugging Face. Updated 2026-03-20; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Vladimirlv/ru-promptriever-dataset
Resource description:
---
language:
- ru
license: cc-by-nc-4.0
task_categories:
- text-retrieval
task_ids:
- document-retrieval
tags:
- retrieval
- dense-retrieval
- instruction-following
- promptriever
- russian
- mmarco
- information-retrieval
pretty_name: RuPromptriever Dataset
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: validation
    path: data/val-*.parquet
  - split: test
    path: data/test-*.parquet
---
# RuPromptriever Dataset
A Russian-language instruction-following retrieval dataset for training bi-encoder models that can follow natural language constraints at query time — for example, *"Find documents about X, but exclude those mentioning Y."*
Built following the methodology of [Promptriever (Weller et al., 2024)](https://arxiv.org/abs/2409.11136) and adapted to Russian using the [mMARCO-ru](https://huggingface.co/datasets/unicamp-dl/mmarco) passage corpus (~8.8M passages).
---
## Motivation
Standard dense retrieval models match queries to documents purely by semantic similarity. **Promptriever** extends this paradigm: the retriever learns to interpret and obey natural language instructions appended to a query, enabling fine-grained controlled retrieval without model retraining or structural changes.
This dataset provides the Russian-language training signal for that capability.
---
## Dataset Construction Pipeline
Each example was generated through a multi-stage LLM pipeline operating on top of mMARCO-ru triples (query, positive passage, negative passage):
### Stage 1 — Rewriting & Instruction Generation
The raw mMARCO-ru data is machine-translated from English and often contains broken grammar, word-order errors, and translation artifacts. A large language model was used to:
1. **Rewrite** the query and both passages into natural, fluent Russian.
2. **Generate a retrieval instruction** — a natural language constraint that makes exactly one of the two passages relevant and the other irrelevant. Instructions were sampled with randomized length (short to very long) and style (background context, persona, negation, detailed criteria), producing a diverse constraint distribution.
### Stage 2 — Instruction Negative Mining
After generating the instruction, the same LLM was asked to synthesize **instruction negative passages** — documents that appear topically related to the query but violate the generated instruction. Each batch of 4 documents contained:
- 1 **positive** passage matching both the query and the instruction (`error_type: none`)
- 3 **instruction negatives**, each demonstrating a distinct failure mode:
  - `different_interpretation`: uses an alternative meaning of the query term that contradicts the instruction context
  - `omission`: looks like an ideal answer but is missing a key element required by the instruction
  - `mention_non_relevant_flag`: explicitly contains content that the instruction prohibits
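
The batch composition described above can be checked mechanically. The following is a minimal sketch, assuming each generated document is a dict with an `error_type` key as the labels in the list suggest; the helper name `is_valid_batch` and the toy documents are hypothetical and not part of the dataset tooling.

```python
from collections import Counter

# Labels taken from the Stage 2 description: one positive, three failure modes.
EXPECTED_LABELS = Counter({
    "none": 1,
    "different_interpretation": 1,
    "omission": 1,
    "mention_non_relevant_flag": 1,
})

def is_valid_batch(batch):
    """True if the batch holds one positive and one negative per failure mode."""
    return Counter(doc["error_type"] for doc in batch) == EXPECTED_LABELS

# Toy batches (hypothetical content, for illustration only).
good_batch = [
    {"text": "doc A", "error_type": "none"},
    {"text": "doc B", "error_type": "different_interpretation"},
    {"text": "doc C", "error_type": "omission"},
    {"text": "doc D", "error_type": "mention_non_relevant_flag"},
]
# Same first three documents, but the fourth repeats the `omission` label.
bad_batch = good_batch[:3] + [{"text": "doc E", "error_type": "omission"}]

print(is_valid_batch(good_batch), is_valid_batch(bad_batch))
```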
### Stage 3 — LLM-Based Filtering
All generated triplets were validated by an LLM judge. For each record the judge verified that:
- The assigned positive passage genuinely satisfies the instruction.
- Each synthetic negative genuinely fails to satisfy the instruction.
- There are no factual hallucinations or leakage of the answer into the instruction.
Records that failed validation were discarded. To maintain the target dataset volume, a second generation pass was performed to replace rejected samples.
### Stage 4 — BM25 Hard Negative Mining
For every query (standard and instruction-augmented), top-k BM25 candidates were retrieved from the full mMARCO-ru corpus and used as hard negatives, stored in `negative_passages`.
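
As an illustration of this stage, here is a minimal pure-Python sketch of BM25 hard negative mining. It is not the pipeline's actual implementation: the card does not name the BM25 library, tokenizer, parameters, or the value of k, so the whitespace tokenizer, `k1=1.5`, `b=0.75`, `top_k=2`, and the function names below are all assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the card names no tokenizer.
    return text.lower().split()

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` (dict docid -> text) against `query`."""
    docs = {docid: tokenize(text) for docid, text in corpus.items()}
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in docs.values():
        df.update(set(toks))
    scores = {}
    for docid, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[docid] = score
    return scores

def mine_hard_negatives(query, corpus, positive_ids, top_k=2):
    """Return up to top_k best-scoring docids that are not known positives."""
    scores = bm25_scores(query, corpus)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [d for d in ranked if d not in positive_ids and scores[d] > 0][:top_k]

# Toy corpus (hypothetical passages, for illustration only).
corpus = {
    "d1": "столица России Москва",
    "d2": "Москва крупнейший город России",
    "d3": "рецепт борща со свёклой",
}
negatives = mine_hard_negatives("столица России", corpus, positive_ids={"d1"})
print(negatives)
```

Documents with no lexical overlap score zero and are dropped, so only passages that share terms with the query but are not labeled relevant end up as hard negatives.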
### Stage 5 — Query Paraphrasing Mix
To reduce out-of-distribution noise from machine translation, each query is drawn with equal probability from either the original mMARCO-ru query or its LLM-rewritten paraphrase.
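
The mixing step can be sketched as an independent fair coin flip per query. This is an assumption about how the sampling was done; the card states only the 50/50 ratio, and the function name `pick_query` and the seed are hypothetical.

```python
import random

def pick_query(original, paraphrase, rng):
    """Return the original or the paraphrase, each with probability 0.5."""
    return original if rng.random() < 0.5 else paraphrase

# A seeded generator keeps the mix reproducible across runs.
rng = random.Random(0)
picks = [pick_query("orig", "para", rng) for _ in range(10_000)]
share_original = picks.count("orig") / len(picks)
print(f"share of original queries: {share_original:.3f}")
```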
---
## Instruction–noInstruction Pairing
Every source query produces **two rows** in the dataset:
- A **standard retrieval** row (`has_instruction: false`) — no instruction, uses the original positive passage and BM25 hard negatives.
- An **instruction-following** row (`has_instruction: true`) — query + instruction appended, uses the rewritten positive passage and instruction negatives from `new_negatives`.
This 1:1 pairing prevents catastrophic forgetting: the model learns instruction-following without losing standard retrieval ability.
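
The pairing can be sketched as a function that emits both rows for one source query. The field values follow the Schema table; the function name `make_paired_rows`, the empty `new_negatives` on the standard row, and the reuse of one BM25 list for both rows are assumptions made for brevity.

```python
def make_paired_rows(query_id, query, instruction, original_positive,
                     rewritten_positive, bm25_negatives, instruction_negatives):
    """Build the standard row and its instruction-following counterpart."""
    standard = {
        "query_id": query_id,
        "query": query,
        "only_query": query,
        "only_instruction": "",
        "has_instruction": False,
        "positive_passages": [original_positive],
        "negative_passages": list(bm25_negatives),
        "new_negatives": [],  # assumption: no synthetic negatives on standard rows
    }
    instruct = {
        "query_id": f"{query_id}-instruct",
        "query": f"{query} {instruction}",
        "only_query": query,
        "only_instruction": instruction,
        "has_instruction": True,
        "positive_passages": [rewritten_positive],
        # Assumption: the same BM25 list is reused here for brevity; Stage 4
        # also mines candidates for the instruction-augmented query.
        "negative_passages": list(bm25_negatives),
        "new_negatives": list(instruction_negatives),
    }
    return [standard, instruct]

# Toy inputs (hypothetical, for illustration only).
rows = make_paired_rows(
    query_id="42",
    query="столица России",
    instruction="Исключи документы о спорте.",
    original_positive={"docid": "p1", "text": "raw passage", "title": ""},
    rewritten_positive={"docid": "p1r", "text": "rewritten passage", "title": ""},
    bm25_negatives=[{"docid": "n1", "text": "bm25 negative", "title": "", "explanation": ""}],
    instruction_negatives=[{"docid": "s1", "text": "synthetic negative", "title": "", "explanation": "omission"}],
)
print([r["query_id"] for r in rows])
```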
---
## Schema
| Column | Type | Description |
|---|---|---|
| `query_id` | `string` | Unique row ID. Instruction rows have a `-instruct` suffix. |
| `query` | `string` | Full query text. For instruction rows: `only_query + " " + only_instruction`. |
| `positive_passages` | `list[{docid, text, title}]` | Relevant passage(s) for this row. |
| `negative_passages` | `list[{docid, text, title, explanation}]` | BM25 hard negatives (no synthetic instruction negatives here). |
| `only_instruction` | `string` | Instruction text in isolation. Empty string for non-instruction rows. |
| `only_query` | `string` | Base query text without any instruction. |
| `has_instruction` | `bool` | `true` if this row is an instruction-following example. |
| `new_negatives` | `list[{docid, text, title, explanation}]` | LLM-generated synthetic instruction negatives. The `explanation` field contains the `error_type` value (`different_interpretation`, `omission`, `mention_non_relevant_flag`). |
| `is_repeated` | `bool` | `true` if this query ID appears more than once in the source mMARCO data (multiple relevant documents). By default these rows should be excluded from training to avoid label noise. |
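
To show how these columns fit together, the sketch below flattens one row into a query, a positive text, and a list of negative texts, and tallies synthetic negatives by the error type stored in `explanation`. The helper names and the toy row are hypothetical; placing synthetic negatives before BM25 negatives is an arbitrary choice, not something the card prescribes.

```python
from collections import Counter

def to_training_example(row):
    """Flatten one row into query, positive text, and negative texts."""
    negatives = [p["text"] for p in row["new_negatives"]]
    negatives += [p["text"] for p in row["negative_passages"]]
    return {
        "query": row["query"],
        "positive": row["positive_passages"][0]["text"],
        "negatives": negatives,
    }

def error_type_counts(rows):
    """Count synthetic negatives by the error type kept in `explanation`."""
    return Counter(neg["explanation"] for row in rows for neg in row["new_negatives"])

# Toy row shaped like the schema above (hypothetical content).
toy_row = {
    "query_id": "42-instruct",
    "query": "base query plus instruction",
    "only_query": "base query",
    "only_instruction": "plus instruction",
    "has_instruction": True,
    "is_repeated": False,
    "positive_passages": [{"docid": "p1", "text": "positive", "title": ""}],
    "negative_passages": [{"docid": "n1", "text": "bm25 negative", "title": "", "explanation": ""}],
    "new_negatives": [
        {"docid": "s1", "text": "synthetic A", "title": "", "explanation": "omission"},
        {"docid": "s2", "text": "synthetic B", "title": "", "explanation": "different_interpretation"},
    ],
}
example = to_training_example(toy_row)
counts = error_type_counts([toy_row])
print(example["negatives"], dict(counts))
```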
---
## Splits
| Split | Description |
|---|---|
| `train` | Main training set (unique query IDs, no overlap with val/test). |
| `validation` | Held-out validation set (unique query IDs). |
| `test` | Held-out test set (unique query IDs). |
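
The no-overlap property can be verified by comparing base query IDs across splits. This is a minimal sketch on toy ID lists; the helper names are hypothetical, and it assumes that stripping the `-instruct` suffix recovers the source query ID, as the Schema table implies.

```python
from itertools import combinations

SUFFIX = "-instruct"

def base_query_id(query_id):
    """Strip the `-instruct` suffix so paired rows map to one source query."""
    return query_id[: -len(SUFFIX)] if query_id.endswith(SUFFIX) else query_id

def split_overlaps(splits):
    """Map each pair of split names to the base query IDs they share."""
    ids = {name: {base_query_id(q) for q in qids} for name, qids in splits.items()}
    return {(a, b): ids[a] & ids[b] for a, b in combinations(ids, 2)}

# Toy ID lists (hypothetical); real IDs come from the `query_id` column.
toy_splits = {
    "train": ["101", "101-instruct", "102", "102-instruct"],
    "validation": ["201", "201-instruct"],
    "test": ["301", "301-instruct"],
}
overlaps = split_overlaps(toy_splits)
leaks = {pair: shared for pair, shared in overlaps.items() if shared}
print(leaks)
```

With the loaded dataset, the same check would take each split's `query_id` column in place of the toy lists.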
---
## Usage
```python
from datasets import load_dataset
ds = load_dataset("Vladimirlv/ru-promptriever-dataset")
# Recommended: exclude repeated-query rows to avoid label noise
clean_train = ds["train"].filter(lambda x: not x["is_repeated"])
# Split by row type
instruct_rows = clean_train.filter(lambda x: x["has_instruction"]) # query + instruction
standard_rows = clean_train.filter(lambda x: not x["has_instruction"]) # query only
# Accessing passage text; an instruction row is used because standard rows may have no `new_negatives`
row = instruct_rows[0]
print(row["query"])
print(row["positive_passages"][0]["text"]) # positive passage text
print(row["negative_passages"][0]["text"]) # hard BM25 negative text
print(row["new_negatives"][0]["text"]) # synthetic instruction negative text
```
---
## Intended Use
- Training Russian instruction-following dense retrieval models (bi-encoders).
- Evaluating retrieval models on their ability to follow natural language constraints.
- Research on multilingual Promptriever-style systems.
## Out-of-Scope Use
This dataset should not be used for training general-purpose language models. Evaluation on standard retrieval benchmarks should be done separately (e.g., on [ruMTEB](https://huggingface.co/datasets/ai-forever/ru-mteb), [mFollowIR](https://huggingface.co/datasets/jhu-clsp/mFollowIR)).
---
## Limitations
- The passage corpus is derived from MS MARCO — primarily English web passages machine-translated to Russian. Translation quality affects a portion of examples.
- Instructions and synthetic negatives were generated and filtered automatically by an LLM. A small fraction of noisy examples may remain despite filtering.
- The dataset covers Russian only.
- Due to the non-commercial license of MS MARCO, this dataset is released for **research and non-commercial use only**.
---
## License
This dataset is released under **CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International), consistent with the non-commercial license of the underlying [MS MARCO](https://microsoft.github.io/msmarco/) corpus.
---
## Citation
If you use this dataset, please cite the original Promptriever paper it is based on:
```bibtex
@article{weller2024promptriever,
title = {Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models},
author = {Weller, Orion and Lawrie, Dawn and Van Durme, Benjamin and others},
journal = {arXiv preprint arXiv:2409.11136},
year = {2024}
}
```
Provided by:
Vladimirlv



