reasonrank_data_13k
Source: ModelScope community (魔搭社区); updated 2025-11-20; indexed 2025-08-16
Download link:
https://modelscope.cn/datasets/lwhlwh/reasonrank_data_13k
Description:
<p align="left">
Useful links: 📝 <a href="https://arxiv.org/abs/2508.07050" target="_blank">arXiv Paper</a> • 🧩 <a href="https://github.com/8421BCD/ReasonRank" target="_blank">GitHub</a>
</p>
This is the complete training set (13k examples) for the paper *ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability*.
The dataset fields of ``training_data_all.jsonl`` are as follows:
#### **Dataset Fields & Descriptions**
1. **`dataset`** *(str)*
- The dataset name of each piece of data (e.g., `"math-qa"`).
2. **`qid`** *(str)*
- The query ID. The query text is provided in the ``id_query/`` directory.
3. **`initial_list`** *(List[str])*
- The initial list of passage IDs before DeepSeek-R1 reranking. The text of each passage is provided in the ``id_doc/`` directory.
4. **`final_list`** *(List[str])*
- The list of passage IDs after listwise reranking with DeepSeek-R1.
5. **`reasoning`** *(str)*
- The **step-by-step reasoning chain** that DeepSeek-R1 produced while performing the listwise reranking.
6. **`relevant_docids`** *(List[str])*
- The IDs of the relevant passages in ``initial_list``, as mined by DeepSeek-R1. The remaining passage IDs in ``initial_list`` are treated as irrelevant.
- Note that the **`relevant_docids`** are not necessarily ranked at the top of **`final_list`** by DeepSeek-R1, which may stem from inconsistencies in DeepSeek-R1's judgments. To address this, you can apply the **self-consistency data filtering** technique proposed in our paper to select higher-quality data.
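
The note above can be made concrete with a small sketch. It assumes one plausible form of self-consistency filtering: score each `final_list` against its own `relevant_docids` with NDCG@k and keep only the examples that score at or above a threshold. The function names, `k=10`, and `threshold=0.5` are illustrative placeholders, not values confirmed by this card; consult the paper for the exact criterion.

```python
import math
from typing import Iterable, List, Sequence


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k of a ranked list against a set of relevant IDs."""
    relevant = set(relevant_ids)
    # Discounted gain of each relevant passage found in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, pid in enumerate(ranked_ids[:k])
        if pid in relevant
    )
    # Ideal ranking: all relevant passages (up to k) placed first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def filter_consistent(records: Iterable[dict], k: int = 10, threshold: float = 0.5) -> List[dict]:
    """Keep records whose final_list agrees with relevant_docids (assumed criterion)."""
    return [
        r for r in records
        if ndcg_at_k(r["final_list"], r["relevant_docids"], k) >= threshold
    ]
```

A record whose relevant passages all sit at the top of `final_list` scores 1.0 and is kept; a record whose relevant passages are missing from the top k scores 0.0 and is dropped.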
The dataset statistics are shown in the figure below:
<p align="center">
<img width="80%" alt="image" src="https://github.com/user-attachments/assets/c04b9d1a-2f21-46f1-b23d-ad1f50d22fb8" />
</p>
#### **Example Entry**
```json
{
"dataset": "math-qa",
"qid": "math_1001",
"initial_list": ["math_test_intermediate_algebra_808", "math_train_intermediate_algebra_1471", ...],
"final_list": ["math_test_intermediate_algebra_808", "math_test_intermediate_algebra_1678", ...],
"reasoning": "Okay, I need to rank the 20 passages based on their relevance...",
"relevant_docids": ["math_test_intermediate_algebra_808", "math_train_intermediate_algebra_1471", "math_train_intermediate_algebra_993"]
}
```
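
To read entries like the one above, a minimal loader might look as follows. It assumes `training_data_all.jsonl` holds one JSON object per line, as the `.jsonl` extension suggests; the helper name `iter_records` and the field check are illustrative, not part of the released code.

```python
import json
from typing import Iterator

# Field names taken from the field descriptions in this card.
EXPECTED_FIELDS = {
    "dataset", "qid", "initial_list", "final_list", "reasoning", "relevant_docids",
}


def iter_records(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file, checking its fields."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = EXPECTED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            yield record
```

Typical usage would be `records = list(iter_records("training_data_all.jsonl"))`. Resolving a `qid` or passage ID to its text requires the `id_query/` and `id_doc/` directories, whose file layout is not described here.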
#### **Application**
1. Training a passage reranker: given the reranked passage list (**`final_list`**), one can use the data to train a listwise reranker.
2. Training a passage retriever: using the **`relevant_docids`** and the remaining irrelevant IDs, one can train a passage retriever.
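
Both applications can be sketched from a single record. The helper names and output shapes below are assumptions chosen for illustration: the retriever example splits `initial_list` into positives and negatives, and the reranker target expresses `final_list` as 1-based positions in `initial_list`, a common target format for listwise rerankers that this card does not confirm.

```python
from typing import Dict, List


def retriever_example(record: dict) -> Dict[str, object]:
    """Split initial_list into positive and negative passage IDs for one query."""
    relevant = set(record["relevant_docids"])
    positives = [pid for pid in record["initial_list"] if pid in relevant]
    negatives = [pid for pid in record["initial_list"] if pid not in relevant]
    return {"qid": record["qid"], "positives": positives, "negatives": negatives}


def reranker_target(record: dict) -> List[int]:
    """Express final_list as 1-based positions of its passages in initial_list."""
    position = {pid: i + 1 for i, pid in enumerate(record["initial_list"])}
    # IDs absent from initial_list are skipped rather than guessed.
    return [position[pid] for pid in record["final_list"] if pid in position]
```

For a record with `initial_list = ["p1", "p2", "p3"]`, `final_list = ["p3", "p1", "p2"]`, and `relevant_docids = ["p3"]`, the retriever example has `p3` as its only positive and the reranker target is `[3, 1, 2]`.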
Provider:
maas
Created:
2025-08-15



