reasonrank_data_13k

Name: reasonrank_data_13k
Creator: maas
Published: 2025-11-20 09:33:30
License: No description available

ModelScope Community. Updated: 2025-11-20. Indexed: 2025-08-16.

Download link:

https://modelscope.cn/datasets/lwhlwh/reasonrank_data_13k


Overview:

Useful links: 📝 <a href="https://arxiv.org/abs/2508.07050" target="_blank">arXiv Paper</a> • 🧩 <a href="https://github.com/8421BCD/ReasonRank" target="_blank">GitHub</a>

This is the full training set (13k examples) for the paper "ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability". The fields of ``training_data_all.jsonl`` are as follows:

#### **Dataset Fields & Descriptions**

1. **`dataset`** *(str)*
   - The name of the source dataset for each example (e.g., `"math-qa"`).
2. **`qid`** *(str)*
   - The query ID. The query text is provided in the ``id_query/`` directory.
3. **`initial_list`** *(List[str])*
   - The initial list of passage IDs before DeepSeek-R1 reranking. The text of each passage is provided in the ``id_doc/`` directory.
4. **`final_list`** *(List[str])*
   - The list of passage IDs after listwise reranking with DeepSeek-R1.
5. **`reasoning`** *(str)*
   - A **step-by-step reasoning chain** produced by DeepSeek-R1 while performing the listwise reranking.
6. **`relevant_docids`** *(List[str])*
   - The IDs of the relevant passages in ``initial_list``, as mined by DeepSeek-R1. All other passage IDs in ``initial_list`` are treated as irrelevant.
   - Note that the **`relevant_docids`** are not necessarily ranked at the top of **`final_list`**, which may stem from inconsistencies in DeepSeek-R1's judgments. To address this, you can apply the **self-consistency data filtering** technique proposed in our paper to select higher-quality data.
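The note on **`relevant_docids`** suggests checking how well **`final_list`** agrees with the relevance labels. The sketch below shows one way to measure that agreement: it computes NDCG@10 of `final_list` with the IDs in `relevant_docids` as binary labels, and keeps only records above a threshold. The helper names (`read_jsonl`, `ndcg_at_k`, `filter_consistent`) and the `0.5` threshold are hypothetical choices for illustration; the exact criterion and threshold used for self-consistency filtering should be taken from the paper or the GitHub repository.

```python
import json
import math
from typing import Dict, Iterable, Iterator, List


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def ndcg_at_k(ranked_ids: List[str], relevant_ids: Iterable[str], k: int = 10) -> float:
    """NDCG@k of a ranking with binary gains (1 if the ID is labeled relevant)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    # Discounted gain of the relevant IDs that appear in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    # Ideal ranking: all relevant IDs (up to k) placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


def filter_consistent(records: Iterable[Dict], min_ndcg: float = 0.5, k: int = 10) -> List[Dict]:
    """Keep records whose final_list agrees with relevant_docids (assumed threshold)."""
    return [
        r for r in records
        if ndcg_at_k(r["final_list"], r["relevant_docids"], k) >= min_ndcg
    ]
```

Assuming the file has been downloaded locally, `filter_consistent(read_jsonl("training_data_all.jsonl"))` would return the subset of records that pass the check.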
The dataset statistics are shown in the figure below:

<img width="80%" alt="image" src="https://github.com/user-attachments/assets/c04b9d1a-2f21-46f1-b23d-ad1f50d22fb8" />

#### **Example Entry**

```json
{
  "dataset": "math-qa",
  "qid": "math_1001",
  "initial_list": ["math_test_intermediate_algebra_808", "math_train_intermediate_algebra_1471", ...],
  "final_list": ["math_test_intermediate_algebra_808", "math_test_intermediate_algebra_1678", ...],
  "reasoning": "Okay, I need to rank the 20 passages based on their relevance...",
  "relevant_docids": ["math_test_intermediate_algebra_808", "math_train_intermediate_algebra_1471", "math_train_intermediate_algebra_993"]
}
```

#### **Application**

1. Training a passage reranker: the reranked passage lists (**`final_list`**) can be used to train a listwise reranker.
2. Training a passage retriever: the **`relevant_docids`** and the remaining irrelevant IDs in **`initial_list`** can be used as positives and negatives to train a passage retriever.
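As a rough illustration of the two applications above, the sketch below derives training targets from a single record. The function names and output formats are hypothetical: `split_pos_neg` separates `initial_list` into positives and negatives for retriever training, and `listwise_target` rewrites `final_list` as a permutation over 1-based positions in `initial_list` (the `[3] > [1] > [2]` string format is an assumption, not necessarily the format used by the authors). Resolving IDs to text is left out because the layout of `id_query/` and `id_doc/` is not described here.

```python
from typing import Dict, List


def split_pos_neg(record: Dict) -> Dict:
    """Split initial_list into positive and negative passage IDs for a retriever."""
    relevant = set(record["relevant_docids"])
    positives = [d for d in record["initial_list"] if d in relevant]
    negatives = [d for d in record["initial_list"] if d not in relevant]
    return {"qid": record["qid"], "positives": positives, "negatives": negatives}


def listwise_target(record: Dict) -> str:
    """Express final_list as a ranking over 1-based positions in initial_list."""
    position = {doc_id: i + 1 for i, doc_id in enumerate(record["initial_list"])}
    # IDs missing from initial_list are skipped rather than guessed.
    return " > ".join(
        f"[{position[d]}]" for d in record["final_list"] if d in position
    )
```

For a record with `initial_list = ["a", "b", "c"]`, `final_list = ["c", "a", "b"]`, and `relevant_docids = ["a", "c"]`, the first helper yields positives `["a", "c"]` and negatives `["b"]`, and the second yields `"[3] > [1] > [2]"`.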


Provider:

maas

Created:

2025-08-15
