
reasonrank_data_13k

Source: ModelScope community (魔搭社区) · Updated 2025-11-20 · Indexed 2025-08-16
Download link:
https://modelscope.cn/datasets/lwhlwh/reasonrank_data_13k
Description:
<p align="left"> Useful links: 📝 <a href="https://arxiv.org/abs/2508.07050" target="_blank">arXiv Paper</a> • 🧩 <a href="https://github.com/8421BCD/ReasonRank" target="_blank">GitHub</a> </p>

This is the complete training set (13k examples) for the paper "ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability". The fields of ``training_data_all.jsonl`` are as follows:

#### **Dataset Fields & Descriptions**

1. **`dataset`** *(str)*
   - The name of the source dataset each record comes from (e.g., `"math-qa"`).
2. **`qid`** *(str)*
   - The query ID. The query text is provided in the ``id_query/`` directory.
3. **`initial_list`** *(List[str])*
   - The initial list of passage IDs before DeepSeek-R1 reranking. The text of each passage is provided in the ``id_doc/`` directory.
4. **`final_list`** *(List[str])*
   - The reranked list of passage IDs produced by listwise reranking with DeepSeek-R1.
5. **`reasoning`** *(str)*
   - The **step-by-step reasoning chain** output by DeepSeek-R1 while performing the listwise reranking.
6. **`relevant_docids`** *(List[str])*
   - The IDs of the relevant passages in ``initial_list``, as mined by DeepSeek-R1. All other passage IDs in ``initial_list`` are irrelevant.
   - Note that the **`relevant_docids`** are not necessarily ranked at the top of **`final_list`** by DeepSeek-R1, which may stem from inconsistencies in DeepSeek-R1's judgments. To address this, you can apply the **self-consistency data filtering** technique proposed in our paper to select higher-quality data.
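The note on `relevant_docids` can be made concrete with a small diagnostic. The sketch below is not the self-consistency data filtering technique from the paper (see the paper for that); it is only a simple, assumed check of how many relevant IDs land in the top-k of `final_list`, with k set to the number of relevant IDs. The function name `relevant_at_top_fraction` and the sample record are made up for illustration; only the field names come from the description above.

```python
def relevant_at_top_fraction(record):
    """Fraction of relevant_docids found in the top-k of final_list,
    where k is the number of relevant ids (an illustrative diagnostic,
    not the paper's filtering method)."""
    relevant = set(record["relevant_docids"])
    if not relevant:
        return 0.0
    top_k = record["final_list"][: len(relevant)]
    hits = sum(1 for docid in top_k if docid in relevant)
    return hits / len(relevant)


# Hypothetical record that uses the documented field names.
record = {
    "initial_list": ["d1", "d2", "d3", "d4"],
    "final_list": ["d1", "d3", "d2", "d4"],
    "relevant_docids": ["d1", "d2"],
}

score = relevant_at_top_fraction(record)
print(score)  # 0.5: only "d1" of the two relevant ids is in the top 2
```

A score of 1.0 means the reranked list places every relevant passage ahead of every irrelevant one; lower values flag records where the ranking and the relevance labels disagree.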
The dataset statistics are shown in the figure below:

<p align="center"> <img width="80%" alt="image" src="https://github.com/user-attachments/assets/c04b9d1a-2f21-46f1-b23d-ad1f50d22fb8" /> </p>

#### **Example Entry**

```json
{
  "dataset": "math-qa",
  "qid": "math_1001",
  "initial_list": ["math_test_intermediate_algebra_808", "math_train_intermediate_algebra_1471", ...],
  "final_list": ["math_test_intermediate_algebra_808", "math_test_intermediate_algebra_1678", ...],
  "reasoning": "Okay, I need to rank the 20 passages based on their relevance...",
  "relevant_docids": ["math_test_intermediate_algebra_808", "math_train_intermediate_algebra_1471", "math_train_intermediate_algebra_993"]
}
```

#### **Application**

1. Training a passage reranker: given the reranked passage lists, one can use this data to train a listwise reranker.
2. Training a passage retriever: using the **`relevant_docids`** and the remaining irrelevant IDs, one can train a passage retriever.
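The two applications above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' training pipeline: the helper names (`read_jsonl`, `retriever_example`, `reranker_example`) and the output dictionary keys are hypothetical, and the code works only with IDs because the file layout of `id_query/` and `id_doc/` is not described here.

```python
import json


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def retriever_example(record):
    """Split initial_list into positive and negative passage ids,
    preserving the original order (hypothetical output layout)."""
    relevant = set(record["relevant_docids"])
    positives = [d for d in record["initial_list"] if d in relevant]
    negatives = [d for d in record["initial_list"] if d not in relevant]
    return {"qid": record["qid"], "positives": positives, "negatives": negatives}


def reranker_example(record):
    """Pair the candidate list with its target ranking and reasoning chain
    (hypothetical output layout for listwise training)."""
    return {
        "qid": record["qid"],
        "candidates": record["initial_list"],
        "target_ranking": record["final_list"],
        "reasoning": record["reasoning"],
    }


# Made-up record that follows the documented schema.
sample = {
    "dataset": "math-qa",
    "qid": "q1",
    "initial_list": ["d1", "d2", "d3"],
    "final_list": ["d2", "d1", "d3"],
    "reasoning": "...",
    "relevant_docids": ["d2"],
}

print(retriever_example(sample))
print(reranker_example(sample)["target_ranking"])

# With the real file, one would iterate over it instead:
#   for record in read_jsonl("training_data_all.jsonl"):
#       example = retriever_example(record)
```

Resolving the IDs to query and passage text would require reading the `id_query/` and `id_doc/` directories, whose exact format should be checked in the dataset itself.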

Provider:
maas
Created:
2025-08-15