ReaRAG-20k

Name: ReaRAG-20k
Creator: maas
Published: 2025-12-02 17:18:15
License: No description available

ModelScope community. Updated 2025-12-02. Indexed 2025-07-19.

Download link:

https://modelscope.cn/datasets/THU-KEG/ReaRAG-20k


Description:

# 📘 Dataset Card for ReaRAG-20k

<p align="center">
🤗 <a href="https://huggingface.co/THU-KEG/ReaRAG-9B" target="_blank">Model</a> • 💻 <a href="https://github.com/THU-KEG/ReaRAG" target="_blank">GitHub</a> • 📃 <a href="https://arxiv.org/abs/2503.21729" target="_blank">Paper</a>
</p>

ReaRAG-20k is a reasoning-focused dataset designed for training the ReaRAG model. It contains approximately 20,000 multi-turn retrieval examples constructed from QA datasets such as HotpotQA, MuSiQue, and Natural Questions (NQ).

Each instance follows a conversational format that supports reasoning and retrieval steps:

```json
{
  "messages": [{"role": "user", "content": "..."}, {"role": "assistant", "reasoning": "..."}, {"role": "observation", "content": "..."}, ...]
}
```

During supervised fine-tuning (SFT), the loss is computed only on messages that contain the `reasoning` key, not on messages that contain the `content` key.

# 🔗 Resources

- **Code Repository:** [💻 GitHub](https://github.com/THU-KEG/ReaRAG)
- **Paper:** [📃 ArXiv](https://arxiv.org/abs/2503.21729)
- **Model:** [🤗 Hugging Face](https://huggingface.co/THU-KEG/ReaRAG-9B). A model based on GLM-4-9B and fine-tuned (SFT) on this dataset.

# 📚 Citation

If you use this dataset in your research or projects, please consider citing our work:

```
@article{lee2025rearag,
  title={ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation},
  author={Lee, Zhicheng and Cao, Shulin and Liu, Jinxin and Zhang, Jiajie and Liu, Weichuan and Che, Xiaoyin and Hou, Lei and Li, Juanzi},
  journal={arXiv preprint arXiv:2503.21729},
  year={2025}
}
```
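
The loss rule stated in the dataset card (train only on messages that carry a `reasoning` key) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the function names `split_trainable` and `build_loss_mask` are hypothetical, the whitespace tokenizer stands in for a real model tokenizer, chat-template and role tokens are omitted, and the record is a toy example, not taken from the dataset. The official training code is in the GitHub repository and may differ.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def split_trainable(messages: List[Message]) -> List[Tuple[str, bool]]:
    """Return (text, is_trained) pairs, one per message.

    A message is treated as trainable only if it has a `reasoning` key;
    messages with a `content` key are context and contribute no loss.
    """
    segments = []
    for msg in messages:
        if "reasoning" in msg:
            segments.append((msg["reasoning"], True))
        else:
            segments.append((msg.get("content", ""), False))
    return segments


def build_loss_mask(
    messages: List[Message],
    tokenize: Callable[[str], List[str]] = str.split,  # placeholder tokenizer (assumption)
) -> Tuple[List[str], List[int]]:
    """Flatten a conversation into tokens plus a 0/1 loss mask."""
    tokens: List[str] = []
    mask: List[int] = []
    for text, trained in split_trainable(messages):
        piece = tokenize(text)
        tokens.extend(piece)
        mask.extend([1 if trained else 0] * len(piece))
    return tokens, mask


# Toy record in the documented format (hypothetical content).
example = {
    "messages": [
        {"role": "user", "content": "Who wrote Hamlet?"},
        {"role": "assistant", "reasoning": "I should search for the author."},
        {"role": "observation", "content": "Hamlet is a tragedy by William Shakespeare."},
        {"role": "assistant", "reasoning": "The answer is William Shakespeare."},
    ]
}

tokens, mask = build_loss_mask(example["messages"])
print(sum(mask), "of", len(tokens), "tokens contribute to the loss")
```

In a real training loop the mask would typically be applied by setting the labels of masked positions to the loss function's ignore index, so that user and observation turns serve only as context.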


Provider:

maas

Created:

2025-07-15
