ReaRAG-20k
Source: ModelScope community · Updated 2025-12-02 · Indexed 2025-07-19
Download link:
https://modelscope.cn/datasets/THU-KEG/ReaRAG-20k
Resource description:
# 📘 Dataset Card for ReaRAG-20k
<p align="center">
🤗 <a href="https://huggingface.co/THU-KEG/ReaRAG-9B" target="_blank">Model</a> • 💻 <a href="https://github.com/THU-KEG/ReaRAG" target="_blank">GitHub</a> • 📃 <a href="https://arxiv.org/abs/2503.21729" target="_blank">Paper</a>
</p>
ReaRAG-20k is a reasoning-focused dataset designed for training the ReaRAG model. It contains approximately 20,000 multi-turn retrieval examples constructed from QA datasets such as HotpotQA, MuSiQue, and Natural Questions (NQ).
Each instance follows a conversational format supporting reasoning and retrieval steps:
```json
{
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "reasoning": "..."},
    {"role": "observation", "content": "..."},
    ...
  ]
}
```
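As a rough illustration of this format, the sketch below groups the messages of one record by role. The record is a hypothetical placeholder that only mirrors the structure shown above, and the helper names `message_text` and `group_by_role` are illustrative assumptions, not part of the ReaRAG codebase.

```python
from collections import defaultdict

# Hypothetical record mirroring the documented structure; the strings are
# placeholders, not real dataset content.
example = {
    "messages": [
        {"role": "user", "content": "<question>"},
        {"role": "assistant", "reasoning": "<reasoning step 1>"},
        {"role": "observation", "content": "<retrieved text>"},
        {"role": "assistant", "reasoning": "<reasoning step 2>"},
    ]
}


def message_text(message):
    """Return the text of a message, whichever of the two keys it uses."""
    if "reasoning" in message:
        return message["reasoning"]
    return message["content"]


def group_by_role(record):
    """Collect message texts per role, preserving their original order."""
    grouped = defaultdict(list)
    for message in record["messages"]:
        grouped[message["role"]].append(message_text(message))
    return dict(grouped)


grouped = group_by_role(example)
```

Reading the text through a single helper keeps the rest of a pipeline independent of which key a given role uses.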
During supervised fine-tuning (SFT), the loss is computed only on messages that contain the `reasoning` key, not on messages that contain the `content` key.
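One common way to implement this kind of selective loss is to build a label sequence in which unsupervised tokens carry an ignore index. The sketch below is a minimal illustration under stated assumptions: it uses a toy whitespace tokenizer (`toy_tokenize`, hypothetical), omits chat templates and role tokens, and uses `-100`, the default `ignore_index` of `torch.nn.CrossEntropyLoss`. The actual training code in the ReaRAG repository may differ.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

vocab = {}


def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one id per whitespace token."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_labels(messages, tokenize):
    """Concatenate all messages; supervise only those with a `reasoning` key."""
    input_ids, labels = [], []
    for message in messages:
        supervised = "reasoning" in message
        text = message["reasoning"] if supervised else message["content"]
        ids = tokenize(text)
        input_ids.extend(ids)
        # Tokens from `content` messages are masked out of the loss.
        labels.extend(ids if supervised else [IGNORE_INDEX] * len(ids))
    return input_ids, labels


# Placeholder conversation, not real dataset content.
messages = [
    {"role": "user", "content": "who wrote it"},
    {"role": "assistant", "reasoning": "search the author"},
    {"role": "observation", "content": "some passage"},
]

input_ids, labels = build_labels(messages, toy_tokenize)
```

With this layout, user and observation tokens still appear in the model input as context, but only the assistant's reasoning tokens contribute to the loss.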
# 🔗 Resources
- **Code Repository:** [💻 GitHub](https://github.com/THU-KEG/ReaRAG)
- **Paper:** [📃 ArXiv](https://arxiv.org/abs/2503.21729)
- **Model:** [🤗 Hugging Face](https://huggingface.co/THU-KEG/ReaRAG-9B). A model based on GLM-4-9B and fine-tuned (SFT) on this dataset.
# 📚 Citation
If you use this dataset in your research or projects, please consider citing our work:
```bibtex
@article{lee2025rearag,
title={ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation},
author={Lee, Zhicheng and Cao, Shulin and Liu, Jinxin and Zhang, Jiajie and Liu, Weichuan and Che, Xiaoyin and Hou, Lei and Li, Juanzi},
journal={arXiv preprint arXiv:2503.21729},
year={2025}
}
```
Provider: maas
Created: 2025-07-15



