Download link:

https://modelscope.cn/datasets/FreedomIntelligence/RAG-Instruct

Resource overview:

## Introduction

RAG-Instruct is a RAG dataset designed to comprehensively enhance the RAG capabilities of LLMs, synthesized using GPT-4o. It is built on the Wikipedia corpus and offers diversity in both query-document scenarios and tasks. Training on RAG-Instruct improves RAG performance across a range of tasks, as the results for two Llama models show:

| Model                      | WQA (acc) | PQA (acc) | TQA (acc) | OBQA (EM) | Pub (EM) | ARC (EM) | 2WIKI (acc) | HotP (acc) | MSQ (acc) | CFQA (EM) | PubMed (EM) |
|----------------------------|-----------|-----------|-----------|-----------|----------|----------|-------------|------------|-----------|-----------|-------------|
| Llama3.2-3B                | 58.7      | 61.8      | 69.7      | 77.0      | 55.0     | 66.8     | 55.6        | 40.2       | 13.2      | 46.8      | 70.3        |
| Llama3.1-8B                | 59.5      | 60.8      | 73.4      | 82.0      | 56.7     | 77.1     | 65.6        | 45.6       | 18.7      | 56.5      | 73.9        |
| Llama3.2-3B + RAG-Instruct | 65.3      | 64.0      | 77.0      | 81.2      | 66.4     | 73.0     | 72.9        | 52.7       | 25.0      | 50.3      | 72.6        |
| Llama3.1-8B + RAG-Instruct | 69.7      | 68.4      | 79.3      | 84.8      | 77.2     | 79.9     | 79.3        | 56.4       | 30.3      | 57.8      | 77.0        |

For details, see our [paper](https://arxiv.org/abs/2501.00353) and [GitHub repository](https://github.com/FreedomIntelligence/RAG-Instruct).

## Citation

If you find our data useful, please consider citing our work!

```
@misc{liu2024raginstructboostingllmsdiverse,
      title={RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions},
      author={Wanlong Liu and Junying Chen and Ke Ji and Li Zhou and Wenyu Chen and Benyou Wang},
      year={2024},
      eprint={2501.00353},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.00353},
}
```
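To put a number on the gains reported in the results table of the Introduction, the per-benchmark differences can be computed directly. The snippet below is a minimal sketch, not part of the dataset or its official tooling: the scores are copied from the table, while the names `BENCHMARKS`, `SCORES`, `gains`, and `mean_gain` are illustrative assumptions. Note that the columns mix two metrics (accuracy and exact match), so the averaged figure is only a rough summary.

```python
# Benchmark columns, in the same order as the results table.
BENCHMARKS = ["WQA", "PQA", "TQA", "OBQA", "Pub", "ARC",
              "2WIKI", "HotP", "MSQ", "CFQA", "PubMed"]

# Scores copied from the results table (acc or EM, depending on the column).
SCORES = {
    "Llama3.2-3B": [58.7, 61.8, 69.7, 77.0, 55.0, 66.8, 55.6, 40.2, 13.2, 46.8, 70.3],
    "Llama3.1-8B": [59.5, 60.8, 73.4, 82.0, 56.7, 77.1, 65.6, 45.6, 18.7, 56.5, 73.9],
    "Llama3.2-3B + RAG-Instruct": [65.3, 64.0, 77.0, 81.2, 66.4, 73.0, 72.9, 52.7, 25.0, 50.3, 72.6],
    "Llama3.1-8B + RAG-Instruct": [69.7, 68.4, 79.3, 84.8, 77.2, 79.9, 79.3, 56.4, 30.3, 57.8, 77.0],
}


def gains(base: str) -> dict[str, float]:
    """Per-benchmark point gain of '<base> + RAG-Instruct' over '<base>'."""
    tuned = f"{base} + RAG-Instruct"
    return {
        name: round(after - before, 1)
        for name, before, after in zip(BENCHMARKS, SCORES[base], SCORES[tuned])
    }


def mean_gain(base: str) -> float:
    """Unweighted average gain across all benchmark columns."""
    diffs = gains(base).values()
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    for base in ("Llama3.2-3B", "Llama3.1-8B"):
        per_task = gains(base)
        best = max(per_task, key=per_task.get)
        print(f"{base}: mean gain {mean_gain(base):.2f} points, "
              f"largest on {best} (+{per_task[best]})")
```

Every benchmark improves for both base models; the multi-hop columns (2WIKI, HotP, MSQ) and Pub show the largest jumps, while CFQA and PubMed move the least.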

Use cases: