FlashRAG Dataset

Name: FlashRAG Dataset
Creator: maas
Published: 2026-05-23 23:00:42
License: No description available

ModelScope community. Updated 2026-05-23. Indexed 2024-06-08.

Download link:

https://modelscope.cn/datasets/hhjinjiajie/FlashRAG_Dataset

Resource overview:

# ⚡FlashRAG: A Python Toolkit for Efficient RAG Research

FlashRAG is a Python toolkit for the reproduction and development of Retrieval Augmented Generation (RAG) research. Our toolkit includes 32 pre-processed benchmark RAG datasets and 13 state-of-the-art RAG algorithms. With FlashRAG and provided resources, you can effortlessly reproduce existing SOTA works in the RAG domain or implement your custom RAG processes and components.

For more information, please view our GitHub repo and paper:

**GitHub repo**: [https://github.com/RUC-NLPIR/FlashRAG/](https://github.com/RUC-NLPIR/FlashRAG/)

**Huggingface Datasets**: [https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/](https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/tree/main)

**Paper link**: [FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research](https://arxiv.org/abs/2405.13576).

# Dataset Card for FlashRAG Datasets

We have collected and processed 35 datasets widely used in RAG research, pre-processing them to ensure a consistent format for ease of use. For certain datasets (such as Wiki-asp), we have adapted them to fit the requirements of RAG tasks according to the methods commonly used within the community.

## Dataset Details

For each dataset, we save each split as a `jsonl` file, and each line is a dict as follows:

```python
{
    'id': str,
    'question': str,
    'golden_answers': List[str],
    'metadata': dict
}
```

Below is the list of datasets along with the corresponding sample sizes, where `/` indicates that the split is not provided:

| Task                      | Dataset Name    | Knowledge Source | # Train   | # Dev   | # Test |
|---------------------------|-----------------|------------------|-----------|---------|--------|
| QA                        | NQ              | wiki             | 79,168    | 8,757   | 3,610  |
| QA                        | TriviaQA        | wiki & web       | 78,785    | 8,837   | 11,313 |
| QA                        | PopQA           | wiki             | /         | /       | 14,267 |
| QA                        | SQuAD           | wiki             | 87,599    | 10,570  | /      |
| QA                        | MSMARCO-QA      | web              | 808,731   | 101,093 | /      |
| QA                        | NarrativeQA     | books and stories | 32,747   | 3,461   | 10,557 |
| QA                        | WikiQA          | wiki             | 20,360    | 2,733   | 6,165  |
| QA                        | WebQuestions    | Google Freebase  | 3,778     | /       | 2,032  |
| QA                        | AmbigQA         | wiki             | 10,036    | 2,002   | /      |
| QA                        | SIQA            | -                | 33,410    | 1,954   | /      |
| QA                        | CommonSenseQA   | -                | 9,741     | 1,221   | /      |
| QA                        | BoolQ           | wiki             | 9,427     | 3,270   | /      |
| QA                        | PIQA            | -                | 16,113    | 1,838   | /      |
| QA                        | Fermi           | wiki             | 8,000     | 1,000   | 1,000  |
| multi-hop QA              | HotpotQA        | wiki             | 90,447    | 7,405   | /      |
| multi-hop QA              | 2WikiMultiHopQA | wiki             | 15,000    | 12,576  | /      |
| multi-hop QA              | Musique         | wiki             | 19,938    | 2,417   | /      |
| multi-hop QA              | Bamboogle       | wiki             | /         | /       | 125    |
| multi-hop QA              | StrategyQA      | wiki             | 2,290     | /       | /      |
| Long-form QA              | ASQA            | wiki             | 4,353     | 948     | /      |
| Long-form QA              | ELI5            | Reddit           | 272,634   | 1,507   | /      |
| Long-form QA              | WikiPassageQA   | wiki             | 3,332     | 417     | 416    |
| Open-Domain Summarization | WikiASP         | wiki             | 300,636   | 37,046  | 37,368 |
| multiple-choice           | MMLU            | -                | 99,842    | 1,531   | 14,042 |
| multiple-choice           | TruthfulQA      | wiki             | /         | 817     | /      |
| multiple-choice           | HellaSWAG       | ActivityNet      | 39,905    | 10,042  | /      |
| multiple-choice           | ARC             | -                | 3,370     | 869     | 3,548  |
| multiple-choice           | OpenBookQA      | -                | 4,957     | 500     | 500    |
| multiple-choice           | QuaRTz          | -                | 2,696     | 384     | 784    |
| Fact Verification         | FEVER           | wiki             | 104,966   | 10,444  | /      |
| Dialog Generation         | WOW             | wiki             | 63,734    | 3,054   | /      |
| Entity Linking            | AIDA CoNLL-YAGO | Freebase & wiki  | 18,395    | 4,784   | /      |
| Entity Linking            | WNED            | wiki             | /         | 8,995   | /      |
| Slot Filling              | T-REx           | DBPedia          | 2,284,168 | 5,000   | /      |
| Slot Filling              | Zero-shot RE    | wiki             | 147,909   | 3,724   | /      |
| In-domain QA              | DomainRAG       | Web pages of RUC | /         | /       | 485    |

## Retrieval Corpus

We also provide a corpus document library for retrieval, located at `FlashRAG/retrieval-corpus`.

```jsonl
{"id":"0", "contents": "...."}
{"id":"1", "contents": "..."}
```

The `contents` key is essential for building the index. For documents that include both text and title, we recommend setting the value of `contents` to `{title}\n{text}`. The corpus file can also contain other keys to record additional characteristics of the documents.

Detailed information about the provided corpus can be found on our GitHub page: [https://github.com/RUC-NLPIR/FlashRAG?tab=readme-ov-file#document-corpus](https://github.com/RUC-NLPIR/FlashRAG?tab=readme-ov-file#document-corpus).

## Citation

**BibTeX:**

Please kindly cite our paper if it helps your research:

```BibTex
@article{FlashRAG,
    author={Jiajie Jin and Yutao Zhu and Xinyu Yang and Chenghao Zhang and Zhicheng Dou},
    title={FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research},
    journal={CoRR},
    volume={abs/2405.13576},
    year={2024},
    url={https://arxiv.org/abs/2405.13576},
    eprinttype={arXiv},
    eprint={2405.13576}
}
```
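
As a rough illustration of the two file formats described above (the per-split `jsonl` records and the retrieval corpus), the following sketch parses QA records and builds corpus lines using only the Python standard library. It is not part of the FlashRAG toolkit: the helper names `parse_records` and `make_corpus_line`, the key check, and the sample question are assumptions made for this example.

```python
import json
from typing import Any, Dict, Iterable, Iterator

# Keys listed in the dataset card for every record of every split.
REQUIRED_KEYS = ("id", "question", "golden_answers", "metadata")


def parse_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty JSON line, checking the documented keys.

    Hypothetical helper for illustration; not a FlashRAG API.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        yield record


def make_corpus_line(doc_id: str, title: str, text: str, **extra: Any) -> str:
    """Serialize one corpus document as a single JSON line.

    `contents` is the title, a newline, then the text, as the card
    recommends. Extra keyword arguments become additional keys.
    Hypothetical helper for illustration; not a FlashRAG API.
    """
    doc = {"id": doc_id, "contents": f"{title}\n{text}", **extra}
    return json.dumps(doc, ensure_ascii=False)


# Made-up sample data, not taken from any FlashRAG split.
sample_lines = [
    '{"id": "q0", "question": "Who wrote Hamlet?", '
    '"golden_answers": ["William Shakespeare"], "metadata": {}}',
]
records = list(parse_records(sample_lines))
corpus_line = make_corpus_line(
    "0", "Hamlet", "Hamlet is a tragedy by William Shakespeare."
)
```

To read a real split, the same function can be given an open file handle, for example `parse_records(open(path, encoding="utf-8"))`, where `path` points to whichever `jsonl` file you downloaded. The key check is a defensive choice for this sketch rather than something the dataset card requires.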


Provider:

maas

Created:

2024-11-01

Collection summary

Dataset introduction