FlashRAG_datasets

ModelScope community · Updated 2026-01-08 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/yamseyoung/FlashRAG_datasets
Resource overview:
# ⚡FlashRAG: A Python Toolkit for Efficient RAG Research

FlashRAG is a Python toolkit for reproducing and developing Retrieval-Augmented Generation (RAG) research. The toolkit includes 36 pre-processed benchmark RAG datasets and 16 state-of-the-art RAG algorithms. With FlashRAG and the provided resources, you can reproduce existing SOTA work in the RAG domain or implement your own RAG pipelines and components.

For more information, please see our GitHub repo and paper:

GitHub repo: [https://github.com/RUC-NLPIR/FlashRAG/](https://github.com/RUC-NLPIR/FlashRAG/)

Paper link: [FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research](https://arxiv.org/abs/2405.13576).

# Dataset Card for FlashRAG Datasets

<!-- Provide a quick summary of the dataset. -->

We have collected and processed 36 datasets widely used in RAG research and pre-processed them into a consistent format for ease of use. For certain datasets (such as WikiASP), we have adapted them to the requirements of RAG tasks, following the methods commonly used in the community.
## Dataset Details

For each dataset, we save each split as a `jsonl` file, and each line is a dict as follows:

```python
{
    'id': str,
    'question': str,
    'golden_answers': List[str],
    'metadata': dict
}
```

Below is the list of datasets along with the corresponding sample sizes:

| Task | Dataset Name | Knowledge Source | # Train | # Dev | # Test |
|---------------------------|-----------------|------------------|-----------|---------|--------|
| QA | NQ | wiki | 79,168 | 8,757 | 3,610 |
| QA | TriviaQA | wiki & web | 78,785 | 8,837 | 11,313 |
| QA | PopQA | wiki | / | / | 14,267 |
| QA | SQuAD | wiki | 87,599 | 10,570 | / |
| QA | MSMARCO-QA | web | 808,731 | 101,093 | / |
| QA | NarrativeQA | books and stories | 32,747 | 3,461 | 10,557 |
| QA | WikiQA | wiki | 20,360 | 2,733 | 6,165 |
| QA | WebQuestions | Google Freebase | 3,778 | / | 2,032 |
| QA | AmbigQA | wiki | 10,036 | 2,002 | / |
| QA | SIQA | - | 33,410 | 1,954 | / |
| QA | CommonSenseQA | - | 9,741 | 1,221 | / |
| QA | BoolQ | wiki | 9,427 | 3,270 | / |
| QA | PIQA | - | 16,113 | 1,838 | / |
| QA | Fermi | wiki | 8,000 | 1,000 | 1,000 |
| Multi-hop QA | HotpotQA | wiki | 90,447 | 7,405 | / |
| Multi-hop QA | 2WikiMultiHopQA | wiki | 15,000 | 12,576 | / |
| Multi-hop QA | Musique | wiki | 19,938 | 2,417 | / |
| Multi-hop QA | Bamboogle | wiki | / | / | 125 |
| Multi-hop QA | StrategyQA | wiki | 2,290 | / | / |
| Long-form QA | ASQA | wiki | 4,353 | 948 | / |
| Long-form QA | ELI5 | Reddit | 272,634 | 1,507 | / |
| Long-form QA | WikiPassageQA | wiki | 3,332 | 417 | 416 |
| Open-Domain Summarization | WikiASP | wiki | 300,636 | 37,046 | 37,368 |
| Multiple-choice | MMLU | - | 99,842 | 1,531 | 14,042 |
| Multiple-choice | TruthfulQA | wiki | / | 817 | / |
| Multiple-choice | HellaSWAG | ActivityNet | 39,905 | 10,042 | / |
| Multiple-choice | ARC | - | 3,370 | 869 | 3,548 |
| Multiple-choice | OpenBookQA | - | 4,957 | 500 | 500 |
| Multiple-choice | QuaRTz | - | 2,696 | 384 | 784 |
| Fact Verification | FEVER | wiki | 104,966 | 10,444 | / |
| Dialog Generation | WOW | wiki | 63,734 | 3,054 | / |
| Entity Linking | AIDA CoNll-yago | Freebase & wiki | 18,395 | 4,784 | / |
| Entity Linking | WNED | wiki | / | 8,995 | / |
| Slot Filling | T-REx | DBPedia | 2,284,168 | 5,000 | / |
| Slot Filling | Zero-shot RE | wiki | 147,909 | 3,724 | / |
| In-domain QA | DomainRAG | Web pages of RUC | / | / | 485 |

## Retrieval Corpus

We also provide a corpus document library for retrieval, located under the path FlashRAG/retrieval-corpus.

```jsonl
{"id":"0", "contents": "...."}
{"id":"1", "contents": "..."}
```

The `contents` key is essential for building the index. For documents that include both text and title, we recommend setting the value of `contents` to `{title}\n{text}`. The corpus file can also contain other keys to record additional characteristics of the documents.

Detailed information about the provided corpus can be found in our GitHub repo: [https://github.com/RUC-NLPIR/FlashRAG?tab=readme-ov-file#document-corpus](https://github.com/RUC-NLPIR/FlashRAG?tab=readme-ov-file#document-corpus).

## Citation

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

Please cite our paper if it helps your research:

```bibtex
@article{FlashRAG,
  author={Jiajie Jin and Yutao Zhu and Xinyu Yang and Chenghao Zhang and Zhicheng Dou},
  title={FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research},
  journal={CoRR},
  volume={abs/2405.13576},
  year={2024},
  url={https://arxiv.org/abs/2405.13576},
  eprinttype={arXiv},
  eprint={2405.13576}
}
```
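
## Example: reading a dataset split

The per-line record layout given under Dataset Details can be read and sanity-checked with the standard library alone. The sketch below is illustrative and not part of the FlashRAG toolkit: the helper names `schema_problems` and `read_split` and the sample line are made up for this example, and the check covers only the four top-level fields and types stated in this card. Individual files may carry extra keys or deviate from the stated types, so treat any reported problem as a prompt to inspect the file rather than as an error.

```python
import json
from typing import Any, Dict, Iterator, List

# Top-level fields and types as stated in this dataset card.
EXPECTED_FIELDS = {
    "id": str,
    "question": str,
    "golden_answers": list,
    "metadata": dict,
}


def schema_problems(record: Dict[str, Any]) -> List[str]:
    """Return human-readable deviations from the stated layout (empty if none)."""
    problems = []
    for key, expected_type in EXPECTED_FIELDS.items():
        if key not in record:
            problems.append(f"missing field '{key}'")
        elif not isinstance(record[key], expected_type):
            actual = type(record[key]).__name__
            problems.append(
                f"field '{key}' is {actual}, expected {expected_type.__name__}"
            )
    return problems


def read_split(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line of a jsonl split file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Made-up sample line for illustration; not taken from any dataset.
SAMPLE_LINE = (
    '{"id": "example_0", "question": "who wrote hamlet", '
    '"golden_answers": ["William Shakespeare"], "metadata": {}}'
)

if __name__ == "__main__":
    print(schema_problems(json.loads(SAMPLE_LINE)))
```

To scan a downloaded split, iterate over `read_split(path)` with the path of a local jsonl file and collect the records for which `schema_problems` returns a non-empty list.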
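
## Example: writing corpus lines

The recommended `{title}\n{text}` layout for the `contents` key of the retrieval corpus can be produced as follows. This is a minimal sketch under stated assumptions, not the toolkit's own preprocessing code: the function name `to_corpus_line` and the sample document are hypothetical, and falling back to the text alone when a document has no title is a choice made for this example.

```python
import json
from typing import Optional


def to_corpus_line(doc_id: int, text: str, title: Optional[str] = None) -> str:
    """Serialize one document as a single jsonl line with `id` and `contents`."""
    # Join title and text with a newline, as the card recommends;
    # use the text alone when no title is available (assumption of this sketch).
    contents = f"{title}\n{text}" if title else text
    # json.dumps escapes the newline as \n, so the record stays on one physical line.
    return json.dumps({"id": str(doc_id), "contents": contents}, ensure_ascii=False)


if __name__ == "__main__":
    # Made-up document for illustration.
    print(to_corpus_line(0, "Hamlet is a tragedy by William Shakespeare.", title="Hamlet"))
```

Because the newline inside `contents` is escaped during serialization, the one-record-per-line property of the jsonl file is preserved even though the field itself spans two logical lines.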

Provider:
maas
Created:
2025-07-10
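Dataset introduction
Background overview
FlashRAG_datasets is a collection of datasets built for retrieval-augmented generation (RAG) research. It contains 36 pre-processed benchmark datasets covering question answering, multi-hop QA, long-form QA, multiple choice, fact verification, and other tasks, all stored in a unified jsonl format for ease of use. The collection also ships with a retrieval corpus, which supports reproducing RAG algorithms and developing custom ones.
The summary above was collected and generated by 遇见数据集.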