FlashRAG Datasets
Source: ModelScope community; updated 2026-05-23; indexed 2024-06-08
Download link:
https://modelscope.cn/datasets/hhjinjiajie/FlashRAG_Dataset
Resource description:
# ⚡FlashRAG: A Python Toolkit for Efficient RAG Research
FlashRAG is a Python toolkit for the reproduction and development of Retrieval Augmented Generation (RAG) research. Our toolkit includes 32 pre-processed benchmark RAG datasets and 13 state-of-the-art RAG algorithms.
With FlashRAG and provided resources, you can effortlessly reproduce existing SOTA works in the RAG domain or implement your custom RAG processes and components.
For more information, please view our GitHub repo and paper:
**GitHub repo**: [https://github.com/RUC-NLPIR/FlashRAG/](https://github.com/RUC-NLPIR/FlashRAG/)
**Huggingface Datasets**: [https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/](https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/tree/main)
**Paper link**: [FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research](https://arxiv.org/abs/2405.13576).
# Dataset Card for FlashRAG Datasets
We have collected 35 datasets widely used in RAG research and pre-processed them into a consistent format for ease of use. Some datasets (such as WikiASP) were adapted to fit the requirements of RAG tasks, following the methods commonly used within the community.
## Dataset Details
For each dataset, we save each split as a `jsonl` file, and each line is a dict as follows:
```python
{
'id': str,
'question': str,
'golden_answers': List[str],
'metadata': dict
}
```
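The record format above can be read and checked with a few lines of standard-library Python. This is a minimal sketch, not part of the FlashRAG toolkit: the function names `validate_record` and `read_split` are hypothetical, and the strictness of the type checks is an assumption based only on the schema shown above.

```python
import json
from typing import Any, Dict, Iterator

# Keys listed in the schema above.
REQUIRED_KEYS = {"id", "question", "golden_answers", "metadata"}


def validate_record(record: Dict[str, Any]) -> None:
    """Raise ValueError if a record does not match the documented schema.

    Hypothetical helper: the checks mirror the schema in this card only.
    """
    missing = REQUIRED_KEYS - set(record)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(record["question"], str):
        raise ValueError("'question' must be a string")
    answers = record["golden_answers"]
    if not isinstance(answers, list) or not all(isinstance(a, str) for a in answers):
        raise ValueError("'golden_answers' must be a list of strings")
    if not isinstance(record["metadata"], dict):
        raise ValueError("'metadata' must be a dict")


def read_split(path: str) -> Iterator[Dict[str, Any]]:
    """Yield validated records from one jsonl split file, one per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            validate_record(record)
            yield record
```

A typical call would be `for record in read_split("nq/test.jsonl"): ...`, where the path is a placeholder for wherever a split file has been downloaded.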
Below is the list of datasets along with the corresponding sample sizes:
| Task | Dataset Name | Knowledge Source | # Train | # Dev | # Test |
|---------------------------|-----------------|------------------|-----------|---------|--------|
| QA | NQ | wiki | 79,168 | 8,757 | 3,610 |
| QA | TriviaQA | wiki & web | 78,785 | 8,837 | 11,313 |
| QA | PopQA | wiki | / | / | 14,267 |
| QA | SQuAD | wiki | 87,599 | 10,570 | / |
| QA | MSMARCO-QA | web | 808,731 | 101,093 | / |
| QA | NarrativeQA | books and story | 32,747 | 3,461 | 10,557 |
| QA | WikiQA | wiki | 20,360 | 2,733 | 6,165 |
| QA | WebQuestions | Google Freebase | 3,778 | / | 2,032 |
| QA | AmbigQA | wiki | 10,036 | 2,002 | / |
| QA | SIQA | - | 33,410 | 1,954 | / |
| QA | CommonSenseQA | - | 9,741 | 1,221 | / |
| QA | BoolQ | wiki | 9,427 | 3,270 | / |
| QA | PIQA | - | 16,113 | 1,838 | / |
| QA | Fermi | wiki | 8,000 | 1,000 | 1,000 |
| multi-hop QA | HotpotQA | wiki | 90,447 | 7,405 | / |
| multi-hop QA | 2WikiMultiHopQA | wiki | 15,000 | 12,576 | / |
| multi-hop QA | Musique | wiki | 19,938 | 2,417 | / |
| multi-hop QA | Bamboogle | wiki | / | / | 125 |
| multi-hop QA              | StrategyQA      | wiki             | 2,290     | /       | /      |
| Long-form QA | ASQA | wiki | 4,353 | 948 | / |
| Long-form QA | ELI5 | Reddit | 272,634 | 1,507 | / |
| Long-form QA | WikiPassageQA | wiki | 3,332 | 417 | 416 |
| Open-Domain Summarization | WikiASP | wiki | 300,636 | 37,046 | 37,368 |
| multiple-choice | MMLU | - | 99,842 | 1,531 | 14,042 |
| multiple-choice | TruthfulQA | wiki | / | 817 | / |
| multiple-choice | HellaSWAG | ActivityNet | 39,905 | 10,042 | / |
| multiple-choice | ARC | - | 3,370 | 869 | 3,548 |
| multiple-choice | OpenBookQA | - | 4,957 | 500 | 500 |
| multiple-choice           | QuaRTz          | -                | 2,696     | 384     | 784    |
| Fact Verification | FEVER | wiki | 104,966 | 10,444 | / |
| Dialog Generation | WOW | wiki | 63,734 | 3,054 | / |
| Entity Linking            | AIDA CoNLL-YAGO | Freebase & wiki  | 18,395    | 4,784   | /      |
| Entity Linking            | WNED            | wiki             | /         | 8,995   | /      |
| Slot Filling | T-REx | DBPedia | 2,284,168 | 5,000 | / |
| Slot Filling | Zero-shot RE | wiki | 147,909 | 3,724 | / |
| In-domain QA              | DomainRAG       | Web pages of RUC | /         | /       | 485    |
## Retrieval Corpus
We also provide a document corpus for retrieval, located at FlashRAG/retrieval-corpus. Each line of a corpus file is a JSON object:
```jsonl
{"id":"0", "contents": "...."}
{"id":"1", "contents": "..."}
```
The `contents` key is essential for building the index. For documents that include both text and title, we recommend setting the value of `contents` to `{title}\n{text}`. The corpus file can also contain other keys to record additional characteristics of the documents.
Details of the provided corpora can be found in our GitHub repo: [https://github.com/RUC-NLPIR/FlashRAG?tab=readme-ov-file#document-corpus](https://github.com/RUC-NLPIR/FlashRAG?tab=readme-ov-file#document-corpus).
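As an illustration of the corpus format, the sketch below turns documents with a title and text into corpus lines, joining them as `{title}\n{text}` in the `contents` key as recommended above. The helper names `to_corpus_line` and `write_corpus` and the input field names `title` and `text` are assumptions made for this example, not FlashRAG APIs.

```python
import json
from typing import Dict, Iterable


def to_corpus_line(doc_id: str, text: str, title: str = "") -> str:
    """Serialize one document as a single jsonl line (hypothetical helper).

    When a title is present, `contents` is "{title}\n{text}"; otherwise
    it is just the text.
    """
    contents = f"{title}\n{text}" if title else text
    # json.dumps escapes the newline, so each record stays on one line.
    return json.dumps({"id": doc_id, "contents": contents}, ensure_ascii=False)


def write_corpus(docs: Iterable[Dict[str, str]], path: str) -> int:
    """Write documents to a jsonl corpus file and return how many were written.

    Each input dict is assumed to have a "text" key and an optional "title" key;
    ids are assigned as consecutive strings starting from "0".
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for i, doc in enumerate(docs):
            f.write(to_corpus_line(str(i), doc["text"], doc.get("title", "")) + "\n")
            count += 1
    return count
```

Extra keys can be added to the dict passed to `json.dumps` if additional document attributes need to be recorded alongside `contents`.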
## Citation
**BibTeX:**
Please cite our paper if it helps your research:
```BibTex
@article{FlashRAG,
author={Jiajie Jin and
Yutao Zhu and
Xinyu Yang and
Chenghao Zhang and
Zhicheng Dou},
title={FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research},
journal={CoRR},
volume={abs/2405.13576},
year={2024},
url={https://arxiv.org/abs/2405.13576},
eprinttype={arXiv},
eprint={2405.13576}
}
```
Provider: maas
Created: 2024-11-01
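Dataset overview
The FlashRAG datasets accompany FlashRAG, a Python toolkit for retrieval-augmented generation (RAG) research that includes 32 pre-processed benchmark datasets and 13 state-of-the-art algorithms and supports tasks such as QA and multi-hop QA. The datasets are provided in a unified jsonl format together with a retrieval corpus, which makes it easier for researchers to reproduce and develop RAG work.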



