UltraRAG_Benchmark
Source: ModelScope community. Updated 2026-05-16; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/UltraRAG/UltraRAG_Benchmark
Description:
## UltraRAG 2.0: Accelerating RAG for Scientific Research
UltraRAG 2.0 (UR-2.0) is jointly released by <a href="https://nlp.csai.tsinghua.edu.cn/" target="_blank">THUNLP</a>, <a href="https://neuir.github.io" target="_blank">NEUIR</a>, <a href="https://www.openbmb.cn/home" target="_blank">OpenBMB</a>, and <a href="https://github.com/AI9Stars" target="_blank">AI9Stars</a>. It is the first lightweight framework for building RAG systems on the Model Context Protocol (MCP) architecture, designed to provide efficient modeling support for scientific research and exploration. The framework offers a full suite of teaching examples from beginner to advanced levels and integrates 17 mainstream benchmark tasks and a wide range of high-quality baselines. Together with a unified evaluation system and knowledge base support, this is intended to improve system development efficiency and experiment reproducibility.
For more information, please visit our [GitHub repo](https://github.com/OpenBMB/UltraRAG) and [Tutorial Documentation](https://ultrarag.openbmb.cn). If you find this repository helpful for your research, please consider giving us a ⭐ to show your support.
## Dataset Card
UltraRAG 2.0 is ready to use out of the box, with native support for the most widely used **public benchmark datasets** and **large-scale corpora** in the RAG field, allowing researchers to quickly reproduce and extend experiments. We will also continue to integrate commonly used, high-quality datasets and corpora to further enhance research and application support.
### 1. Supported Datasets
| Task Type | Dataset Name | Original Data Size | Evaluation Sample Size |
|:------------------|:----------------------|:-------------------------------------------|:------------------------|
| QA | [NQ](https://huggingface.co/datasets/google-research-datasets/nq_open) | 3,610 | 1,000 |
| QA | [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) | 11,313 | 1,000 |
| QA | [PopQA](https://huggingface.co/datasets/akariasai/PopQA) | 14,267 | 1,000 |
| QA | [AmbigQA](https://huggingface.co/datasets/sewon/ambig_qa) | 2,002 | 1,000 |
| QA                | [MarcoQA](https://huggingface.co/datasets/microsoft/ms_marco/viewer/v2.1/validation)       | 55,636                                     | 1,000                   |
| QA | [WebQuestions](https://huggingface.co/datasets/stanfordnlp/web_questions) | 2,032 | 1,000 |
| VQA | [MP-DocVQA](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Test-MP-DocVQA) | 591 | 591 |
| VQA | [ChartQA](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Test-ChartQA) | 63 | 63 |
| VQA | [InfoVQA](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Test-InfoVQA) | 718 | 718 |
| VQA | [PlotQA](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Test-PlotQA) | 863 | 863 |
| Multi-hop QA | [HotpotQA](https://huggingface.co/datasets/hotpotqa/hotpot_qa) | 7,405 | 1,000 |
| Multi-hop QA | [2WikiMultiHopQA](https://www.dropbox.com/scl/fi/heid2pkiswhfaqr5g0piw/data.zip?e=2&file_subpath=%2Fdata&rlkey=ira57daau8lxfj022xvk1irju) | 12,576 | 1,000 |
| Multi-hop QA | [Musique](https://drive.google.com/file/d/1tGdADlNjWFaHLeZZGShh2IRcpO6Lv24h/view) | 2,417 | 1,000 |
| Multi-hop QA | [Bamboogle](https://huggingface.co/datasets/chiayewken/bamboogle) | 125 | 125 |
| Multi-hop QA | [StrategyQA](https://huggingface.co/datasets/tasksource/strategy-qa) | 2,290 | 1,000 |
| Multi-hop VQA | [SlideVQA](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Test-SlideVQA) | 556 | 556 |
| Multiple-choice | [ARC](https://huggingface.co/datasets/allenai/ai2_arc) | 3,548 | 1,000 |
| Multiple-choice | [MMLU](https://huggingface.co/datasets/cais/mmlu) | 14,042 | 1,000 |
| Multiple-choice VQA | [ArXivQA](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Test-ArxivQA) | 816 | 816 |
| Long-form QA | [ASQA](https://huggingface.co/datasets/din0s/asqa) | 948 | 948 |
| Fact-verification | [FEVER](https://fever.ai/dataset/fever.html)     | 13,332                                     | 1,000                   |
| Dialogue | [WoW](https://huggingface.co/datasets/facebook/kilt_tasks) | 3,054 | 1,000 |
| Slot-filling | [T-REx](https://huggingface.co/datasets/facebook/kilt_tasks) | 5,000 | 1,000 |
We provide two versions of each benchmark. The vanilla version uses the official development or test set of the corresponding benchmark directly (note that some datasets do not release test set labels). The leaderboard version is a unified sampled subset curated for our Leaderboard evaluation. Choose whichever version suits your needs.
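The evaluation sample sizes in the table above are consistent with capping each benchmark at 1,000 examples and keeping smaller benchmarks whole. The card does not state how the leaderboard subset was drawn, so the following is only a minimal sketch of one reproducible way to build such a subset yourself. The function name `sample_eval_subset`, the default seed, and the order-preserving behavior are all assumptions made for illustration.

```python
import random


def sample_eval_subset(records, max_size=1000, seed=42):
    """Return a reproducible subset of at most `max_size` records.

    Hypothetical helper: the seed and sampling scheme are assumptions,
    not the procedure used for the released leaderboard files.
    """
    if len(records) <= max_size:
        return list(records)  # small benchmarks are kept whole
    rng = random.Random(seed)  # local RNG, global random state is untouched
    picked = sorted(rng.sample(range(len(records)), max_size))
    return [records[i] for i in picked]  # preserve the original order


# Stand-in for a vanilla split with 3,610 examples (the NQ size in the table).
vanilla = [{"id": i} for i in range(3610)]
leaderboard_like = sample_eval_subset(vanilla)
print(len(leaderboard_like))
```

To reproduce Leaderboard numbers, use the released leaderboard files rather than re-sampling, since a different seed or scheme yields a different subset.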
We have ensured maximum consistency with the original data and clearly annotated all sources. Below are special handling notes for certain datasets:
- MarcoQA: The original data includes unanswerable cases, which we have removed.
- Multiple-choice datasets: ARC options are labeled with uppercase letters A–E, though option E occurs only once. MMLU options are labeled with uppercase letters A–D. Please be mindful of this when designing prompts.
- ASQA: Short answers are used as labels, while long answers are retained in the meta_data field.
- FEVER: Only the “support” and “refute” labels are preserved.
**Data Format Specification**
To ensure full compatibility with all UltraRAG modules, users are advised to store test data in .jsonl format following the specifications below.
Non-multiple-choice data format:
```json
{
  "id": 0,
  "question": "where does the karate kid 2010 take place",
  "golden_answers": ["China", "Beijing", "Beijing, China"],
  "meta_data": {}
}
```
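As a minimal sketch, records in this shape can be written to and read back from a `.jsonl` file with one JSON object per line. The helper names (`validate_record`, `write_jsonl`, `read_jsonl`) and the file name are hypothetical, and treating all four keys from the example as required is an assumption rather than a documented rule.

```python
import json
import os
import tempfile

# Assumption: the four keys shown in the example record are all required.
REQUIRED_KEYS = {"id", "question", "golden_answers", "meta_data"}


def validate_record(record):
    """Raise ValueError if a record does not match the example layout."""
    missing = REQUIRED_KEYS - set(record)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(record["golden_answers"], list):
        raise ValueError("golden_answers must be a list of strings")
    return record


def write_jsonl(records, path):
    """Write one validated JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            line = json.dumps(validate_record(record), ensure_ascii=False)
            f.write(line + "\n")


def read_jsonl(path):
    """Read and validate every non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        return [validate_record(json.loads(line)) for line in f if line.strip()]


example = {
    "id": 0,
    "question": "where does the karate kid 2010 take place",
    "golden_answers": ["China", "Beijing", "Beijing, China"],
    "meta_data": {},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nq_sample.jsonl")  # hypothetical file name
    write_jsonl([example], path)
    print(read_jsonl(path))
```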
Multiple-choice data format:
```json
{
  "id": 0,
  "question": "Mast Co. converted from the FIFO method for inventory valuation to the LIFO method for financial statement and tax purposes. During a period of inflation would Mast's ending inventory and income tax payable using LIFO be higher or lower than FIFO? Ending inventory Income tax payable",
  "golden_answers": ["A"],
  "choices": ["Lower Lower", "Higher Higher", "Lower Higher", "Higher Lower"],
  "meta_data": {"subject": "professional_accounting"}
}
```
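Since `golden_answers` holds option letters while `choices` holds the option texts, a prompt builder has to pair the two. The sketch below derives the letters from the length of `choices`, so it covers both MMLU (A–D) and ARC (A–E). The prompt layout and the function names `format_choices` and `answer_texts` are illustrative assumptions, not UltraRAG's own prompt template.

```python
import string


def format_choices(record):
    """Render the question followed by lettered options, one per line."""
    lines = [record["question"]]
    for letter, choice in zip(string.ascii_uppercase, record["choices"]):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines)


def answer_texts(record):
    """Map each golden answer letter back to the text of its option."""
    letters = string.ascii_uppercase
    return [record["choices"][letters.index(a)] for a in record["golden_answers"]]


record = {
    "id": 0,
    "question": "Which option is correct?",  # shortened stand-in question
    "golden_answers": ["A"],
    "choices": ["Lower Lower", "Higher Higher", "Lower Higher", "Higher Lower"],
    "meta_data": {"subject": "professional_accounting"},
}

print(format_choices(record))
print(answer_texts(record))
```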
---
### 2. Supported Corpora
| Corpus Name | Number of Documents |
|:-------------|:---------------------|
| Wiki-2018 | 21,015,324 |
| Wiki-2024 | 30,463,973 |
| MP-DocVQA | 741 |
| ChartQA | 500 |
| InfoVQA | 459 |
| PlotQA | 9,593 |
| SlideVQA | 1,284 |
| ArXivQA | 8,066 |
For Wiki-2018, we use the corpus version provided by [FlashRAG](https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/tree/main/retrieval-corpus). We are also preparing an up-to-date Wiki corpus for research use.
**Data Format Specification**
Text Corpus Format:
```json
{
  "id": "15106858",
  "contents": "Arrowhead Stadium 1970s practice would eventually spread to the other NFL stadiums as the 1970s progressed, finally becoming mandatory league-wide in the 1978 season (after being used in Super Bowl XII), and become almost near-universal at the lower levels of football. On January 20, 1974, Arrowhead Stadium hosted the Pro Bowl. Due to an ice storm and brutally cold temperatures the week leading up to the game, the game's participants worked out at the facilities of the San Diego Chargers. On game day, the temperature soared to 41 F, melting most of the ice and snow that accumulated during the week. The AFC defeated the NFC, 15–13."
}
```
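With tens of millions of documents in the Wiki corpora, loading a whole file into memory is rarely practical. The following sketch streams documents one at a time, assuming the corpus is stored as one JSON object per line like the test data. The function name `iter_corpus` and the file name are hypothetical.

```python
import json
import os
import tempfile


def iter_corpus(path):
    """Yield (id, contents) pairs lazily, one document per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                doc = json.loads(line)
                yield doc["id"], doc["contents"]


docs = [
    {"id": "15106858", "contents": "Arrowhead Stadium ..."},  # shortened text
    {"id": "15106859", "contents": "Another passage ..."},    # made-up entry
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus_sample.jsonl")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")
    for doc_id, contents in iter_corpus(path):
        print(doc_id, len(contents))
```

A generator keeps memory use flat regardless of corpus size, at the cost of being single-pass.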
Image Corpus Format:
```json
{
  "id": 0,
  "image_id": "37313.jpeg",
  "image_path": "image/37313.jpg"
}
```
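Before indexing an image corpus, it can help to confirm that every `image_path` points at a real file. The sketch below assumes the index is stored as one JSON object per line and that `image_path` is relative to the corpus root directory. Neither is stated in the card, and the function name `missing_images` and the file names are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def missing_images(index_file, corpus_root):
    """Return the image_path values that do not exist under corpus_root."""
    root = Path(corpus_root)
    missing = []
    with open(index_file, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            entry = json.loads(line)
            # Assumption: image_path is relative to the corpus root.
            if not (root / entry["image_path"]).is_file():
                missing.append(entry["image_path"])
    return missing


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "image"))
    Path(tmp, "image", "37313.jpg").write_bytes(b"")  # empty placeholder file
    index = os.path.join(tmp, "corpus.jsonl")  # hypothetical file name
    entries = [
        {"id": 0, "image_id": "37313.jpeg", "image_path": "image/37313.jpg"},
        {"id": 1, "image_id": "0.jpeg", "image_path": "image/0.jpg"},  # made up
    ]
    with open(index, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
    print(missing_images(index, tmp))
```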
Provider: maas
Created: 2025-10-21



