AQ-MedAI/RAG-QA-Leaderboard
Source: Hugging Face. Updated 2025-11-19; indexed 2026-01-03.
Download link: https://hf-mirror.com/datasets/AQ-MedAI/RAG-QA-Leaderboard
Resource overview:
---
license: apache-2.0
task_categories:
- question-answering
language:
- en
configs:
- config_name: 2wiki
  data_files: final_data/2wiki.jsonl
  description: "Evaluation set: 2WikiMultiHopQA"
- config_name: hotpotqa
  data_files: final_data/hotpot_distractor.jsonl
  description: "Evaluation set: HotpotQA (distractor setting)"
- config_name: musique
  data_files: final_data/musique.jsonl
  description: "Evaluation set: Musique"
- config_name: popqa
  data_files: final_data/popqa.jsonl
  description: "Evaluation set: PopQA"
- config_name: trivialqa
  data_files: final_data/triviaqa.jsonl
  description: "Evaluation set: TriviaQA"
- config_name: pubmedqa
  data_files: final_data/pubmed.jsonl
  description: "Evaluation set: PubMedQA"
- config_name: documents_pool
  data_files: final_data/documents_pool.jsonl
  description: "Document pool used for retrieval"
tags:
- rag
- medical
- ragbench
- hotpotqa
- 2wiki
- musique
- trivialqa
- popqa
- pubmedqa
---
# Dataset Description
This collection includes **6 widely-used datasets** for open-domain question answering and retrieval evaluation:
`2WikiMultiHopQA`, `HotpotQA`, `Musique`, `PopQA`, `TriviaQA`, and `PubMedQA`.
Our evaluation code is at https://github.com/AQ-MedAI/RagQALeaderboard.
# Leaderboard
Overall Performance of Different Models on Various Tasks:
| Model                           | AVG      | Multi-hop | Single-hop | Medical Domain |
|---------------------------------|----------|-----------|------------|----------------|
| DeepSeekR1-0528                 | **79.5** | 80.0      | 92.4       | 66.0           |
| GPT-4.1-2025-04-14              | 78.8     | 81.6      | 92.8       | 62.0           |
| Baichuan-M2-32B-Think           | 77.6     | 79.9      | **95.0**   | 57.8           |
| Meta-Llama-3-70B                | 76.2     | 71.2      | 88.5       | **69.0**       |
| Gemma-3-27B-Instruct            | 74.8     | 71.8      | 93.3       | 59.2           |
| DeepSeek-V3.2-Exp               | 74.3     | 75.1      | 91.8       | 56.2           |
| Qwen3_Next_80B_Instruct         | 74.2     | **82.5**  | 93.9       | 46.2           |
| Qwen3-235B-A22B-Instruct-2507   | 73.9     | 77.7      | 90.7       | 53.2           |
| Kimi-K2-Instruct                | 72.2     | 76.1      | 90.3       | 50.2           |
| Qwen3-30B-A3B-Instruct-2507     | 72.0     | 73.0      | 90.5       | 52.4           |
| Meta-Llama-3-8B                 | 70.8     | 64.4      | 79.8       | 68.2           |
| Qwen3-235b-A22B-Nothink         | 69.8     | 72.6      | 87.7       | 49.0           |
| PA-RAG_Meta-Llama-3-8B-Instruct | 65.5     | 60.2      | 74.9       | 61.4           |
| Gemma-3-12B-Instruct            | 64.9     | 65.4      | 88.3       | 41.0           |
| Hunyuan 80B-A13B-Instruct       | 63.8     | 68.0      | 85.3       | 38.2           |
| Qwen3-30B-A3B-Nothink           | 63.2     | 63.0      | 88.3       | 38.4           |
Performance of Different Models on Specific Datasets:
| Model                           | 2wiki    | hotpotqa | musique  | single-hop | triviaqa | popqa    | pubmedqa |
|---------------------------------|----------|----------|----------|------------|----------|----------|----------|
| DeepSeekR1-0528                 | 87.4     | 83.2     | 69.4     | 92.4       | 93.8     | 91.0     | 66.0     |
| GPT-4.1-2025-04-14              | 88.8     | 83.0     | **72.9** | 92.8       | 95.5     | 90.1     | 62.0     |
| Baichuan-M2-32B-Think           | 86.4     | **86.4** | 66.9     | **95.0**   | **96.1** | **93.8** | 57.8     |
| Meta-Llama-3-70B                | 80.3     | 76.7     | 56.7     | 88.5       | 94.0     | 83.0     | **69.0** |
| Gemma-3-27B-Instruct            | 77.2     | 79.3     | 58.9     | 93.3       | 94.5     | 92.0     | 59.2     |
| DeepSeek-V3.2-Exp               | 83.4     | 80.4     | 61.4     | 91.8       | 93.5     | 90.0     | 56.2     |
| Qwen3_Next_80B_Instruct         | **92.5** | 84.6     | 70.4     | 93.9       | 95.0     | 92.7     | 46.2     |
| Qwen3-235B-A22B-Instruct-2507   | 84.9     | 82.8     | 65.3     | 90.7       | 93.8     | 87.6     | 53.2     |
| Kimi-K2-Instruct                | 81.7     | 78.5     | 68.1     | 90.3       | 92.8     | 87.7     | 50.2     |
| Qwen3-30B-A3B-Instruct-2507     | 81.4     | 81.9     | 55.8     | 90.5       | 94.2     | 86.7     | 52.4     |
| Meta-Llama-3-8B                 | 61.5     | 63.6     | 68.2     | 79.8       | 88.7     | 70.9     | 68.2     |
| Qwen3-235b-A22B-Nothink         | 81.5     | 77.0     | 59.2     | 87.7       | 93.3     | 82.0     | 49.0     |
| PA-RAG_Meta-Llama-3-8B-Instruct | 68.5     | 68.1     | 44.0     | 74.9       | 85.3     | 64.4     | 61.4     |
| Gemma-3-12B-Instruct            | 72.5     | 73.9     | 49.8     | 88.3       | 92.3     | 84.2     | 41.0     |
| Hunyuan 80B-A13B-Instruct       | 78.6     | 75.3     | 50.1     | 85.3       | 89.6     | 81.0     | 38.2     |
| Qwen3-30B-A3B-Nothink           | 71.0     | 73.3     | 44.6     | 88.3       | 89.7     | 86.8     | 38.4     |
The Medical Domain score currently rests on a single dataset (PubMedQA), so it may be noisy. We plan to add more medical datasets and to keep evaluating additional models.
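The card does not spell out how the grouped columns are derived, but the published numbers are consistent with simple unweighted means: Multi-hop over 2wiki, hotpotqa, and musique; Single-hop over TriviaQA and PopQA; Medical Domain equal to pubmedqa; and AVG over the three group scores. The sketch below reproduces the DeepSeekR1-0528 row under that assumption. The grouping and the `aggregate` helper are inferred for illustration and are not taken from the evaluation code.

```python
from statistics import mean

# Assumed grouping, inferred from the tables rather than documented by the authors.
GROUPS = {
    "Multi-hop": ["2wiki", "hotpotqa", "musique"],
    "Single-hop": ["triviaqa", "popqa"],
    "Medical Domain": ["pubmedqa"],
}

def aggregate(scores):
    """Return group means plus their unweighted average, rounded to one decimal."""
    groups = {name: mean(scores[d] for d in datasets) for name, datasets in GROUPS.items()}
    result = {name: round(value, 1) for name, value in groups.items()}
    result["AVG"] = round(mean(groups.values()), 1)
    return result

# Per-dataset scores for DeepSeekR1-0528, copied from the per-dataset table.
deepseek_r1 = {"2wiki": 87.4, "hotpotqa": 83.2, "musique": 69.4,
               "triviaqa": 93.8, "popqa": 91.0, "pubmedqa": 66.0}
print(aggregate(deepseek_r1))
```

Running the same function on other rows is a quick way to check a new submission for transcription errors before adding it to the leaderboard.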
# Inference
## Installation
```bash
git clone https://github.com/AQ-MedAI/RagQALeaderboard
cd RagQALeaderboard/
pip install -r requirements.txt
# Make sure hf CLI is installed: pip install -U "huggingface_hub[cli]"
hf download AQ-MedAI/RAG-OmniQA --repo-type=dataset
```
## Run Evaluation
```bash
python eval.py --model-name "Qwen3" --model-path "/path/to/model" --eval-dataset hotpotqa popqa
```
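When evaluating several models or dataset combinations, the command above can be assembled programmatically. This is a minimal sketch that uses only the three flags shown above; `build_eval_command` is a hypothetical helper, not part of the repository, and it assumes the script is run from the `RagQALeaderboard/` checkout.

```python
import shlex
import sys

def build_eval_command(model_name, model_path, datasets):
    """Assemble the argv list for eval.py using only the flags shown above."""
    if not datasets:
        raise ValueError("at least one dataset name is required")
    return [sys.executable, "eval.py",
            "--model-name", model_name,
            "--model-path", model_path,
            "--eval-dataset", *datasets]

cmd = build_eval_command("Qwen3", "/path/to/model", ["hotpotqa", "popqa"])
print(shlex.join(cmd))  # inspect the command before running it
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```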
- **Customize Configuration**: Edit the configuration files in the `config/` directory (e.g., `api_prompt_config_en.json`) to customize evaluation parameters.
- **Generate Report**: After evaluation, HTML reports and JSON results are saved in the `reports/` directory.
For more details, see our GitHub repository: https://github.com/AQ-MedAI/RagQALeaderboard.
## Dataset
Each dataset contains the following fields:
- `query`: The input question or query.
- `groundtruth`: The correct answer(s) to the query.
- `golden_docs`: Documents that contain the evidence or support for the correct answer.
- `noise_docs`: Distractor documents that are related to the query but do not contain the correct answer.
This structure enables evaluation of both retrieval accuracy and answer generation performance in multi-hop and single-hop reasoning scenarios.
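As a rough illustration of how these fields fit together, the sketch below builds a shuffled context from `golden_docs` and `noise_docs`, then scores a model answer by normalized substring match against `groundtruth`. The sample record, the assumption that documents are plain strings and that `groundtruth` is a list of strings, and the matching rule are all illustrative; the actual field types and the leaderboard's metric may differ.

```python
import json
import random
import string

# A made-up record in the documented shape; real field types may differ.
line = json.dumps({
    "query": "Which city is the capital of France?",
    "groundtruth": ["Paris"],
    "golden_docs": ["Paris is the capital and largest city of France."],
    "noise_docs": ["Lyon is the third-largest city of France."],
})

def build_context(record, seed=0):
    """Mix evidence and distractor documents so position does not leak the answer."""
    docs = list(record["golden_docs"]) + list(record["noise_docs"])
    random.Random(seed).shuffle(docs)
    return "\n\n".join(docs)

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def is_correct(answer, groundtruth):
    """True if any reference answer appears in the normalized model answer."""
    return any(normalize(ref) in normalize(answer) for ref in groundtruth)

record = json.loads(line)
prompt = f"{build_context(record)}\n\nQuestion: {record['query']}"
print(is_correct("The capital is Paris.", record["groundtruth"]))
```

Substring matching is lenient toward verbose answers; a stricter exact-match or token-level F1 rule could be swapped in for `is_correct` without touching the rest of the sketch.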
## Document Pool
We also provide a unified `documents_pool` derived from Wikipedia, serving as a retrieval corpus. This pool has been pre-processed using **Contriever** for initial retrieval, making it efficient and convenient for training and evaluating retrieval models.
The document pool supports plug-and-play integration with standard retrieval and QA pipelines, allowing researchers to perform end-to-end experiments with minimal setup.
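To show where a retriever plugs into such a pipeline, here is a dependency-free sketch. Contriever is a dense retriever; the example deliberately swaps in a simple token-overlap score as a stand-in, so it illustrates only the interface, not Contriever's ranking quality. The toy pool and the `retrieve` helper are hypothetical, and the real `documents_pool` record format is not assumed here.

```python
def tokenize(text):
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,;:!?\"'()").lower() for w in text.split()} - {""}

def retrieve(query, pool, k=2):
    """Rank pool documents by token overlap with the query (stand-in for a dense retriever)."""
    q = tokenize(query)
    ranked = sorted(pool, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]

# Toy pool for illustration; the real documents_pool is far larger.
pool = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "The Eiffel Tower is located in Paris.",
    "Marie Curie was born in Warsaw.",
]
print(retrieve("Where was Marie Curie born?", pool, k=1))
```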
## Dataset Structure
The dataset files are located inside the `final_data` folder.
```text
.
├── final_data/
│ ├── 2wiki.jsonl
│ ├── documents_pool.jsonl
│ ├── hotpot_distractor.jsonl
│ ├── musique.jsonl
│ ├── popqa.jsonl
│ ├── pubmed.jsonl
│ └── triviaqa.jsonl
└── README.md
```
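A minimal sketch for reading these files locally, assuming the repository has already been downloaded so that `final_data/` sits under a local root directory. The mapping mirrors the `configs` section of the dataset card; `iter_records` and its `root` argument are hypothetical names introduced for this example.

```python
import json
from pathlib import Path

# Config name -> file, copied from the dataset card's `configs` section.
CONFIG_FILES = {
    "2wiki": "final_data/2wiki.jsonl",
    "hotpotqa": "final_data/hotpot_distractor.jsonl",
    "musique": "final_data/musique.jsonl",
    "popqa": "final_data/popqa.jsonl",
    "trivialqa": "final_data/triviaqa.jsonl",
    "pubmedqa": "final_data/pubmed.jsonl",
}

def iter_records(config, root="."):
    """Yield one parsed JSON object per non-empty line of the config's JSONL file."""
    path = Path(root) / CONFIG_FILES[config]
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)

# Example usage, assuming the download lives in ./RAG-QA-Leaderboard:
# for record in iter_records("hotpotqa", root="RAG-QA-Leaderboard"):
#     print(record["query"])
#     break
```

Streaming line by line keeps memory use flat, which matters mainly if the same pattern is applied to the much larger document pool.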
## How to Use
The evaluation code is available at https://github.com/AQ-MedAI/RagQALeaderboard.
Provider: AQ-MedAI



