Name: AQ-MedAI/RAG-QA-Leaderboard
Creator: AQ-MedAI
Published: 2025-11-19 12:42:00
License: apache-2.0

Download link:

https://hf-mirror.com/datasets/AQ-MedAI/RAG-QA-Leaderboard


Resource description:

---
license: apache-2.0
task_categories:
- question-answering
language:
- en
configs:
- config_name: 2wiki
  data_files: final_data/2wiki.jsonl
  description: "Evaluation set: 2WikiMultiHopQA"
- config_name: hotpotqa
  data_files: final_data/hotpot_distractor.jsonl
  description: "Evaluation set: HotpotQA (distractor setting)"
- config_name: musique
  data_files: final_data/musique.jsonl
  description: "Evaluation set: Musique"
- config_name: popqa
  data_files: final_data/popqa.jsonl
  description: "Evaluation set: PopQA"
- config_name: trivialqa
  data_files: final_data/triviaqa.jsonl
  description: "Evaluation set: TriviaQA"
- config_name: pubmedqa
  data_files: final_data/pubmed.jsonl
  description: "Evaluation set: PubMedQA"
- config_name: documents_pool
  data_files: final_data/documents_pool.jsonl
  description: "Document pool used for retrieval"
tags:
- rag
- medical
- ragbench
- hotpotqa
- 2wiki
- musique
- trivialqa
- popqa
- pubmedqa
---

# Dataset Description

This collection includes **6 widely used datasets** for open-domain question answering and retrieval evaluation: `2WikiMultiHopQA`, `HotpotQA`, `Musique`, `PopQA`, `TriviaQA`, and `PubMedQA`.

Our evaluation code is at https://github.com/AQ-MedAI/RagQALeaderboard.
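The `configs` entries in the front matter pair each config name with a data file, and two pairings are easy to mistype: the config `trivialqa` points at `triviaqa.jsonl`, and `hotpotqa` points at `hot_distractor`-style file `hotpot_distractor.jsonl`. The lookup below is a minimal sketch that copies those pairs verbatim; the helper name `data_file_for` is hypothetical and not part of the dataset or its evaluation code.

```python
# Config names and data files, copied from the dataset card's front matter.
CONFIG_FILES = {
    "2wiki": "final_data/2wiki.jsonl",
    "hotpotqa": "final_data/hotpot_distractor.jsonl",
    "musique": "final_data/musique.jsonl",
    "popqa": "final_data/popqa.jsonl",
    "trivialqa": "final_data/triviaqa.jsonl",
    "pubmedqa": "final_data/pubmed.jsonl",
    "documents_pool": "final_data/documents_pool.jsonl",
}

REPO_ID = "AQ-MedAI/RAG-QA-Leaderboard"


def data_file_for(config_name: str) -> str:
    """Return the repository-relative data file for a config name (hypothetical helper)."""
    try:
        return CONFIG_FILES[config_name]
    except KeyError:
        known = ", ".join(sorted(CONFIG_FILES))
        raise ValueError(
            f"unknown config {config_name!r}; expected one of: {known}"
        ) from None
```

With the Hugging Face `datasets` library installed, passing one of these config names as the second argument of `load_dataset(REPO_ID, ...)` should select the matching file, although that call is not shown in the original card.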
# Leaderboard

Overall performance of different models on various tasks:

| Model | AVG | Multi-hop | Single-hop | Medical Domain |
|---------------------------------|------|-----------|------------|----------------|
| DeepSeekR1-0528 | **79.5** | 80 | 92.4 | 66 |
| GPT-4.1-2025-04-14 | 78.8 | 81.6 | 92.8 | 62 |
| Baichuan-M2-32B-Think | 77.6 | 79.9 | 95 | 57.8 |
| Meta-Llama-3-70B | 76.2 | 71.2 | 88.5 | **69** |
| Gemma-3-27B-Instruct | 74.8 | 71.8 | 93.3 | 59.2 |
| DeepSeek-V3.2-Exp | 74.3 | 75.1 | 91.8 | 56.2 |
| Qwen3_Next_80B_Instruct | 74.2 | **82.5** | **93.9** | 46.2 |
| Qwen3-235B-A22B-Instruct-2507 | 73.9 | 77.7 | 90.7 | 53.2 |
| Kimi-K2-Instruct | 72.2 | 76.1 | 90.3 | 50.2 |
| Qwen3-30B-A3B-Instruct-2507 | 72.0 | 73.0 | 90.5 | 52.4 |
| Meta-Llama-3-8B | 70.8 | 64.4 | 79.8 | 68.2 |
| Qwen3-235b-A22B-Nothink | 69.8 | 72.6 | 87.7 | 49 |
| PA-RAG_Meta-Llama-3-8B-Instruct | 65.5 | 60.2 | 74.9 | 61.4 |
| Gemma-3-12B-Instruct | 64.9 | 65.4 | 88.3 | 41 |
| Hunyuan 80B-A13B-Instruct | 63.8 | 68 | 85.3 | 38.2 |
| Qwen3-30B-A3B-Nothink | 63.2 | 63 | 88.3 | 38.4 |

Performance of different models on specific datasets:

| Model | 2wiki | hotpotqa | musique | single-hop | tqa | pqa | pubmedqa |
|---------------------------------|-------|----------|---------|------------|------|------|----------|
| DeepSeekR1-0528 | 87.4 | 83.2 | 69.4 | 92.4 | 93.8 | 91.0 | 66.0 |
| GPT-4.1-2025-04-14 | 88.8 | 83.0 | 72.9 | 92.8 | **95.5** | 90.1 | 62.0 |
| Baichuan-M2-32B-Think | 86.4 | **86.4** | 66.9 | 95.0 | 96.1 | **93.8** | 57.8 |
| Meta-Llama-3-70B | 80.3 | 76.7 | 56.7 | 88.5 | 94.0 | 83.0 | **69.0** |
| Gemma-3-27B-Instruct | 77.2 | 79.3 | 58.9 | 93.3 | 94.5 | 92.0 | 59.2 |
| DeepSeek-V3.2-Exp | 83.4 | 80.4 | 61.4 | 91.8 | 93.5 | 90.0 | 56.2 |
| Qwen3_Next_80B_Instruct | **92.5** | 84.6 | **70.4** | **93.9** | 95.0 | 92.7 | 46.2 |
| Qwen3-235B-A22B-Instruct-2507 | 84.9 | 82.8 | 65.3 | 90.7 | 93.8 | 87.6 | 53.2 |
| Kimi-K2-Instruct | 81.7 | 78.5 | 68.1 | 90.3 | 92.8 | 87.7 | 50.2 |
| Qwen3-30B-A3B-Instruct-2507 | 81.4 | 81.9 | 55.8 | 90.5 | 94.2 | 86.7 | 52.4 |
| Meta-Llama-3-8B | 61.5 | 63.6 | 68.2 | 79.8 | 88.7 | 70.9 | 68.2 |
| Qwen3-235b-A22B-Nothink | 81.5 | 77.0 | 59.2 | 87.7 | 93.3 | 82.0 | 49.0 |
| PA-RAG_Meta-Llama-3-8B-Instruct | 68.5 | 68.1 | 44.0 | 74.9 | 85.3 | 64.4 | 61.4 |
| Gemma-3-12B-Instruct | 72.5 | 73.9 | 49.8 | 88.3 | 92.3 | 84.2 | 41.0 |
| Hunyuan 80B-A13B-Instruct | 78.6 | 75.3 | 50.1 | 85.3 | 89.6 | 81.0 | 38.2 |
| Qwen3-30B-A3B-Nothink | 71.0 | 73.3 | 44.6 | 88.3 | 89.7 | 86.8 | 38.4 |

The Medical Domain category currently draws on very few datasets (only PubMedQA appears in the table above), so its scores may be noisy. We plan to add more medical datasets and to evaluate more models.

# Inference

## Installation

```
git clone https://github.com/AQ-MedAI/RagQALeaderboard
cd RagQALeaderboard/
pip install -r requirements.txt
# Make sure hf CLI is installed: pip install -U "huggingface_hub[cli]"
hf download AQ-MedAI/RAG-OmniQA --repo-type=dataset
```

## Run Evaluation

```
python eval.py --model-name "Qwen3" --model-path "/path/to/model" --eval-dataset hotpotqa popqa
```

**Customize Configuration**: You can modify the configuration files in the `config/` directory (e.g., `api_prompt_config_en.json`) to customize evaluation parameters.

**Generate Report**: After evaluation, HTML reports and JSON results are saved in the `reports/` directory.

For more details, please see our GitHub repository: https://github.com/AQ-MedAI/RagQALeaderboard.

## Dataset

Each dataset contains the following fields:

- `query`: The input question or query.
- `groundtruth`: The correct answer(s) to the query.
- `golden_docs`: Documents that contain the evidence or support for the correct answer.
- `noise_docs`: Distractor documents that are related to the query but do not contain the correct answer.
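As an illustration of how the four fields listed above might be used, the sketch below scores a retrieval ranking against `golden_docs` and checks a generated answer against `groundtruth`. It assumes that documents are plain strings and that `groundtruth` is either a string or a list of strings; neither is stated in the card. Both function names are hypothetical, and the substring match is a stand-in, not the metric behind the leaderboard numbers.

```python
from typing import Iterable, List, Union


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore trivial differences."""
    return " ".join(text.lower().split())


def golden_recall_at_k(retrieved: List[str], golden_docs: List[str], k: int) -> float:
    """Fraction of golden documents found among the top-k retrieved documents."""
    if not golden_docs:
        return 0.0
    top_k = {_normalize(doc) for doc in retrieved[:k]}
    hits = sum(1 for gold in golden_docs if _normalize(gold) in top_k)
    return hits / len(golden_docs)


def answer_matches(prediction: str, groundtruth: Union[str, Iterable[str]]) -> bool:
    """True if any non-empty reference answer occurs inside the prediction."""
    references = [groundtruth] if isinstance(groundtruth, str) else list(groundtruth)
    normalized_prediction = _normalize(prediction)
    return any(
        _normalize(ref) in normalized_prediction for ref in references if ref.strip()
    )


# Invented record for illustration only; it is not taken from the dataset.
record = {
    "query": "Who wrote Hamlet?",
    "groundtruth": ["William Shakespeare", "Shakespeare"],
    "golden_docs": ["Hamlet is a tragedy written by William Shakespeare."],
    "noise_docs": ["Macbeth is a tragedy set in Scotland."],
}

# A ranking that places the distractor first and the golden document second.
ranking = record["noise_docs"] + record["golden_docs"]
recall_at_1 = golden_recall_at_k(ranking, record["golden_docs"], k=1)
recall_at_2 = golden_recall_at_k(ranking, record["golden_docs"], k=2)
is_correct = answer_matches("It was written by Shakespeare.", record["groundtruth"])
```

The scoring actually used for the leaderboard is defined in the evaluation repository and may differ from this sketch.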

This structure enables evaluation of both retrieval accuracy and answer generation performance in multi-hop and single-hop reasoning scenarios.

## Document Pool

We also provide a unified `documents_pool` derived from Wikipedia, serving as a retrieval corpus. This pool has been pre-processed using **Contriever** for initial retrieval, making it efficient and convenient for training and evaluating retrieval models.

The document pool supports plug-and-play integration with standard retrieval and QA pipelines, allowing researchers to perform end-to-end experiments with minimal setup.

## Dataset Structure

The dataset files are located inside the `final_data` folder.

```text
.
├── final_data/
│   ├── 2wiki.jsonl
│   ├── documents_pool.json
│   ├── hotpot_distractor.jsonl
│   ├── musique.jsonl
│   ├── popqa.jsonl
│   ├── pubmed.jsonl
│   └── triviaqa.jsonl
└── README.md
```

## How to Use

Use the evaluation code in the repository below.

https://github.com/AQ-MedAI/RagQALeaderboard
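To inspect an evaluation file directly, without the evaluation repository, a minimal reader might look like the sketch below. It assumes each `.jsonl` file holds one JSON object per line carrying the four documented fields (`query`, `groundtruth`, `golden_docs`, `noise_docs`). The function name `iter_records` is hypothetical, and the field check is not meant for the document pool, whose schema the card does not describe.

```python
import json
from pathlib import Path
from typing import Dict, Iterator

# Field names taken from the "Dataset" section of the card.
REQUIRED_FIELDS = ("query", "groundtruth", "golden_docs", "noise_docs")


def iter_records(path: Path) -> Iterator[Dict]:
    """Yield one record per non-empty line, checking that the documented fields exist."""
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            record = json.loads(line)
            missing = [name for name in REQUIRED_FIELDS if name not in record]
            if missing:
                raise KeyError(
                    f"{path.name}:{line_number} is missing fields: {missing}"
                )
            yield record
```

After downloading the dataset, a call such as `iter_records(Path("final_data/hotpot_distractor.jsonl"))` would stream the HotpotQA records one at a time instead of loading the whole file into memory.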
