AQ-MedAI/RAG-QA-Leaderboard

Source: Hugging Face. Updated 2025-11-19. Indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/AQ-MedAI/RAG-QA-Leaderboard
Description:
---
license: apache-2.0
task_categories:
- question-answering
language:
- en
configs:
- config_name: 2wiki
  data_files: final_data/2wiki.jsonl
  description: "Evaluation set: 2WikiMultiHopQA"
- config_name: hotpotqa
  data_files: final_data/hotpot_distractor.jsonl
  description: "Evaluation set: HotpotQA (distractor setting)"
- config_name: musique
  data_files: final_data/musique.jsonl
  description: "Evaluation set: Musique"
- config_name: popqa
  data_files: final_data/popqa.jsonl
  description: "Evaluation set: PopQA"
- config_name: trivialqa
  data_files: final_data/triviaqa.jsonl
  description: "Evaluation set: TriviaQA"
- config_name: pubmedqa
  data_files: final_data/pubmed.jsonl
  description: "Evaluation set: PubMedQA"
- config_name: documents_pool
  data_files: final_data/documents_pool.jsonl
  description: "Document pool used for retrieval"
tags:
- rag
- medical
- ragbench
- hotpotqa
- 2wiki
- musique
- trivialqa
- popqa
- pubmedqa
---

# Dataset Description

This collection includes **6 widely-used datasets** for open-domain question answering and retrieval evaluation: `2WikiMultihopQA`, `HotpotQA`, `Musique`, `PopQA`, `TriviaQA`, and `PubMedQA`.

Our evaluation code is at https://github.com/AQ-MedAI/RagQALeaderboard.

# Leaderboard

Overall Performance of Different Models on Various Tasks:

| Model | AVG | Multi-hop | Single-hop | Medical Domain |
|-----------------------------------------|------|-----------|------------|----------------|
| DeepSeekR1-0528 | **79.5** | 80 | 92.4 | 66 |
| GPT-4.1-2025-04-14 | 78.8 | 81.6 | 92.8 | 62 |
| Baichuan-M2-32B-Think | 77.6 | 79.9 | 95 | 57.8 |
| Meta-Llama-3-70B | 76.2 | 71.2 | 88.5 | **69** |
| Gemma-3-27B-Instruct | 74.8 | 71.8 | 93.3 | 59.2 |
| DeepSeek-V3.2-Exp | 74.3 | 75.1 | 91.8 | 56.2 |
| Qwen3_Next_80B_Instruct | 74.2 | **82.5** | **93.9** | 46.2 |
| Qwen3-235B-A22B-Instruct-2507 | 73.9 | 77.7 | 90.7 | 53.2 |
| Kimi-K2-Instruct | 72.2 | 76.1 | 90.3 | 50.2 |
| Qwen3-30B-A3B-Instruct-2507 | 72.0 | 73.0 | 90.5 | 52.4 |
| Meta-Llama-3-8B | 70.8 | 64.4 | 79.8 | 68.2 |
| Qwen3-235b-A22B-Nothink | 69.8 | 72.6 | 87.7 | 49 |
| PA-RAG_Meta-Llama-3-8B-Instruct | 65.5 | 60.2 | 74.9 | 61.4 |
| Gemma-3-12B-Instruct | 64.9 | 65.4 | 88.3 | 41 |
| Hunyuan 80B-A13B-Instruct | 63.8 | 68 | 85.3 | 38.2 |
| Qwen3-30B-A3B-Nothink | 63.2 | 63 | 88.3 | 38.4 |

Performance of Different Models on Specific Datasets:

| Model | 2wiki | hotpotqa | musique | single-hop | triviaqa | popqa | pubmedqa |
|-------------------------------------|-------|----------|---------|------------|------|------|----------|
| DeepSeekR1-0528 | 87.4 | 83.2 | 69.4 | 92.4 | 93.8 | 91.0 | 66.0 |
| GPT-4.1-2025-04-14 | 88.8 | 83.0 | 72.9 | 92.8 | **95.5** | 90.1 | 62.0 |
| Baichuan-M2-32B-Think | 86.4 | **86.4** | 66.9 | 95.0 | 96.1 | **93.8** | 57.8 |
| Meta-Llama-3-70B | 80.3 | 76.7 | 56.7 | 88.5 | 94.0 | 83.0 | **69.0** |
| Gemma-3-27B-Instruct | 77.2 | 79.3 | 58.9 | 93.3 | 94.5 | 92.0 | 59.2 |
| DeepSeek-V3.2-Exp | 83.4 | 80.4 | 61.4 | 91.8 | 93.5 | 90.0 | 56.2 |
| Qwen3_Next_80B_Instruct | **92.5** | 84.6 | **70.4** | **93.9** | 95.0 | 92.7 | 46.2 |
| Qwen3-235B-A22B-Instruct-2507 | 84.9 | 82.8 | 65.3 | 90.7 | 93.8 | 87.6 | 53.2 |
| Kimi-K2-Instruct | 81.7 | 78.5 | 68.1 | 90.3 | 92.8 | 87.7 | 50.2 |
| Qwen3-30B-A3B-Instruct-2507 | 81.4 | 81.9 | 55.8 | 90.5 | 94.2 | 86.7 | 52.4 |
| Meta-Llama-3-8B | 61.5 | 63.6 | 68.2 | 79.8 | 88.7 | 70.9 | 68.2 |
| Qwen3-235b-A22B-Nothink | 81.5 | 77.0 | 59.2 | 87.7 | 93.3 | 82.0 | 49.0 |
| PA-RAG_Meta-Llama-3-8B-Instruct | 68.5 | 68.1 | 44.0 | 74.9 | 85.3 | 64.4 | 61.4 |
| Gemma-3-12B-Instruct | 72.5 | 73.9 | 49.8 | 88.3 | 92.3 | 84.2 | 41.0 |
| Hunyuan 80B-A13B-Instruct | 78.6 | 75.3 | 50.1 | 85.3 | 89.6 | 81.0 | 38.2 |
| Qwen3-30B-A3B-Nothink | 71.0 | 73.3 | 44.6 | 88.3 | 89.7 | 86.8 | 38.4 |

The Medical Domain score currently rests on a single dataset (PubMedQA), so it may be noisy. We plan to add more medical datasets and to continue evaluating more models.

# Inference

## Installation

```
git clone https://github.com/AQ-MedAI/RagQALeaderboard
cd RagQALeaderboard/
pip install -r requirements.txt
# Make sure hf CLI is installed: pip install -U "huggingface_hub[cli]"
hf download AQ-MedAI/RAG-OmniQA --repo-type=dataset
```

## Run Evaluation

```
python eval.py --model-name "Qwen3" --model-path "/path/to/model" --eval-dataset hotpotqa popqa
```

**Customize Configuration**: You can modify the configuration files in the `config/` directory (e.g., `api_prompt_config_en.json`) to customize evaluation parameters.

**Generate Report**: After evaluation, HTML reports and JSON results are saved in the `reports/` directory.

For more details, please see our GitHub repository: https://github.com/AQ-MedAI/RagQALeaderboard.

## Dataset

Each dataset contains the following fields:

- `query`: The input question or query.
- `groundtruth`: The correct answer(s) to the query.
- `golden_docs`: Documents that contain the evidence or support for the correct answer.
- `noise_docs`: Distractor documents that are related to the query but do not contain the correct answer.

This structure enables evaluation of both retrieval accuracy and answer generation performance in multi-hop and single-hop reasoning scenarios.

## Document Pool

We also provide a unified `documents_pool` derived from Wikipedia, serving as a retrieval corpus. This pool has been pre-processed using **Contriever** for initial retrieval, making it efficient and convenient for training and evaluating retrieval models.

The document pool supports plug-and-play integration with standard retrieval and QA pipelines, allowing researchers to perform end-to-end experiments with minimal setup.

## Dataset Structure

The dataset files are located inside the `final_data` folder.

```text
.
├── final_data/
│   ├── 2wiki.jsonl
│   ├── documents_pool.json
│   ├── hotpot_distractor.jsonl
│   ├── musique.jsonl
│   ├── popqa.jsonl
│   ├── pubmed.jsonl
│   └── triviaqa.jsonl
└── README.md
```

## How to Use

Use the evaluation code in the repository below.

https://github.com/AQ-MedAI/RagQALeaderboard
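
As a rough illustration of how the per-record fields (`query`, `groundtruth`, `golden_docs`, `noise_docs`) could be consumed outside the official evaluation code, here is a minimal sketch. It assumes that `golden_docs` and `noise_docs` are lists of plain strings and that `groundtruth` is a list of strings; the actual value types in the released files may differ. The sample record, the helper names (`read_jsonl`, `build_context`, `build_prompt`), and the prompt template are all hypothetical and are not taken from the RagQALeaderboard repository.

```python
import io
import json
import random

# Hypothetical sample record: the field names come from the README, but the
# value types (lists of plain strings) are an assumption for illustration.
SAMPLE_JSONL = """\
{"query": "Who wrote Hamlet?", "groundtruth": ["William Shakespeare"], "golden_docs": ["Hamlet is a tragedy written by William Shakespeare."], "noise_docs": ["Macbeth is a tragedy set in Scotland.", "The Globe Theatre was built in 1599."]}
"""


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def build_context(record, max_noise=2, seed=0):
    """Mix golden documents with up to max_noise distractors, then shuffle."""
    docs = list(record["golden_docs"]) + list(record["noise_docs"])[:max_noise]
    random.Random(seed).shuffle(docs)  # fixed seed keeps the order reproducible
    return docs


def build_prompt(record, docs):
    """Format a simple RAG prompt (hypothetical template, not the repo's)."""
    numbered = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    return f"Documents:\n{numbered}\n\nQuestion: {record['query']}\nAnswer:"


records = list(read_jsonl(io.StringIO(SAMPLE_JSONL)))
context = build_context(records[0])
prompt = build_prompt(records[0], context)
print(prompt)
```

To try this on the real data, replace `io.StringIO(SAMPLE_JSONL)` with an opened file such as `open("final_data/hotpot_distractor.jsonl", encoding="utf-8")`, and inspect one record first to confirm the field types before relying on the helpers.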
Provider:
AQ-MedAI