frames-benchmark
Source: ModelScope community · Updated 2026-05-08 · Indexed 2024-10-05
Download link:
https://modelscope.cn/datasets/google/frames-benchmark
Resource description:
# FRAMES: Factuality, Retrieval, And reasoning MEasurement Set
FRAMES is a comprehensive evaluation dataset designed to test the capabilities of Retrieval-Augmented Generation (RAG) systems across factuality, retrieval accuracy, and reasoning.
Our paper with details and experiments is available on arXiv: [https://arxiv.org/abs/2409.12941](https://arxiv.org/abs/2409.12941).
## Dataset Overview
- 824 challenging multi-hop questions requiring information from 2-15 Wikipedia articles
- Questions span diverse topics including history, sports, science, animals, health, etc.
- Each question is labeled with reasoning types: numerical, tabular, multiple constraints, temporal, and post-processing
- Gold answers and relevant Wikipedia articles provided for each question
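As a rough illustration of how these per-question annotations might be consumed, the sketch below tallies reasoning-type labels and counts linked articles over a few records. The field names (`question`, `answer`, `wiki_links`, `reasoning_types`) and the toy records are hypothetical placeholders, not the dataset's confirmed schema; check the actual column names after downloading.

```python
from collections import Counter

# Hypothetical toy records: field names and values are illustrative
# assumptions, not rows taken from FRAMES.
records = [
    {"question": "toy question A", "answer": "toy answer A",
     "wiki_links": ["https://en.wikipedia.org/wiki/A",
                    "https://en.wikipedia.org/wiki/B"],
     "reasoning_types": ["numerical", "temporal"]},
    {"question": "toy question B", "answer": "toy answer B",
     "wiki_links": ["https://en.wikipedia.org/wiki/C",
                    "https://en.wikipedia.org/wiki/D",
                    "https://en.wikipedia.org/wiki/E"],
     "reasoning_types": ["tabular"]},
    {"question": "toy question C", "answer": "toy answer C",
     "wiki_links": ["https://en.wikipedia.org/wiki/F",
                    "https://en.wikipedia.org/wiki/G"],
     "reasoning_types": ["numerical", "multiple constraints"]},
]


def count_reasoning_types(rows):
    """Count how often each reasoning-type label occurs across rows."""
    counts = Counter()
    for row in rows:
        counts.update(row["reasoning_types"])
    return counts


def articles_per_question(rows):
    """Return the number of linked Wikipedia articles for each row."""
    return [len(row["wiki_links"]) for row in rows]


type_counts = count_reasoning_types(records)
link_counts = articles_per_question(records)
print(type_counts.most_common())
print(link_counts)
```

A breakdown like this can help when reporting accuracy per reasoning type rather than only in aggregate.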
## Key Features
- Tests end-to-end RAG capabilities in a unified framework
- Requires integration of information from multiple sources
- Incorporates complex reasoning and temporal disambiguation
- Designed to be challenging for state-of-the-art language models
## Usage
This dataset can be used to:
- Evaluate RAG system performance
- Benchmark language model factuality and reasoning
- Develop and test multi-hop retrieval strategies
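A minimal evaluation harness for the first use case might look like the sketch below. It assumes a system can be modeled as a callable from question to answer string, and it scores answers with a crude normalized containment check. That scoring rule is a simple stand-in chosen for illustration, not the grading procedure used in the paper; the stub system and example pairs are hypothetical.

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def is_correct(prediction, gold):
    """True if the normalized gold answer appears as whole tokens in the prediction."""
    gold_n = normalize(gold)
    if not gold_n:
        return False
    # Pad with spaces so "4" does not match inside "14".
    return f" {gold_n} " in f" {normalize(prediction)} "


def accuracy(system, examples):
    """Fraction of (question, gold) pairs that `system` answers correctly."""
    if not examples:
        return 0.0
    hits = sum(is_correct(system(question), gold) for question, gold in examples)
    return hits / len(examples)


# Hypothetical stub standing in for a real RAG pipeline.
canned_answers = {
    "q1": "The answer is Paris.",
    "q2": "I believe it is 14.",
    "q3": "I do not know.",
}


def stub_system(question):
    return canned_answers[question]


examples = [("q1", "Paris"), ("q2", "14"), ("q3", "Blue")]
score = accuracy(stub_system, examples)
print(f"accuracy: {score:.3f}")
```

Containment matching is brittle for paraphrased or numeric answers, so a stricter or model-based grader may be preferable in practice.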
## Baseline Results
We provide baseline results using state-of-the-art models like Gemini-Pro-1.5-0514:
- Naive prompting: 40.8% accuracy
- BM25 retrieval (4 docs): 47.4% accuracy
- Oracle retrieval: 72.9% accuracy
- Multi-step retrieval & reasoning: 66% accuracy
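To make the BM25 baseline more concrete, here is a small self-contained sketch of Okapi BM25 scoring that returns the top 4 documents for a query. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the toy corpus are assumptions made for illustration; this is not the retrieval pipeline used to produce the numbers above.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an illustrative assumption)."""
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 index over a list of document strings."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        tokens = [tokenize(d) for d in docs]
        self.tf = [Counter(t) for t in tokens]
        self.doc_len = [len(t) for t in tokens]
        self.avgdl = sum(self.doc_len) / len(docs) if docs else 0.0
        # Document frequency: number of documents containing each term.
        df = Counter()
        for t in tokens:
            df.update(set(t))
        n = len(docs)
        self.idf = {
            term: math.log((n - f + 0.5) / (f + 0.5) + 1.0)
            for term, f in df.items()
        }

    def score(self, query, i):
        """BM25 score of document i for the query."""
        length_ratio = self.doc_len[i] / self.avgdl if self.avgdl else 0.0
        norm = self.k1 * (1 - self.b + self.b * length_ratio)
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def top_k(self, query, k=4):
        """Indices of the k highest-scoring documents, best first."""
        ranked = sorted(range(len(self.tf)),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]


# Toy corpus for illustration only.
docs = [
    "the eiffel tower is in paris",
    "bm25 is a ranking function used in search",
    "paris is the capital of france",
    "cats are small domesticated animals",
    "the tower of london is a castle",
]
index = BM25(docs)
top = index.top_k("paris tower", k=4)
print(top)
```

In a RAG setting, the retrieved documents would then be placed in the model's context alongside the question.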
## Citation
If you use this dataset in your research, please cite our paper:
```
@misc{krishna2024factfetchreasonunified,
title={Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation},
author={Satyapriya Krishna and Kalpesh Krishna and Anhad Mohananey and Steven Schwarcz and Adam Stambler and Shyam Upadhyay and Manaal Faruqui},
year={2024},
eprint={2409.12941},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.12941},
}
```
We hope FRAMES will be useful for advancing RAG systems and language model capabilities. For more details, please refer to our full paper.
Provider:
maas
Created:
2025-04-21