frames-benchmark

Name: frames-benchmark
Creator: maas
Published: 2026-05-08 16:29:09
License: No description provided

ModelScope Community; updated 2026-05-08; indexed 2024-10-05

Download link:

https://modelscope.cn/datasets/google/frames-benchmark

Resource description:

# FRAMES: Factuality, Retrieval, And reasoning MEasurement Set

FRAMES is a comprehensive evaluation dataset designed to test the capabilities of Retrieval-Augmented Generation (RAG) systems across factuality, retrieval accuracy, and reasoning. Our paper with details and experiments is available on arXiv: [https://arxiv.org/abs/2409.12941](https://arxiv.org/abs/2409.12941).

## Dataset Overview

- 824 challenging multi-hop questions requiring information from 2-15 Wikipedia articles
- Questions span diverse topics including history, sports, science, animals, health, etc.
- Each question is labeled with reasoning types: numerical, tabular, multiple constraints, temporal, and post-processing
- Gold answers and relevant Wikipedia articles provided for each question

## Key Features

- Tests end-to-end RAG capabilities in a unified framework
- Requires integration of information from multiple sources
- Incorporates complex reasoning and temporal disambiguation
- Designed to be challenging for state-of-the-art language models

## Usage

This dataset can be used to:

- Evaluate RAG system performance
- Benchmark language model factuality and reasoning
- Develop and test multi-hop retrieval strategies

## Baseline Results

We provide baseline results using state-of-the-art models like Gemini-Pro-1.5-0514:

- Naive prompting: 40.8% accuracy
- BM25 retrieval (4 docs): 47.4% accuracy
- Oracle retrieval: 72.9% accuracy
- Multi-step retrieval & reasoning: 66% accuracy

## Citation

If you use this dataset in your research, please cite our paper:

```
@misc{krishna2024factfetchreasonunified,
  title={Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation},
  author={Satyapriya Krishna and Kalpesh Krishna and Anhad Mohananey and Steven Schwarcz and Adam Stambler and Shyam Upadhyay and Manaal Faruqui},
  year={2024},
  eprint={2409.12941},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.12941},
}
```

We hope FRAMES will be useful for advancing RAG systems and language model capabilities. For more details, please refer to our full paper.
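The BM25 baseline listed under Baseline Results retrieves four documents per question before prompting the model. As a rough illustration of that retrieval step only, here is a minimal pure-Python sketch of Okapi BM25 ranking over a toy corpus. It is not the paper's pipeline: the function names `tokenize` and `bm25_top_k`, the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the sample documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems use richer tokenization)."""
    return text.lower().split()


def bm25_top_k(query, docs, k=4, k1=1.5, b=0.75):
    """Return the indices of the k documents with the highest BM25 score for the query.

    k defaults to 4 to mirror the "4 docs" setting named in the baseline list;
    k1 and b are common default values, not values taken from the paper.
    """
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Average document length; fall back to 1 if every document is empty.
    avg_len = sum(len(toks) for toks in tokenized) / n_docs or 1.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = set(tokenize(query))
    scored = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scored.append((score, idx))

    # Highest score first; ties broken by original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy corpus standing in for Wikipedia articles (hypothetical data).
docs = [
    "The Eiffel Tower is in Paris",
    "Paris is the capital of France",
    "Mount Everest is the highest mountain",
    "The Nile is a river in Africa",
    "Berlin is the capital of Germany",
]
top_docs = bm25_top_k("capital of France", docs, k=2)
print([docs[i] for i in top_docs])
```

In a FRAMES-style evaluation, the retrieved texts would be placed in the prompt alongside the question, and the model's answer compared with the gold answer. How that comparison is scored is described in the paper and is not reproduced here.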


Provider:

maas

Created:

2025-04-21

Collected summary

Dataset introduction