frames-benchmark

ModelScope Community | Updated 2026-05-08 | Indexed 2024-10-05
Download link:
https://modelscope.cn/datasets/google/frames-benchmark
Resource description:
# FRAMES: Factuality, Retrieval, And reasoning MEasurement Set

FRAMES is a comprehensive evaluation dataset designed to test the capabilities of Retrieval-Augmented Generation (RAG) systems across factuality, retrieval accuracy, and reasoning. Our paper with details and experiments is available on arXiv: [https://arxiv.org/abs/2409.12941](https://arxiv.org/abs/2409.12941).

## Dataset Overview

- 824 challenging multi-hop questions requiring information from 2-15 Wikipedia articles
- Questions span diverse topics including history, sports, science, animals, health, etc.
- Each question is labeled with reasoning types: numerical, tabular, multiple constraints, temporal, and post-processing
- Gold answers and relevant Wikipedia articles provided for each question

## Key Features

- Tests end-to-end RAG capabilities in a unified framework
- Requires integration of information from multiple sources
- Incorporates complex reasoning and temporal disambiguation
- Designed to be challenging for state-of-the-art language models

## Usage

This dataset can be used to:

- Evaluate RAG system performance
- Benchmark language model factuality and reasoning
- Develop and test multi-hop retrieval strategies

## Baseline Results

We provide baseline results using state-of-the-art models like Gemini-Pro-1.5-0514:

- Naive prompting: 40.8% accuracy
- BM25 retrieval (4 docs): 47.4% accuracy
- Oracle retrieval: 72.9% accuracy
- Multi-step retrieval & reasoning: 66% accuracy

## Citation

If you use this dataset in your research, please cite our paper:

```
@misc{krishna2024factfetchreasonunified,
  title={Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation},
  author={Satyapriya Krishna and Kalpesh Krishna and Anhad Mohananey and Steven Schwarcz and Adam Stambler and Shyam Upadhyay and Manaal Faruqui},
  year={2024},
  eprint={2409.12941},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.12941},
}
```

We hope FRAMES will be useful for advancing RAG systems and language model capabilities. For more details, please refer to our full paper.
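The baseline list mentions BM25 retrieval with 4 documents. As a rough illustration of what such a retrieval step involves, here is a minimal Okapi BM25 sketch in plain Python. It is not the paper's retrieval pipeline: the function names `tokenize` and `bm25_top_k`, the parameter values `k1=1.5` and `b=0.75`, and the simple regex tokenizer are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(query, docs, k=4, k1=1.5, b=0.75):
    """Return the indices of the k highest-scoring documents under Okapi BM25.

    k=4 mirrors the "4 docs" baseline setting; k1 and b are common defaults,
    not values taken from the FRAMES paper.
    """
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Guard against an all-empty corpus to avoid dividing by zero.
    avg_len = (sum(len(t) for t in tokenized) / n_docs) or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive for every term.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)

    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

In a RAG setup, the returned indices would select the passages placed in the model's context before it answers the question. A multi-hop question may need evidence that a single query does not surface, which is one plausible reason the oracle and multi-step settings score higher than single-shot BM25 in the baseline list.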

Provider:
maas
Created:
2025-04-21
Collected Summary
Dataset Introduction
Background and Challenges
Background Overview
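The FRAMES dataset contains 824 multi-hop questions covering history, sports, science, and other topics. Each question is labeled with its reasoning types and comes with a gold answer and the relevant Wikipedia articles. The dataset is designed to evaluate the end-to-end capabilities of RAG systems, including multi-source information integration and complex reasoning.
The content above was collected and summarized by 遇见数据集.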
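Because each question carries reasoning-type labels and a gold answer, results can be broken down per reasoning type. The sketch below shows one way to do that. The record keys `answer` and `reasoning_types` are hypothetical field names, not the dataset's actual schema, and normalized exact match is used only as a simple stand-in, since this page does not describe how the paper scores answers.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and strip punctuation so trivial formatting differences are ignored."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def accuracy_by_reasoning_type(records, predictions):
    """Compute accuracy per reasoning type.

    records: list of dicts with hypothetical keys "answer" (gold answer string)
             and "reasoning_types" (list of label strings).
    predictions: list of model answer strings, aligned with records.
    A question with several labels counts toward each of them.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record, prediction in zip(records, predictions):
        hit = normalize(prediction) == normalize(record["answer"])
        for reasoning_type in record["reasoning_types"]:
            totals[reasoning_type] += 1
            correct[reasoning_type] += int(hit)
    return {t: correct[t] / totals[t] for t in totals}


# Made-up toy records for illustration only; they are not drawn from FRAMES.
toy_records = [
    {"answer": "1969", "reasoning_types": ["numerical", "temporal"]},
    {"answer": "Marie Curie", "reasoning_types": ["multiple constraints"]},
    {"answer": "Blue whale", "reasoning_types": ["numerical"]},
]
toy_predictions = ["1969", "marie curie.", "elephant"]
breakdown = accuracy_by_reasoning_type(toy_records, toy_predictions)
```

Exact match undercounts answers that are correct but phrased differently, so a real evaluation would likely need a more tolerant comparison.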