financebench

Name: financebench
Creator: maas
Published: 2025-11-27 16:34:23
License: No description available

ModelScope community, updated 2025-11-27, indexed 2025-05-24

Download link:

https://modelscope.cn/datasets/PatronusAI/financebench

Resource description:

[FinanceBench](https://hf.co/papers/2311.11944) is a first-of-its-kind test suite for evaluating the performance of LLMs on open-book financial question answering (QA). This is an open-source sample of 150 annotated examples used in the evaluation and analysis of the models assessed in the FinanceBench paper. The PDFs linked in the dataset can also be found here: [https://github.com/patronus-ai/financebench/tree/main/pdfs](https://github.com/patronus-ai/financebench/tree/main/pdfs)

The dataset comprises questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are intended to be clear-cut and straightforward to answer, so that they serve as a minimum performance standard.

We test 16 state-of-the-art model configurations (including GPT-4-Turbo, Llama2 and Claude2, with vector stores and long-context prompts) on a sample of 150 cases from FinanceBench, and manually review their answers (n=2,400). The cases are available open-source. We find that existing LLMs have clear limitations for financial QA. All models assessed exhibit weaknesses, such as hallucinations, that limit their suitability for use by enterprises.

To evaluate your models on the full dataset, or if you have questions about this work, you can email us at contact@patronus.ai.
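The structure described above (a question about a company, an answer, and an evidence string) and the review count of n=2,400 can be illustrated with a minimal Python sketch. The class name `FinanceQACase`, its field names, the helper `reviews_needed`, and the placeholder record are all hypothetical and chosen for illustration only; they are not the dataset's actual schema or contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinanceQACase:
    """One open-book QA case.

    Field names are illustrative assumptions, not the dataset's real column names.
    """
    company: str
    question: str
    answer: str
    evidence: str  # passage from a filing that supports the answer


def reviews_needed(num_cases: int, num_model_configs: int) -> int:
    """Count manual reviews when every configuration answers every case once."""
    return num_cases * num_model_configs


# Placeholder record with made-up values, only to show the shape of a case.
example_case = FinanceQACase(
    company="ExampleCo",
    question="What was ExampleCo's FY2022 revenue?",
    answer="(placeholder answer)",
    evidence="(placeholder evidence string from the filing)",
)

if __name__ == "__main__":
    # 150 cases x 16 model configurations, as stated in the description.
    print(reviews_needed(150, 16))  # prints 2400
```

The count matches the description: 150 cases answered by each of 16 model configurations gives 2,400 answers to review by hand.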


Provider:

maas

Created:

2025-05-20

Collection summary

Dataset introduction